* [RFC bpf-next 0/5] xdp: hints via kfuncs
@ 2022-10-27 20:00 Stanislav Fomichev
  2022-10-27 20:00 ` [RFC bpf-next 1/5] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
                   ` (5 more replies)
  0 siblings, 6 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:00 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

This is an RFC for the alternative approach suggested by Martin and
Jakub. I've tried to CC most of the people from the original discussion,
feel free to add more if you think I've missed somebody.

Summary:
- add a new BPF_F_XDP_HAS_METADATA program flag and (ab)use
  attr->prog_ifindex to pass the target device ifindex at load time
- at load time, find the appropriate ndo_unroll_kfunc and call
  it to unroll/inline kfuncs; kfuncs have a default "safe"
  implementation if unrolling is not supported by a particular
  device
- rewrite the xskxceiver test to use a C BPF program and extend
  it to export rx_timestamp (plus add rx timestamp to the veth driver)

I've intentionally kept it small and hacky to see whether the approach is
workable or not.
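
To make the load-time part concrete, here is a minimal userspace
sketch (assuming libbpf's low-level bpf_prog_load(); "veth0" is a
placeholder and insns/insn_cnt come from the compiled XDP program):

  #include <net/if.h>  /* if_nametoindex */
  #include <bpf/bpf.h> /* bpf_prog_load, LIBBPF_OPTS */

  LIBBPF_OPTS(bpf_prog_load_opts, opts,
              .prog_flags = BPF_F_XDP_HAS_METADATA,
              /* (ab)used as the target metadata device */
              .prog_ifindex = if_nametoindex("veth0"));

  int prog_fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "rx", "GPL",
                              insns, insn_cnt, &opts);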

Pros:
- we avoid BTF complexity; the BPF programs themselves are now responsible
  for agreeing on the metadata layout with the AF_XDP consumer
- the metadata is free if not used
- the metadata should, in theory, be cheap if used; kfuncs should be
  unrolled to the same code as if the metadata was pre-populated and
  passed with a BTF id
- it's not all or nothing; users can use a small subset of metadata,
  which is more efficient than the BTF id approach where all metadata
  has to be exposed for every frame (and selectively consumed by the
  users)

Cons:
- forwarding has to be handled explicitly; the BPF programs have to
  agree on the metadata layout (IOW, the forwarding program
  has to be aware of the final AF_XDP consumer metadata layout;
  one way to share it is sketched after this list)
- the TX picture is not clear, but it's not clear with BTF ids either;
  I think we've agreed that just reusing whatever we have at RX
  won't fly at TX; seems like a TX XDP program might be the answer
  here? (with another set of TX kfuncs to "expose" bpf/af_xdp metadata
  back into the kernel)
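
One way to do the layout "agreement" mentioned above is a plain struct
in a header shared by the XDP program and the AF_XDP consumer (names
here are illustrative only; the kernel never interprets this layout):

  struct my_xdp_meta {
          __u32 rx_timestamp;
          /* extend here; the struct sits at data - sizeof(*meta),
           * since the data_meta area grows towards lower addresses
           */
  };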

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (5):
  bpf: Support inlined/unrolled kfuncs for xdp metadata
  veth: Support rx timestamp metadata for xdp
  libbpf: Pass prog_ifindex via bpf_object_open_opts
  selftests/bpf: Convert xskxceiver to use custom program
  selftests/bpf: Test rx_timestamp metadata in xskxceiver

 drivers/net/veth.c                            |  31 +++++
 include/linux/bpf.h                           |   1 +
 include/linux/btf.h                           |   1 +
 include/linux/btf_ids.h                       |   4 +
 include/linux/netdevice.h                     |   3 +
 include/net/xdp.h                             |  22 ++++
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/syscall.c                          |  28 ++++-
 kernel/bpf/verifier.c                         |  60 +++++++++
 net/core/dev.c                                |   7 ++
 net/core/xdp.c                                |  28 +++++
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf.h                        |   6 +-
 tools/testing/selftests/bpf/Makefile          |   1 +
 .../testing/selftests/bpf/progs/xskxceiver.c  |  43 +++++++
 tools/testing/selftests/bpf/xskxceiver.c      | 119 +++++++++++++++---
 tools/testing/selftests/bpf/xskxceiver.h      |   5 +-
 18 files changed, 348 insertions(+), 22 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xskxceiver.c

-- 
2.38.1.273.g43a17bfeac-goog



* [RFC bpf-next 1/5] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
@ 2022-10-27 20:00 ` Stanislav Fomichev
  2022-10-27 20:00 ` [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:00 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Kfuncs have to be defined with KF_UNROLL for an attempted unroll.
For now, only XDP programs can have their kfuncs unrolled, but
we can extend this later on if more program types would like to use it.

For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
implements all possible metadata kfuncs. Not all devices have to
implement them. If unrolling is not supported by the target device,
the default implementation is called instead. We might also unroll
this default implementation for performance's sake in the future, but
that's something I'm putting off for now.

Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
we treat prog_ifindex as the target device for kfunc unrolling.
net_device_ops gains a new ndo_unroll_kfunc which does the actual
dirty work per device.

The kfunc unrolling itself largely follows the existing map_gen_lookup
unrolling example, so there is nothing new here.
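
To spell out the driver contract (a sketch only; see the veth patch
for the real implementation): the hook gets the instruction holding
the kfunc call, may overwrite it in place with up to the verifier's
insn_buf capacity, and returns the number of instructions written,
0 for "not unrolled, keep the kfunc call", or negative on error:

  static int mydrv_unroll_kfunc(struct bpf_prog *prog,
                                struct bpf_insn *insn)
  {
          if (insn->imm == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_HAVE_RX_TIMESTAMP)) {
                  insn[0] = BPF_MOV64_IMM(BPF_REG_0, 1); /* return true */
                  return 1;
          }
          return 0; /* fall back to the default kfunc body */
  }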

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h            |  1 +
 include/linux/btf.h            |  1 +
 include/linux/btf_ids.h        |  4 +++
 include/linux/netdevice.h      |  3 ++
 include/net/xdp.h              | 22 +++++++++++++
 include/uapi/linux/bpf.h       |  5 +++
 kernel/bpf/syscall.c           | 28 +++++++++++++++-
 kernel/bpf/verifier.c          | 60 ++++++++++++++++++++++++++++++++++
 net/core/dev.c                 |  7 ++++
 net/core/xdp.c                 | 28 ++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++
 11 files changed, 163 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9fd68b0b3e9c..54983347e20e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1176,6 +1176,7 @@ struct bpf_prog_aux {
 		struct work_struct work;
 		struct rcu_head	rcu;
 	};
+	const struct net_device_ops *xdp_kfunc_ndo;
 };
 
 struct bpf_prog {
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f9aababc5d78..23ad5f8313e4 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -51,6 +51,7 @@
 #define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */
 #define KF_SLEEPABLE    (1 << 5) /* kfunc may sleep */
 #define KF_DESTRUCTIVE  (1 << 6) /* kfunc performs destructive actions */
+#define KF_UNROLL       (1 << 7) /* kfunc unrolling can be attempted */
 
 /*
  * Return the name of the passed struct, if exists, or halt the build if for
diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index c9744efd202f..eb448e9c79bb 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -195,6 +195,10 @@ asm(							\
 __BTF_ID_LIST(name, local)				\
 __BTF_SET8_START(name, local)
 
+#define BTF_SET8_START_GLOBAL(name)			\
+__BTF_ID_LIST(name, global)				\
+__BTF_SET8_START(name, global)
+
 #define BTF_SET8_END(name)				\
 asm(							\
 ".pushsection " BTF_IDS_SECTION ",\"a\";      \n"	\
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a36edb0ec199..90052271a502 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -73,6 +73,7 @@ struct udp_tunnel_info;
 struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
+struct bpf_insn;
 struct xdp_buff;
 
 void synchronize_net(void);
@@ -1609,6 +1610,8 @@ struct net_device_ops {
 	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
 						  const struct skb_shared_hwtstamps *hwtstamps,
 						  bool cycles);
+	int                     (*ndo_unroll_kfunc)(struct bpf_prog *prog,
+						    struct bpf_insn *insn);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 55dbc68bfffc..b465001936ac 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -7,6 +7,7 @@
 #define __LINUX_NET_XDP_H__
 
 #include <linux/skbuff.h> /* skb_shared_info */
+#include <linux/btf_ids.h> /* btf_id_set8 */
 
 /**
  * DOC: XDP RX-queue information
@@ -83,6 +84,7 @@ struct xdp_buff {
 	struct xdp_txq_info *txq;
 	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
 	u32 flags; /* supported values defined in xdp_buff_flags */
+	void *priv;
 };
 
 static __always_inline bool xdp_buff_has_frags(struct xdp_buff *xdp)
@@ -409,4 +411,24 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
+#define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_HAVE_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_have_rx_timestamp) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_rx_timestamp) \
+
+extern struct btf_id_set8 xdp_metadata_kfunc_ids;
+
+enum {
+#define XDP_METADATA_KFUNC(name, str) name,
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+MAX_XDP_METADATA_KFUNC,
+};
+
+static inline u32 xdp_metadata_kfunc_id(int id)
+{
+	return xdp_metadata_kfunc_ids.pairs[id].id;
+}
+
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 94659f6b3395..6938fc4f1ec5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 11df90962101..5376604961bc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2461,6 +2461,20 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 /* last field in 'union bpf_attr' used by this command */
 #define	BPF_PROG_LOAD_LAST_FIELD core_relo_rec_size
 
+static int xdp_resolve_netdev(struct bpf_prog *prog, int ifindex)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev;
+
+	for_each_netdev(net, dev) {
+		if (dev->ifindex == ifindex) {
+			prog->aux->xdp_kfunc_ndo = dev->netdev_ops;
+			return 0;
+		}
+	}
+	return -EINVAL;
+}
+
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 {
 	enum bpf_prog_type type = attr->prog_type;
@@ -2478,7 +2492,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_XDP_HAS_METADATA))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2566,6 +2581,17 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
+	if (attr->prog_flags & BPF_F_XDP_HAS_METADATA) {
+		/* Reuse prog_ifindex to carry request to unroll
+		 * metadata kfuncs.
+		 */
+		prog->aux->offload_requested = false;
+
+		err = xdp_resolve_netdev(prog, attr->prog_ifindex);
+		if (err < 0)
+			goto free_prog;
+	}
+
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
 		goto free_prog;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8f849a763b79..3e734e15ffa3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13864,6 +13864,37 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 	return err;
 }
 
+static int unroll_kfunc_call(struct bpf_verifier_env *env,
+			     struct bpf_insn *insn)
+{
+	enum bpf_prog_type prog_type;
+	struct bpf_prog_aux *aux;
+	struct btf *desc_btf;
+	u32 *kfunc_flags;
+	u32 func_id;
+
+	desc_btf = find_kfunc_desc_btf(env, insn->off);
+	if (IS_ERR(desc_btf))
+		return PTR_ERR(desc_btf);
+
+	prog_type = resolve_prog_type(env->prog);
+	func_id = insn->imm;
+
+	kfunc_flags = btf_kfunc_id_set_contains(desc_btf, prog_type, func_id);
+	if (!kfunc_flags)
+		return 0;
+	if (!(*kfunc_flags & KF_UNROLL))
+		return 0;
+	if (prog_type != BPF_PROG_TYPE_XDP)
+		return 0;
+
+	aux = env->prog->aux;
+	if (!aux->xdp_kfunc_ndo)
+		return 0;
+
+	return aux->xdp_kfunc_ndo->ndo_unroll_kfunc(env->prog, insn);
+}
+
 static int fixup_kfunc_call(struct bpf_verifier_env *env,
 			    struct bpf_insn *insn)
 {
@@ -14027,6 +14058,35 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 		if (insn->src_reg == BPF_PSEUDO_CALL)
 			continue;
 		if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
+			if (bpf_prog_is_dev_bound(env->prog->aux)) {
+				verbose(env, "no metadata kfuncs offload\n");
+				return -EINVAL;
+			}
+
+			insn_buf[0] = *insn;
+
+			cnt = unroll_kfunc_call(env, insn_buf);
+			if (cnt < 0) {
+				verbose(env, "failed to unroll kfunc with func_id=%d\n", insn->imm);
+				return cnt;
+			}
+			if (cnt > 0) {
+				if (cnt >= ARRAY_SIZE(insn_buf)) {
+					verbose(env, "bpf verifier is misconfigured (kfunc unroll)\n");
+					return -EINVAL;
+				}
+
+				new_prog = bpf_patch_insn_data(env, i + delta,
+							       insn_buf, cnt);
+				if (!new_prog)
+					return -ENOMEM;
+
+				delta    += cnt - 1;
+				env->prog = prog = new_prog;
+				insn      = new_prog->insnsi + i + delta;
+				continue;
+			}
+
 			ret = fixup_kfunc_call(env, insn);
 			if (ret)
 				return ret;
diff --git a/net/core/dev.c b/net/core/dev.c
index fa53830d0683..8d3e34d073a8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9254,6 +9254,13 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			return -EOPNOTSUPP;
 		}
 
+		if (new_prog &&
+		    new_prog->aux->xdp_kfunc_ndo &&
+		    new_prog->aux->xdp_kfunc_ndo != dev->netdev_ops) {
+			NL_SET_ERR_MSG(extack, "Target device was specified at load time; can only attach to the same device type");
+			return -EINVAL;
+		}
+
 		err = dev_xdp_install(dev, mode, bpf_op, extack, flags, new_prog);
 		if (err)
 			return err;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 844c9d99dc0e..ea4ff00cb22b 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
  */
 #include <linux/bpf.h>
+#include <linux/btf_ids.h>
 #include <linux/filter.h>
 #include <linux/types.h>
 #include <linux/mm.h>
@@ -709,3 +710,30 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 
 	return nxdpf;
 }
+
+noinline int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx)
+{
+	return false;
+}
+
+noinline __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx)
+{
+	return 0;
+}
+
+BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID_FLAGS(func, str, KF_UNROLL)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+BTF_SET8_END(xdp_metadata_kfunc_ids)
+
+static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xdp_metadata_kfunc_ids,
+};
+
+static int __init xdp_metadata_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
+}
+late_initcall(xdp_metadata_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 94659f6b3395..6938fc4f1ec5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
-- 
2.38.1.273.g43a17bfeac-goog



* [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp
  2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
  2022-10-27 20:00 ` [RFC bpf-next 1/5] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
@ 2022-10-27 20:00 ` Stanislav Fomichev
  2022-10-28  8:40   ` Jesper Dangaard Brouer
  2022-10-27 20:00 ` [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts Stanislav Fomichev
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:00 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

xskxceiver conveniently sets up veth pairs, so it seems logical
to use veth as an example for some of the metadata handling.

We timestamp the skb right when we "receive" it, store its
pointer in xdp_buff->priv, and generate BPF bytecode to
reach it from the BPF program.

This largely follows the idea of "store some queue context in
the xdp_buff/xdp_frame so the metadata can be reached
from the BPF program".
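
For readability, the bytecode generated below for the rx-timestamp
kfunc corresponds roughly to this C (a sketch of the semantics only;
the kfunc's __u32 return type truncates the ktime_t):

  __u32 rx_timestamp(struct xdp_buff *xdp)
  {
          struct sk_buff *skb = xdp->priv;

          return skb ? skb->tstamp : 0;
  }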

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 09682ea3354e..35396dd73de0 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -597,6 +597,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 
 		xdp_convert_frame_to_buff(frame, &xdp);
 		xdp.rxq = &rq->xdp_rxq;
+		xdp.priv = NULL;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
@@ -820,6 +821,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 	orig_data = xdp.data;
 	orig_data_end = xdp.data_end;
+	xdp.priv = skb;
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
@@ -936,6 +938,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct sk_buff *skb = ptr;
 
 			stats->xdp_bytes += skb->len;
+			__net_timestamp(skb);
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -1595,6 +1598,33 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int veth_unroll_kfunc(struct bpf_prog *prog, struct bpf_insn *insn)
+{
+	u32 func_id = insn->imm;
+
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_HAVE_RX_TIMESTAMP)) {
+		/* return true; */
+		insn[0] = BPF_MOV64_IMM(BPF_REG_0, 1);
+		return 1;
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		/* r1 = ((struct xdp_buff *)r1)->priv; [skb] */
+		insn[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1,
+				      offsetof(struct xdp_buff, priv));
+		/* if (r1 == NULL) { */
+		insn[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1);
+		/*	return 0; */
+		insn[2] = BPF_MOV64_IMM(BPF_REG_0, 0);
+		/* } else { */
+		/*	return ((struct sk_buff *)r1)->tstamp; */
+		insn[3] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+				      offsetof(struct sk_buff, tstamp));
+		/* } */
+		return 4;
+	}
+
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1614,6 +1644,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_unroll_kfunc       = veth_unroll_kfunc,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.38.1.273.g43a17bfeac-goog



* [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts
  2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
  2022-10-27 20:00 ` [RFC bpf-next 1/5] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
  2022-10-27 20:00 ` [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-10-27 20:00 ` Stanislav Fomichev
  2022-10-27 20:05   ` Andrii Nakryiko
  2022-10-27 20:00 ` [RFC bpf-next 4/5] selftests/bpf: Convert xskxceiver to use custom program Stanislav Fomichev
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:00 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Allow passing prog_ifindex to BPF_PROG_LOAD. This patch is
not XDP metadata specific, but it's here because we (ab)use
prog_ifindex as the "target metadata" device during loading.
We can figure out a proper UAPI story if we decide to go forward
with the kfunc approach.
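
With this patch, usage looks something like the following (a hedged
sketch; the last patch in the series does the same thing via the
generated skeleton, and "veth0" is a placeholder):

  LIBBPF_OPTS(bpf_object_open_opts, open_opts,
              .prog_ifindex = if_nametoindex("veth0"));

  struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", &open_opts);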

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/lib/bpf/libbpf.c | 1 +
 tools/lib/bpf/libbpf.h | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5d7819edf074..61bc37006fe4 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -7190,6 +7190,7 @@ static int bpf_object_init_progs(struct bpf_object *obj, const struct bpf_object
 
 		prog->type = prog->sec_def->prog_type;
 		prog->expected_attach_type = prog->sec_def->expected_attach_type;
+		prog->prog_ifindex = opts->prog_ifindex;
 
 		/* sec_def can have custom callback which should be called
 		 * after bpf_program is initialized to adjust its properties
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index eee883f007f9..4a40b7623099 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -170,9 +170,13 @@ struct bpf_object_open_opts {
 	 */
 	__u32 kernel_log_level;
 
+	/* Optional ifindex of netdev for offload purposes.
+	 */
+	int prog_ifindex;
+
 	size_t :0;
 };
-#define bpf_object_open_opts__last_field kernel_log_level
+#define bpf_object_open_opts__last_field prog_ifindex
 
 LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
 
-- 
2.38.1.273.g43a17bfeac-goog



* [RFC bpf-next 4/5] selftests/bpf: Convert xskxceiver to use custom program
  2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2022-10-27 20:00 ` [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts Stanislav Fomichev
@ 2022-10-27 20:00 ` Stanislav Fomichev
  2022-10-27 20:00 ` [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver Stanislav Fomichev
  2022-10-28 15:58 ` [RFC bpf-next 0/5] xdp: hints via kfuncs John Fastabend
  5 siblings, 0 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:00 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

No functional changes (in theory): convert the libxsk-generated
program bytecode to C code to better illustrate kfunc
metadata (see the next patch in the series).

There are also a bunch of unrelated changes; ignore them for the
sake of the demo:
- stats.rx_dropped == 2048 vs 2047 ?
- buggy ksft_print_msg calls
- the test is limited to
  TEST_MODE_DRV+TEST_TYPE_RUN_TO_COMPLETION_SINGLE_PKT

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/Makefile          |  1 +
 .../testing/selftests/bpf/progs/xskxceiver.c  | 21 ++++
 tools/testing/selftests/bpf/xskxceiver.c      | 98 +++++++++++++++----
 tools/testing/selftests/bpf/xskxceiver.h      |  5 +-
 4 files changed, 105 insertions(+), 20 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xskxceiver.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 79edef1dbda4..3cab2e1b0e74 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -378,6 +378,7 @@ linked_maps.skel.h-deps := linked_maps1.bpf.o linked_maps2.bpf.o
 test_subskeleton.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o test_subskeleton.bpf.o
 test_subskeleton_lib.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o
 test_usdt.skel.h-deps := test_usdt.bpf.o test_usdt_multispec.bpf.o
+xskxceiver-deps := xskxceiver.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c b/tools/testing/selftests/bpf/progs/xskxceiver.c
new file mode 100644
index 000000000000..b135daddad3a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 4);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 681a5db80dae..066bd691db13 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -399,6 +399,58 @@ static void usage(const char *prog)
 	ksft_print_msg(str, prog);
 }
 
+static void bpf_update_xsk_map(struct ifobject *ifobject, __u32 queue_id)
+{
+	int map_fd;
+	int sock_fd;
+	int ret;
+
+	map_fd = bpf_map__fd(ifobject->bpf_obj->maps.xsk);
+	sock_fd = xsk_socket__fd(ifobject->xsk->xsk);
+
+	(void)bpf_map_delete_elem(map_fd, &queue_id);
+	ret = bpf_map_update_elem(map_fd, &queue_id, &sock_fd, 0);
+	if (ret)
+		exit_with_error(-ret);
+}
+
+static int bpf_attach(struct ifobject *ifobject)
+{
+	__u32 prog_id = 0;
+
+	bpf_xdp_query_id(ifobject->ifindex, ifobject->xdp_flags, &prog_id);
+	if (prog_id)
+		return 0;
+
+	int ret = bpf_xdp_attach(ifobject->ifindex,
+				 bpf_program__fd(ifobject->bpf_obj->progs.rx),
+				 ifobject->xdp_flags, NULL);
+	if (ret < 0) {
+		if (errno != EEXIST && errno != EBUSY) {
+			exit_with_error(errno);
+		}
+	}
+
+	bpf_update_xsk_map(ifobject, 0);
+
+	return 0;
+}
+
+static void bpf_detach(struct ifobject *ifobject)
+{
+	int my_ns = open("/proc/self/ns/net", O_RDONLY);
+
+	/* Make sure we're in the right namespace when detaching.
+	 * Relevant only for TEST_TYPE_BIDI.
+	 */
+	if (ifobject->ns_fd > 0)
+		setns(ifobject->ns_fd, 0);
+
+	bpf_xdp_detach(ifobject->ifindex, ifobject->xdp_flags, NULL);
+
+	setns(my_ns, 0);
+}
+
 static int switch_namespace(const char *nsname)
 {
 	char fqns[26] = "/var/run/netns/";
@@ -1141,9 +1193,10 @@ static int validate_rx_dropped(struct ifobject *ifobject)
 	if (err)
 		return TEST_FAILURE;
 
-	if (stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2)
+	if (stats.rx_dropped == ifobject->pkt_stream->nb_pkts / 2 - 1)
 		return TEST_PASS;
 
+	printf("%lld != %d\n", stats.rx_dropped, ifobject->pkt_stream->nb_pkts / 2 - 1);
 	return TEST_FAILURE;
 }
 
@@ -1239,7 +1292,6 @@ static void thread_common_ops_tx(struct test_spec *test, struct ifobject *ifobje
 {
 	xsk_configure_socket(test, ifobject, test->ifobj_rx->umem, true);
 	ifobject->xsk = &ifobject->xsk_arr[0];
-	ifobject->xsk_map_fd = test->ifobj_rx->xsk_map_fd;
 	memcpy(ifobject->umem, test->ifobj_rx->umem, sizeof(struct xsk_umem_info));
 }
 
@@ -1284,6 +1336,14 @@ static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 
 	ifobject->ns_fd = switch_namespace(ifobject->nsname);
 
+	ifindex = if_nametoindex(ifobject->ifname);
+	if (!ifindex)
+		exit_with_error(errno);
+
+	ifobject->bpf_obj = xskxceiver__open_and_load();
+	if (libbpf_get_error(ifobject->bpf_obj))
+		exit_with_error(libbpf_get_error(ifobject->bpf_obj));
+
 	if (ifobject->umem->unaligned_mode)
 		mmap_flags |= MAP_HUGETLB;
 
@@ -1307,11 +1367,8 @@ static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 	if (!ifobject->rx_on)
 		return;
 
-	ifindex = if_nametoindex(ifobject->ifname);
-	if (!ifindex)
-		exit_with_error(errno);
-
-	ret = xsk_setup_xdp_prog_xsk(ifobject->xsk->xsk, &ifobject->xsk_map_fd);
+	ifobject->ifindex = ifindex;
+	ret = bpf_attach(ifobject);
 	if (ret)
 		exit_with_error(-ret);
 
@@ -1321,19 +1378,17 @@ static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 
 	if (ifobject->xdp_flags & XDP_FLAGS_SKB_MODE) {
 		if (opts.attach_mode != XDP_ATTACHED_SKB) {
-			ksft_print_msg("ERROR: [%s] XDP prog not in SKB mode\n");
+			ksft_print_msg("ERROR: XDP prog not in SKB mode\n");
 			exit_with_error(-EINVAL);
 		}
 	} else if (ifobject->xdp_flags & XDP_FLAGS_DRV_MODE) {
 		if (opts.attach_mode != XDP_ATTACHED_DRV) {
-			ksft_print_msg("ERROR: [%s] XDP prog not in DRV mode\n");
+			ksft_print_msg("ERROR: XDP prog not in DRV mode\n");
 			exit_with_error(-EINVAL);
 		}
 	}
 
-	ret = xsk_socket__update_xskmap(ifobject->xsk->xsk, ifobject->xsk_map_fd);
-	if (ret)
-		exit_with_error(-ret);
+	bpf_update_xsk_map(ifobject, 0);
 }
 
 static void *worker_testapp_validate_tx(void *arg)
@@ -1372,8 +1427,7 @@ static void *worker_testapp_validate_rx(void *arg)
 	if (test->current_step == 1) {
 		thread_common_ops(test, ifobject);
 	} else {
-		bpf_map_delete_elem(ifobject->xsk_map_fd, &id);
-		xsk_socket__update_xskmap(ifobject->xsk->xsk, ifobject->xsk_map_fd);
+		bpf_update_xsk_map(ifobject, id);
 	}
 
 	fds.fd = xsk_socket__fd(ifobject->xsk->xsk);
@@ -1481,6 +1535,8 @@ static int testapp_validate_traffic(struct test_spec *test)
 	if (test->total_steps == test->current_step || test->fail) {
 		xsk_socket__delete(ifobj_tx->xsk->xsk);
 		xsk_socket__delete(ifobj_rx->xsk->xsk);
+		bpf_detach(ifobj_tx);
+		bpf_detach(ifobj_rx);
 		testapp_clean_xsk_umem(ifobj_rx);
 		if (!ifobj_tx->shared_umem)
 			testapp_clean_xsk_umem(ifobj_tx);
@@ -1531,16 +1587,12 @@ static void testapp_bidi(struct test_spec *test)
 
 static void swap_xsk_resources(struct ifobject *ifobj_tx, struct ifobject *ifobj_rx)
 {
-	int ret;
-
 	xsk_socket__delete(ifobj_tx->xsk->xsk);
 	xsk_socket__delete(ifobj_rx->xsk->xsk);
 	ifobj_tx->xsk = &ifobj_tx->xsk_arr[1];
 	ifobj_rx->xsk = &ifobj_rx->xsk_arr[1];
 
-	ret = xsk_socket__update_xskmap(ifobj_rx->xsk->xsk, ifobj_rx->xsk_map_fd);
-	if (ret)
-		exit_with_error(-ret);
+	bpf_update_xsk_map(ifobj_tx, 0);
 }
 
 static void testapp_bpf_res(struct test_spec *test)
@@ -1635,6 +1687,8 @@ static bool testapp_unaligned(struct test_spec *test)
 {
 	if (!hugepages_present(test->ifobj_tx)) {
 		ksft_test_result_skip("No 2M huge pages present.\n");
+		bpf_detach(test->ifobj_tx);
+		bpf_detach(test->ifobj_rx);
 		return false;
 	}
 
@@ -1947,10 +2001,16 @@ int main(int argc, char **argv)
 
 	for (i = 0; i < modes; i++)
 		for (j = 0; j < TEST_TYPE_MAX; j++) {
+			if (j != TEST_TYPE_RUN_TO_COMPLETION_SINGLE_PKT) continue; // XXX
+			if (i != TEST_MODE_DRV) continue; // XXX
+
 			test_spec_init(&test, ifobj_tx, ifobj_rx, i);
 			run_pkt_test(&test, i, j);
 			usleep(USLEEP_MAX);
 
+			xskxceiver__destroy(ifobj_tx->bpf_obj);
+			xskxceiver__destroy(ifobj_rx->bpf_obj);
+
 			if (test.fail)
 				failed_tests++;
 		}
diff --git a/tools/testing/selftests/bpf/xskxceiver.h b/tools/testing/selftests/bpf/xskxceiver.h
index edb76d2def9f..c27dcbdb030f 100644
--- a/tools/testing/selftests/bpf/xskxceiver.h
+++ b/tools/testing/selftests/bpf/xskxceiver.h
@@ -5,6 +5,8 @@
 #ifndef XSKXCEIVER_H_
 #define XSKXCEIVER_H_
 
+#include "xskxceiver.skel.h"
+
 #ifndef SOL_XDP
 #define SOL_XDP 283
 #endif
@@ -134,6 +136,7 @@ typedef void *(*thread_func_t)(void *arg);
 struct ifobject {
 	char ifname[MAX_INTERFACE_NAME_CHARS];
 	char nsname[MAX_INTERFACES_NAMESPACE_CHARS];
+	struct xskxceiver *bpf_obj;
 	struct xsk_socket_info *xsk;
 	struct xsk_socket_info *xsk_arr;
 	struct xsk_umem_info *umem;
@@ -141,7 +144,7 @@ struct ifobject {
 	validation_func_t validation_func;
 	struct pkt_stream *pkt_stream;
 	int ns_fd;
-	int xsk_map_fd;
+	int ifindex;
 	u32 dst_ip;
 	u32 src_ip;
 	u32 xdp_flags;
-- 
2.38.1.273.g43a17bfeac-goog



* [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2022-10-27 20:00 ` [RFC bpf-next 4/5] selftests/bpf: Convert xskxceiver to use custom program Stanislav Fomichev
@ 2022-10-27 20:00 ` Stanislav Fomichev
  2022-10-28  6:22   ` Martin KaFai Lau
  2022-10-28 15:58 ` [RFC bpf-next 0/5] xdp: hints via kfuncs John Fastabend
  5 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:00 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Example on how the metadata is prepared from the BPF context
and consumed by AF_XDP:

- bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
  if not, I'm assuming the verifier will remove this "if (0)" branch
- bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
  the program has to bpf_xdp_adjust_meta+memcpy it;
  maybe returning a pointer is better?
- af_xdp consumer grabs it from data-<expected_metadata_offset> and
  makes sure the timestamp is not empty (see sketch below)
- when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
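
A minimal sketch of that consumer-side read (mirroring the
is_pkt_valid() change below; the sizeof(__u32) offset is simply the
layout both sides agreed on):

  /* desc is an xdp_desc pulled from the AF_XDP RX ring */
  void *data = xsk_umem__get_data(umem_area, desc->addr);
  __u32 rx_timestamp;

  memcpy(&rx_timestamp, data - sizeof(__u32), sizeof(__u32));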

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
 tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c b/tools/testing/selftests/bpf/progs/xskxceiver.c
index b135daddad3a..83c879aa3581 100644
--- a/tools/testing/selftests/bpf/progs/xskxceiver.c
+++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
@@ -12,9 +12,31 @@ struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx) __ksym;
+extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
+
 SEC("xdp")
 int rx(struct xdp_md *ctx)
 {
+	void *data, *data_meta;
+	__u32 rx_timestamp;
+	int ret;
+
+	if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {
+		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));
+		if (ret != 0)
+			return XDP_DROP;
+
+		data = (void *)(long)ctx->data;
+		data_meta = (void *)(long)ctx->data_meta;
+
+		if (data_meta + sizeof(__u32) > data)
+			return XDP_DROP;
+
+		rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+		__builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
+	}
+
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
index 066bd691db13..ce82c89a432e 100644
--- a/tools/testing/selftests/bpf/xskxceiver.c
+++ b/tools/testing/selftests/bpf/xskxceiver.c
@@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info *umem, struct pkt_stream *pkt
 static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
 {
 	void *data = xsk_umem__get_data(buffer, addr);
+	void *data_meta = data - sizeof(__u32);
 	struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct ethhdr));
+	__u32 rx_timestamp = 0;
 
 	if (!pkt) {
 		ksft_print_msg("[%s] too many packets received\n", __func__);
@@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
 		return false;
 	}
 
+	memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));
+	if (rx_timestamp == 0) {
+		ksft_print_msg("Invalid metadata received: ");
+		ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
+		return false;
+	}
+
 	return true;
 }
 
@@ -1331,6 +1340,7 @@ static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 	u64 umem_sz = ifobject->umem->num_frames * ifobject->umem->frame_size;
 	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
 	LIBBPF_OPTS(bpf_xdp_query_opts, opts);
+	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
 	int ret, ifindex;
 	void *bufs;
 
@@ -1340,10 +1350,21 @@ static void thread_common_ops(struct test_spec *test, struct ifobject *ifobject)
 	if (!ifindex)
 		exit_with_error(errno);
 
-	ifobject->bpf_obj = xskxceiver__open_and_load();
+	open_opts.prog_ifindex = ifindex;
+
+	ifobject->bpf_obj = xskxceiver__open_opts(&open_opts);
 	if (libbpf_get_error(ifobject->bpf_obj))
 		exit_with_error(libbpf_get_error(ifobject->bpf_obj));
 
+	ret = bpf_program__set_flags(bpf_object__find_program_by_name(ifobject->bpf_obj->obj, "rx"),
+				     BPF_F_XDP_HAS_METADATA);
+	if (ret < 0)
+		exit_with_error(ret);
+
+	ret = xskxceiver__load(ifobject->bpf_obj);
+	if (ret < 0)
+		exit_with_error(ret);
+
 	if (ifobject->umem->unaligned_mode)
 		mmap_flags |= MAP_HUGETLB;
 
-- 
2.38.1.273.g43a17bfeac-goog



* Re: [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts
  2022-10-27 20:00 ` [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts Stanislav Fomichev
@ 2022-10-27 20:05   ` Andrii Nakryiko
  2022-10-27 20:10     ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Andrii Nakryiko @ 2022-10-27 20:05 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Thu, Oct 27, 2022 at 1:00 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Allow passing prog_ifindex to BPF_PROG_LOAD. This patch is
> not XDP metadata specific but it's here because we (ab)use
> prog_ifindex as "target metadata" device during loading.
> We can figure out proper UAPI story if we decide to go forward
> with the kfunc approach.
>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  tools/lib/bpf/libbpf.c | 1 +
>  tools/lib/bpf/libbpf.h | 6 +++++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 5d7819edf074..61bc37006fe4 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -7190,6 +7190,7 @@ static int bpf_object_init_progs(struct bpf_object *obj, const struct bpf_object
>
>                 prog->type = prog->sec_def->prog_type;
>                 prog->expected_attach_type = prog->sec_def->expected_attach_type;
> +               prog->prog_ifindex = opts->prog_ifindex;
>
>                 /* sec_def can have custom callback which should be called
>                  * after bpf_program is initialized to adjust its properties
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index eee883f007f9..4a40b7623099 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -170,9 +170,13 @@ struct bpf_object_open_opts {
>          */
>         __u32 kernel_log_level;
>
> +       /* Optional ifindex of netdev for offload purposes.
> +        */
> +       int prog_ifindex;
> +

nope, don't do that, open_opts are for the entire object, while this is
a per-program thing

So a bpf_program__set_ifindex() setter would be more appropriate


>         size_t :0;
>  };
> -#define bpf_object_open_opts__last_field kernel_log_level
> +#define bpf_object_open_opts__last_field prog_ifindex
>
>  LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
>
> --
> 2.38.1.273.g43a17bfeac-goog
>


* Re: [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts
  2022-10-27 20:05   ` Andrii Nakryiko
@ 2022-10-27 20:10     ` Stanislav Fomichev
  0 siblings, 0 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-27 20:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Thu, Oct 27, 2022 at 1:05 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Oct 27, 2022 at 1:00 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Allow passing prog_ifindex to BPF_PROG_LOAD. This patch is
> > not XDP metadata specific but it's here because we (ab)use
> > prog_ifindex as "target metadata" device during loading.
> > We can figure out proper UAPI story if we decide to go forward
> > with the kfunc approach.
> >
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  tools/lib/bpf/libbpf.c | 1 +
> >  tools/lib/bpf/libbpf.h | 6 +++++-
> >  2 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index 5d7819edf074..61bc37006fe4 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -7190,6 +7190,7 @@ static int bpf_object_init_progs(struct bpf_object *obj, const struct bpf_object
> >
> >                 prog->type = prog->sec_def->prog_type;
> >                 prog->expected_attach_type = prog->sec_def->expected_attach_type;
> > +               prog->prog_ifindex = opts->prog_ifindex;
> >
> >                 /* sec_def can have custom callback which should be called
> >                  * after bpf_program is initialized to adjust its properties
> > diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> > index eee883f007f9..4a40b7623099 100644
> > --- a/tools/lib/bpf/libbpf.h
> > +++ b/tools/lib/bpf/libbpf.h
> > @@ -170,9 +170,13 @@ struct bpf_object_open_opts {
> >          */
> >         __u32 kernel_log_level;
> >
> > +       /* Optional ifindex of netdev for offload purposes.
> > +        */
> > +       int prog_ifindex;
> > +
>
> nope, don't do that, open_opts are for entire object, while this is
> per-program thing
>
> So bpf_program__set_ifindex() setter would be more appropriate

Oh, doh, not sure how I missed that. Thanks!

>
> >         size_t :0;
> >  };
> > -#define bpf_object_open_opts__last_field kernel_log_level
> > +#define bpf_object_open_opts__last_field prog_ifindex
> >
> >  LIBBPF_API struct bpf_object *bpf_object__open(const char *path);
> >
> > --
> > 2.38.1.273.g43a17bfeac-goog
> >


* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-27 20:00 ` [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver Stanislav Fomichev
@ 2022-10-28  6:22   ` Martin KaFai Lau
  2022-10-28 10:37     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Martin KaFai Lau @ 2022-10-28  6:22 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> Example on how the metadata is prepared from the BPF context
> and consumed by AF_XDP:
> 
> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
>    if not, I'm assuming the verifier will remove this "if (0)" branch
> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
>    the program has to bpf_xdp_adjust_meta+memcpy it;
>    maybe returning a pointer is better?
> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
>    makes sure timestamp is not empty
> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> 
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
>   2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c b/tools/testing/selftests/bpf/progs/xskxceiver.c
> index b135daddad3a..83c879aa3581 100644
> --- a/tools/testing/selftests/bpf/progs/xskxceiver.c
> +++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
> @@ -12,9 +12,31 @@ struct {
>   	__type(value, __u32);
>   } xsk SEC(".maps");
>   
> +extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx) __ksym;
> +extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> +
>   SEC("xdp")
>   int rx(struct xdp_md *ctx)
>   {
> +	void *data, *data_meta;
> +	__u32 rx_timestamp;
> +	int ret;
> +
> +	if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {
> +		ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));
> +		if (ret != 0)
> +			return XDP_DROP;
> +
> +		data = (void *)(long)ctx->data;
> +		data_meta = (void *)(long)ctx->data_meta;
> +
> +		if (data_meta + sizeof(__u32) > data)
> +			return XDP_DROP;
> +
> +		rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> +		__builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> +	}

Thanks for the patches.  I took a quick look at patches 1 and 2 but
haven't had a chance to look at the implementation details
(e.g. KF_UNROLL, etc.) yet.

Overall (with the example here) this looks promising.  There is a lot
of flexibility on whether the xdp prog needs any hint at all, which
hint it needs, and how to store it.

> +
>   	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>   }
>   
> diff --git a/tools/testing/selftests/bpf/xskxceiver.c b/tools/testing/selftests/bpf/xskxceiver.c
> index 066bd691db13..ce82c89a432e 100644
> --- a/tools/testing/selftests/bpf/xskxceiver.c
> +++ b/tools/testing/selftests/bpf/xskxceiver.c
> @@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info *umem, struct pkt_stream *pkt
>   static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
>   {
>   	void *data = xsk_umem__get_data(buffer, addr);
> +	void *data_meta = data - sizeof(__u32);
>   	struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct ethhdr));
> +	__u32 rx_timestamp = 0;
>   
>   	if (!pkt) {
>   		ksft_print_msg("[%s] too many packets received\n", __func__);
> @@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, u32 len)
>   		return false;
>   	}
>   
> +	memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));
> +	if (rx_timestamp == 0) {
> +		ksft_print_msg("Invalid metadata received: ");
> +		ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
> +		return false;
> +	}
> +
>   	return true;
>   }

>   



* Re: [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp
  2022-10-27 20:00 ` [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-10-28  8:40   ` Jesper Dangaard Brouer
  2022-10-28 18:46     ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-10-28  8:40 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev


On 27/10/2022 22.00, Stanislav Fomichev wrote:
> xskxceiver conveniently sets up veth pairs, so it seems logical
> to use veth as an example for some of the metadata handling.
> 
> We timestamp the skb right when we "receive" it, store its
> pointer in xdp_buff->priv, and generate BPF bytecode to
> reach it from the BPF program.
> 
> This largely follows the idea of "store some queue context in
> the xdp_buff/xdp_frame so the metadata can be reached
> from the BPF program".
> 
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
>   1 file changed, 31 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 09682ea3354e..35396dd73de0 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -597,6 +597,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
>   
>   		xdp_convert_frame_to_buff(frame, &xdp);
>   		xdp.rxq = &rq->xdp_rxq;
> +		xdp.priv = NULL;

So, why isn't this supported for normal XDP mode?!?
E.g. where veth gets an XDP-redirected xdp_frame.

My main use case (for veth) is to make NIC hardware hints available to
containers.  Thus, creating a flexible fast-path via XDP-redirect
directly into the container's veth device.  (This is e.g. for replacing
the inflexible SR-IOV approach, with SR-IOV net_devices in the
container, with a more cloud-friendly approach.)

How can we extend this approach to handle xdp_frames from different
net_devices?


>   
>   		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>   
> @@ -820,6 +821,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>   
>   	orig_data = xdp.data;
>   	orig_data_end = xdp.data_end;
> +	xdp.priv = skb;
>   

So, this enables the SKB-based path only.

>   	act = bpf_prog_run_xdp(xdp_prog, &xdp);
>   
> @@ -936,6 +938,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>   			struct sk_buff *skb = ptr;
>   
>   			stats->xdp_bytes += skb->len;
> +			__net_timestamp(skb);
>   			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>   			if (skb) {
>   				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> @@ -1595,6 +1598,33 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   	}
>   }
>   
> +static int veth_unroll_kfunc(struct bpf_prog *prog, struct bpf_insn *insn)
> +{
> +	u32 func_id = insn->imm;
> +
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_HAVE_RX_TIMESTAMP)) {
> +		/* return true; */
> +		insn[0] = BPF_MOV64_IMM(BPF_REG_0, 1);
> +		return 1;
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> +		/* r1 = ((struct xdp_buff *)r1)->priv; [skb] */
> +		insn[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1,
> +				      offsetof(struct xdp_buff, priv));
> +		/* if (r1 == NULL) { */
> +		insn[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1);
> +		/*	return 0; */
> +		insn[2] = BPF_MOV64_IMM(BPF_REG_0, 0);
> +		/* } else { */
> +		/*	return ((struct sk_buff *)r1)->tstamp; */
> +		insn[3] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
> +				      offsetof(struct sk_buff, tstamp));

Just to be clear, this skb->tstamp is a software timestamp, right?

> +		/* } */
> +		return 4;
> +	}

I'm slightly concerned with driver developers maintaining BPF bytecode
on a per-driver basis, but I can certainly live with this if the BPF
maintainers can.

> +
> +	return 0;
> +}
> +
>   static const struct net_device_ops veth_netdev_ops = {
>   	.ndo_init            = veth_dev_init,
>   	.ndo_open            = veth_open,
> @@ -1614,6 +1644,7 @@ static const struct net_device_ops veth_netdev_ops = {
>   	.ndo_bpf		= veth_xdp,
>   	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
>   	.ndo_get_peer_dev	= veth_peer_dev,
> +	.ndo_unroll_kfunc       = veth_unroll_kfunc,
>   };
>   
>   #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \



* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-28  6:22   ` Martin KaFai Lau
@ 2022-10-28 10:37     ` Jesper Dangaard Brouer
  2022-10-28 18:46       ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-10-28 10:37 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: brouer, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, bpf


On 28/10/2022 08.22, Martin KaFai Lau wrote:
> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
>> Example on how the metadata is prepared from the BPF context
>> and consumed by AF_XDP:
>>
>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
>>    if not, I'm assuming verifier will remove this "if (0)" branch
>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
>>    the program has to bpf_xdp_adjust_meta+memcpy it;
>>    maybe returning a pointer is better?
>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
>>    makes sure timestamp is not empty
>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
>>
>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Willem de Bruijn <willemb@google.com>
>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>> Cc: xdp-hints@xdp-project.net
>> Cc: netdev@vger.kernel.org
>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> ---
>>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
>>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
>>   2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c 
>> b/tools/testing/selftests/bpf/progs/xskxceiver.c
>> index b135daddad3a..83c879aa3581 100644
>> --- a/tools/testing/selftests/bpf/progs/xskxceiver.c
>> +++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
>> @@ -12,9 +12,31 @@ struct {
>>       __type(value, __u32);
>>   } xsk SEC(".maps");
>> +extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx) 
>> __ksym;
>> +extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
>> +
>>   SEC("xdp")
>>   int rx(struct xdp_md *ctx)
>>   {
>> +    void *data, *data_meta;
>> +    __u32 rx_timestamp;
>> +    int ret;
>> +
>> +    if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {

In the current veth implementation, bpf_xdp_metadata_have_rx_timestamp()
will always return true here.

In the case of hardware timestamps, not every packet will contain a
hardware timestamp.  (See my/Maryam's ixgbe patch, where timestamps are
read via a HW device register, which isn't fast, and where the HW only
supports this for timesync protocols like PTP.)

How do you imagine we can extend this?

>> +        ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));

IMHO sizeof() should come from a struct describing the data_meta area, see:
 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
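
A minimal sketch of that idea (the struct and member names here are
illustrative, not from this patch):

	struct meta_info {
		__u64 rx_timestamp;
	};

	/* the size then comes from the struct, not a bare __u32: */
	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta_info));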


>> +        if (ret != 0)
>> +            return XDP_DROP;
>> +
>> +        data = (void *)(long)ctx->data;
>> +        data_meta = (void *)(long)ctx->data_meta;
>> +
>> +        if (data_meta + sizeof(__u32) > data)
>> +            return XDP_DROP;
>> +
>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));

So, this approach first stores hints in some other memory location, and
then needs to copy the information over into the data_meta area. That
isn't good from a performance perspective.

My idea is to store it in the final data_meta destination immediately.

Do notice that in my approach, the existing ethtool config settings and
socket options (for timestamps) still apply.  Thus, each individual
hardware hint is already configurable, meaning we already have a config
interface. I do acknowledge that in case a feature is disabled it still
takes up space in the data_meta area, but importantly it is NOT stored
into the area (for performance reasons).


>> +    }
> 
> Thanks for the patches.  I took a quick look at patch 1 and 2 but 
> haven't had a chance to look at the implementation details (eg. 
> KF_UNROLL...etc), yet.
> 

Yes, thanks for the patches, even though I don't agree with the
approach, at least until my concerns/use-case can be resolved.
IMHO the best way to convince people is through code, so thank you for
the effort.  Hopefully we can use some of these ideas, and I can also
change/adjust my XDP-hints ideas to incorporate some of this :-)


> Overall (with the example here) looks promising.  There is a lot of 
> flexibility on whether the xdp prog needs any hint at all, which hint it 
> needs, and how to store it.
> 

I do see the advantage that the XDP prog only populates the metadata it
needs.  But how can we use/access this in __xdp_build_skb_from_frame()?


>> +
>>       return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
>>   }
>> diff --git a/tools/testing/selftests/bpf/xskxceiver.c 
>> b/tools/testing/selftests/bpf/xskxceiver.c
>> index 066bd691db13..ce82c89a432e 100644
>> --- a/tools/testing/selftests/bpf/xskxceiver.c
>> +++ b/tools/testing/selftests/bpf/xskxceiver.c
>> @@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info 
>> *umem, struct pkt_stream *pkt
>>   static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr, 
>> u32 len)
>>   {
>>       void *data = xsk_umem__get_data(buffer, addr);
>> +    void *data_meta = data - sizeof(__u32);
>>       struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct 
>> ethhdr));
>> +    __u32 rx_timestamp = 0;
>>       if (!pkt) {
>>           ksft_print_msg("[%s] too many packets received\n", __func__);
>> @@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void 
>> *buffer, u64 addr, u32 len)
>>           return false;
>>       }
>> +    memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));

I acknowledge that it is too extensive to add to this patch, but in my
AF_XDP-interaction example[1], I'm creating a struct xdp_hints_rx_time
that gets BTF-exported[1][2] to the userspace application; userspace
decodes the BTF and gets[3] an xsk_btf_member struct for each member,
which simply contains an offset+size to read from.

[1] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L47-L51

[2] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80

[3] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_user.c#L123-L129
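
Once the BTF is decoded, each member is effectively just an offset+size
pair, so the per-packet userspace read boils down to something like this
(sketch; the member-struct fields are assumed):

	struct xsk_btf_member { __u32 offset; __u32 size; };

	/* 'mem' is resolved from the BTF once at startup, then
	 * reused for every packet: */
	memcpy(&rx_timestamp, (char *)data_meta + mem.offset, mem.size);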

>> +    if (rx_timestamp == 0) {
>> +        ksft_print_msg("Invalid metadata received: ");
>> +        ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
>> +        return false;
>> +    }
>> +
>>       return true;
>>   }
> 

Looking forward to collaborating :-)
--Jesper


^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2022-10-27 20:00 ` [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver Stanislav Fomichev
@ 2022-10-28 15:58 ` John Fastabend
  2022-10-28 18:04   ` Jakub Kicinski
  5 siblings, 1 reply; 50+ messages in thread
From: John Fastabend @ 2022-10-28 15:58 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev wrote:
> This is an RFC for the alternative approach suggested by Martin and
> Jakub. I've tried to CC most of the people from the original discussion,
> feel free to add more if you think I've missed somebody.
> 
> Summary:
> - add new BPF_F_XDP_HAS_METADATA program flag and abuse
>   attr->prog_ifindex to pass target device ifindex at load time
> - at load time, find appropriate ndo_unroll_kfunc and call
>   it to unroll/inline kfuncs; kfuncs have the default "safe"
>   implementation if unrolling is not supported by a particular
>   device
> - rewrite xskxceiver test to use C bpf program and extend
>   it to export rx_timestamp (plus add rx timestamp to veth driver)
> 
> I've intentionally kept it small and hacky to see whether the approach is
> workable or not.

Hi,

I need RX timestamps now as well, so I was going to work on some code
next week too.

My plan was to simply put a kptr to the rx descriptor in the xdp
buffer. If I can read the rx descriptor, I can read the timestamp,
the rxhash, and any other metadata the NIC has written out. All the
drivers I've looked at stash the data there.

I'll inline the pros/cons compared to this below as I see them.

> 
> Pros:
> - we avoid BTF complexity; the BPF programs themselves are now responsible
>   for agreeing on the metadata layout with the AF_XDP consumer

Same: no BTF is needed on the kernel side. Userspace and BPF progs get
to sort it out.

> - the metadata is free if not used

Same.

> - the metadata should, in theory, be cheap if used; kfuncs should be
>   unrolled to the same code as if the metadata was pre-populated and
>   passed with a BTF id

Same; it's just a kptr at this point. Also, one more advantage would
be the ability to read the data without copying it.

> - it's not all or nothing; users can use small subset of metadata which
>   is more efficient than the BTF id approach where all metadata has to be
>   exposed for every frame (and selectively consumed by the users)

Same.

> 
> Cons:
> - forwarding has to be handled explicitly; the BPF programs have to
>   agree on the metadata layout (IOW, the forwarding program
>   has to be aware of the final AF_XDP consumer metadata layout)

Same, although IMO this is a PRO. You only get the bytes you need
and care about, and you can also augment it with extra good stuff so
calculations only happen once.

> - TX picture is not clear; but it's not clear with BTF ids as well;
>   I think we've agreed that just reusing whatever we have at RX
>   won't fly at TX; seems like TX XDP program might be the answer
>   here? (with a set of another tx kfuncs to "expose" bpf/af_xdp metatata
>   back into the kernel)


Agree TX is not addressed.


A bit of extra commentary. By exposing the raw kptr to the rx
descriptor, we don't need driver writers to do anything, and we
can easily support all the drivers out of the gate with simple
one- or two-line changes. This pushes the interesting parts
into userspace, and then BPF writers get to do the work without
bothering driver folks; also, if it's not done today it doesn't
matter, because user space can come along and make it work
later. So no scattered kernel dependencies, which I really
would like to avoid here. It's actually very painful to have
to support clusters with N kernels and M devices if they
have different features. Doable, but annoying, and much nicer
if we can just say 6.2 has support for kptr rx descriptor reading
and all XDP drivers support it. Then timestamp and rxhash work
across the board.

To find the offset of fields (rxhash, timestamp) you can use
standard BTF relocations; we have all this machinery built up
already for all the other structs we read (net_devices, task
structs, inodes, ...), so it's not a big hurdle at all IMO. We
can add userspace libs if folks really care, but it's just
a read, so I'm not even sure that is helpful.
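
For example, with CO-RE the read could look roughly like this (sketch;
mlx5's completion queue entry stands in for whatever descriptor type
the kptr would carry, and get_rx_desc_kptr() is a made-up helper):

	/* the field offset is relocated against the running kernel's
	 * BTF, so the program doesn't hardcode any HW layout */
	struct mlx5_cqe64 *cqe = get_rx_desc_kptr(ctx);
	__u64 ts = 0;

	if (cqe)
		ts = bpf_be64_to_cpu(BPF_CORE_READ(cqe, timestamp));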

I think it's nicer than having kfuncs that need to be written
everywhere. My $.02, although I'll poke around with the below
some as well. Feel free to just hang tight until I have some
code; at the moment I have Intel and Mellanox drivers that I
would want to support.

> 
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> 
> Stanislav Fomichev (5):
>   bpf: Support inlined/unrolled kfuncs for xdp metadata
>   veth: Support rx timestamp metadata for xdp
>   libbpf: Pass prog_ifindex via bpf_object_open_opts
>   selftests/bpf: Convert xskxceiver to use custom program
>   selftests/bpf: Test rx_timestamp metadata in xskxceiver
> 
>  drivers/net/veth.c                            |  31 +++++
>  include/linux/bpf.h                           |   1 +
>  include/linux/btf.h                           |   1 +
>  include/linux/btf_ids.h                       |   4 +
>  include/linux/netdevice.h                     |   3 +
>  include/net/xdp.h                             |  22 ++++
>  include/uapi/linux/bpf.h                      |   5 +
>  kernel/bpf/syscall.c                          |  28 ++++-
>  kernel/bpf/verifier.c                         |  60 +++++++++
>  net/core/dev.c                                |   7 ++
>  net/core/xdp.c                                |  28 +++++
>  tools/include/uapi/linux/bpf.h                |   5 +
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/lib/bpf/libbpf.h                        |   6 +-
>  tools/testing/selftests/bpf/Makefile          |   1 +
>  .../testing/selftests/bpf/progs/xskxceiver.c  |  43 +++++++
>  tools/testing/selftests/bpf/xskxceiver.c      | 119 +++++++++++++++---
>  tools/testing/selftests/bpf/xskxceiver.h      |   5 +-
>  18 files changed, 348 insertions(+), 22 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/xskxceiver.c
> 
> -- 
> 2.38.1.273.g43a17bfeac-goog
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-28 15:58 ` [RFC bpf-next 0/5] xdp: hints via kfuncs John Fastabend
@ 2022-10-28 18:04   ` Jakub Kicinski
  2022-10-28 18:46     ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Jakub Kicinski @ 2022-10-28 18:04 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, kpsingh, haoluo, jolsa, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> A bit of extra commentary. By exposing the raw kptr to the rx
> descriptor we don't need driver writers to do anything.
> And can easily support all the drivers out the gate with simple
> one or two line changes. This pushes the interesting parts
> into userspace and then BPF writers get to do the work without
> bother driver folks and also if its not done today it doesn't
> matter because user space can come along and make it work
> later. So no scattered kernel dependencies which I really
> would like to avoid here. Its actually very painful to have
> to support clusters with N kernels and M devices if they
> have different features. Doable but annoying and much nicer
> if we just say 6.2 has support for kptr rx descriptor reading
> and all XDP drivers support it. So timestamp, rxhash work
> across the board.

IMHO that's a bit of wishful thinking. Driver support is just a small
piece; you'll have different HW and FW versions, feature conflicts, etc.
In the end the kernel version is just one variable, and there are many
others you'll already have to track.

And it's actually harder to abstract away inter HW generation
differences if the user space code has to handle all of it.

> To find the offset of fields (rxhash, timestamp) you can use
> standard BTF relocations we have all this machinery built up
> already for all the other structs we read, net_devices, task
> structs, inodes, ... so its not a big hurdle at all IMO. We
> can add userspace libs if folks really care, but its just a read so
> I'm not even sure that is helpful.
> 
> I think its nicer than having kfuncs that need to be written
> everywhere. My $.02 although I'll poke around with below
> some as well. Feel free to just hang tight until I have some
> code at the moment I have intel, mellanox drivers that I
> would want to support.

I'd prefer if we left the door open for new vendors. Punting descriptor
parsing to user space will indeed result in what you just said - major
vendors are supported and that's it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-28 18:04   ` Jakub Kicinski
@ 2022-10-28 18:46     ` Stanislav Fomichev
  2022-10-28 23:16       ` John Fastabend
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-28 18:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: John Fastabend, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	kpsingh, haoluo, jolsa, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On Fri, Oct 28, 2022 at 11:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> > A bit of extra commentary. By exposing the raw kptr to the rx
> > descriptor we don't need driver writers to do anything.
> > And can easily support all the drivers out the gate with simple
> > one or two line changes. This pushes the interesting parts
> > into userspace and then BPF writers get to do the work without
> > bother driver folks and also if its not done today it doesn't
> > matter because user space can come along and make it work
> > later. So no scattered kernel dependencies which I really
> > would like to avoid here. Its actually very painful to have
> > to support clusters with N kernels and M devices if they
> > have different features. Doable but annoying and much nicer
> > if we just say 6.2 has support for kptr rx descriptor reading
> > and all XDP drivers support it. So timestamp, rxhash work
> > across the board.
>
> IMHO that's a bit of wishful thinking. Driver support is just a small
> piece, you'll have different HW and FW versions, feature conflicts etc.
> In the end kernel version is just one variable and there are many others
> you'll already have to track.
>
> And it's actually harder to abstract away inter HW generation
> differences if the user space code has to handle all of it.

I've had the same concern:

Until we have some userspace library that abstracts all these details,
it's not really convenient to use. IIUC, with a kptr, I'd get a blob
of data and I need to go through the code and see what particular type
it represents for my particular device and how the data I need is
represented there. There are also these "if this is device v1 -> use
the v1 descriptor format; if it's a v2 -> use this other struct; etc."
complexities that we'll be pushing onto the users. With kfuncs, we put
this burden on the driver developers, but I agree that the drawback
here is that we actually have to wait for the implementations to catch
up.

Jakub mentions FW and I haven't even thought about that; so yeah, bpf
programs might have to take a lot of other state into consideration
when parsing the descriptors; all those details do seem like they
belong to the driver code.

Feel free to send it early with just a handful of drivers implemented;
I'm more interested in the bpf/af_xdp/user API story; if we have some
nice sample/test case that shows how the metadata can be used, that
might push us closer to the agreement on the best way to proceed.



> > To find the offset of fields (rxhash, timestamp) you can use
> > standard BTF relocations we have all this machinery built up
> > already for all the other structs we read, net_devices, task
> > structs, inodes, ... so its not a big hurdle at all IMO. We
> > can add userspace libs if folks really care, but its just a read so
> > I'm not even sure that is helpful.
> >
> > I think its nicer than having kfuncs that need to be written
> > everywhere. My $.02 although I'll poke around with below
> > some as well. Feel free to just hang tight until I have some
> > code at the moment I have intel, mellanox drivers that I
> > would want to support.
>
> I'd prefer if we left the door open for new vendors. Punting descriptor
> parsing to user space will indeed result in what you just said - major
> vendors are supported and that's it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-28 10:37     ` Jesper Dangaard Brouer
@ 2022-10-28 18:46       ` Stanislav Fomichev
  2022-10-31 14:20         ` Alexander Lobakin
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-28 18:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Martin KaFai Lau, brouer, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 28/10/2022 08.22, Martin KaFai Lau wrote:
> > On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> >> Example on how the metadata is prepared from the BPF context
> >> and consumed by AF_XDP:
> >>
> >> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> >>    if not, I'm assuming verifier will remove this "if (0)" branch
> >> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> >>    the program has to bpf_xdp_adjust_meta+memcpy it;
> >>    maybe returning a pointer is better?
> >> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> >>    makes sure timestamp is not empty
> >> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> >>
> >> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> >> Cc: Jakub Kicinski <kuba@kernel.org>
> >> Cc: Willem de Bruijn <willemb@google.com>
> >> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> >> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> >> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> >> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> >> Cc: Maryam Tahhan <mtahhan@redhat.com>
> >> Cc: xdp-hints@xdp-project.net
> >> Cc: netdev@vger.kernel.org
> >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >> ---
> >>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> >>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> >>   2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> b/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> index b135daddad3a..83c879aa3581 100644
> >> --- a/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> +++ b/tools/testing/selftests/bpf/progs/xskxceiver.c
> >> @@ -12,9 +12,31 @@ struct {
> >>       __type(value, __u32);
> >>   } xsk SEC(".maps");
> >> +extern int bpf_xdp_metadata_have_rx_timestamp(struct xdp_md *ctx)
> >> __ksym;
> >> +extern __u32 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >> +
> >>   SEC("xdp")
> >>   int rx(struct xdp_md *ctx)
> >>   {
> >> +    void *data, *data_meta;
> >> +    __u32 rx_timestamp;
> >> +    int ret;
> >> +
> >> +    if (bpf_xdp_metadata_have_rx_timestamp(ctx)) {
>
> In current veth implementation, bpf_xdp_metadata_have_rx_timestamp()
> will always return true here.
>
> In the case of hardware timestamps, not every packet will contain a
> hardware timestamp.  (See my/Maryam ixgbe patch, where timestamps are
> read via HW device register, which isn't fast, and HW only support this
> for timesync protocols like PTP).
>
> How do you imagine we can extend this?

I'm always returning true for simplicity. In the real world, this
bytecode will look at the descriptors and return true/false depending
on whether the info is there or not.

> >> +        ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(__u32));
>
> IMHO sizeof() should come from a struct describing data_meta area see:
>
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62

I guess I should've used pointers for the return type instead, something like:

extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;

{
   ...
    __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
    if (rx_timestamp) {
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp)))
            return XDP_DROP;
        /* adjust_meta invalidates data/data_meta, so re-read and
         * re-check the bounds before writing: */
        data = (void *)(long)ctx->data;
        data_meta = (void *)(long)ctx->data_meta;
        if (data_meta + sizeof(*rx_timestamp) > data)
            return XDP_DROP;
        __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
    }
}

Does that look better?

> >> +        if (ret != 0)
> >> +            return XDP_DROP;
> >> +
> >> +        data = (void *)(long)ctx->data;
> >> +        data_meta = (void *)(long)ctx->data_meta;
> >> +
> >> +        if (data_meta + sizeof(__u32) > data)
> >> +            return XDP_DROP;
> >> +
> >> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
>
> So, this approach first stores hints on some other memory location, and
> then need to copy over information into data_meta area. That isn't good
> from a performance perspective.
>
> My idea is to store it in the final data_meta destination immediately.

This approach doesn't have to store the hints in another memory
location. xdp_buff->priv can point to the real hw descriptor, and the
kfunc can have bytecode that extracts the data from the hw
descriptors. For this particular RFC, we can think of the 'skb' as that
hw descriptor for the veth driver.

> Do notice that in my approach, the existing ethtool config setting and
> socket options (for timestamps) still apply.  Thus, each individual
> hardware hint are already configurable. Thus we already have a config
> interface. I do acknowledge, that in-case a feature is disabled it still
> takes up space in data_meta areas, but importantly it is NOT stored into
> the area (for performance reasons).

That should be the case with this RFC as well, shouldn't it? Worst-case
scenario, the kfunc bytecode can explicitly check the ethtool options and
return false if the feature is disabled?
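
If the setting is fixed at load time, the unroll hook could even fold
it into a constant, something like this (sketch; the rx_tstamp_enabled
flag and how the hook gets at the device state are made up):

	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_HAVE_RX_TIMESTAMP)) {
		/* constant-folded at load time from the current config */
		insn[0] = BPF_MOV64_IMM(BPF_REG_0, rx_tstamp_enabled ? 1 : 0);
		return 1;
	}

The verifier can then dead-code-eliminate the disabled branch in the
program.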

> >> +    }
> >
> > Thanks for the patches.  I took a quick look at patch 1 and 2 but
> > haven't had a chance to look at the implementation details (eg.
> > KF_UNROLL...etc), yet.
> >
>
> Yes, thanks for the patches, even-though I don't agree with the
> approach, at-least until my concerns/use-case can be resolved.
> IMHO the best way to convince people is through code. So, thank you for
> the effort.  Hopefully we can use some of these ideas and I can also
> change/adjust my XDP-hints ideas to incorporate some of this :-)

Thank you for the feedback as well, appreciate it!
Definitely, looking forward to a v2 from you with some more clarity on
how those btf ids are handled by the bpf/af_xdp side!

> > Overall (with the example here) looks promising.  There is a lot of
> > flexibility on whether the xdp prog needs any hint at all, which hint it
> > needs, and how to store it.
> >
>
> I do see the advantage that XDP prog only populates metadata it needs.
> But how can we use/access this in __xdp_build_skb_from_frame() ?

I don't think __xdp_build_skb_from_frame is automagically solved by
either proposal?
For this proposal, there has to be some expected kernel metadata
format that bpf programs will prepare and the kernel will understand.
Think of it like xdp_hints_common from your proposal; the program will
have to assemble an xdp_hints_skb in the xdp metadata with the parts
that the kernel can populate into the skb.
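
Something like the below, as the very rough shape of it (sketch; no
such struct exists today, the members are made up):

	/* layout both sides would agree on for xdp->skb conversion */
	struct xdp_hints_skb {
		__u64 rx_timestamp;
		__u32 rx_hash;
		__u32 valid;	/* bitmask saying which fields are set */
	};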

For your btf ids proposal, it seems there has to be some extra kernel
code to parse all possible driver btf_id formats and copy the
metadata?





> >> +
> >>       return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
> >>   }
> >> diff --git a/tools/testing/selftests/bpf/xskxceiver.c
> >> b/tools/testing/selftests/bpf/xskxceiver.c
> >> index 066bd691db13..ce82c89a432e 100644
> >> --- a/tools/testing/selftests/bpf/xskxceiver.c
> >> +++ b/tools/testing/selftests/bpf/xskxceiver.c
> >> @@ -871,7 +871,9 @@ static bool is_offset_correct(struct xsk_umem_info
> >> *umem, struct pkt_stream *pkt
> >>   static bool is_pkt_valid(struct pkt *pkt, void *buffer, u64 addr,
> >> u32 len)
> >>   {
> >>       void *data = xsk_umem__get_data(buffer, addr);
> >> +    void *data_meta = data - sizeof(__u32);
> >>       struct iphdr *iphdr = (struct iphdr *)(data + sizeof(struct
> >> ethhdr));
> >> +    __u32 rx_timestamp = 0;
> >>       if (!pkt) {
> >>           ksft_print_msg("[%s] too many packets received\n", __func__);
> >> @@ -907,6 +909,13 @@ static bool is_pkt_valid(struct pkt *pkt, void
> >> *buffer, u64 addr, u32 len)
> >>           return false;
> >>       }
> >> +    memcpy(&rx_timestamp, data_meta, sizeof(rx_timestamp));
>
> I acknowledge that it is too extensive to add to this patch, but in my
> AF_XDP-interaction example[1], I'm creating a struct xdp_hints_rx_time
> that gets BTF exported[1][2] to the userspace application, and userspace
> decodes the BTF and gets[3] a xsk_btf_member struct for members that
> simply contains a offset+size to read from.
>
> [1]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L47-L51
>
> [2]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
>
> [3]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_user.c#L123-L129
>
> >> +    if (rx_timestamp == 0) {
> >> +        ksft_print_msg("Invalid metadata received: ");
> >> +        ksft_print_msg("got %08x, expected != 0\n", rx_timestamp);
> >> +        return false;
> >> +    }
> >> +
> >>       return true;
> >>   }
> >
>
> Looking forward to collaborate :-)
> --Jesper
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp
  2022-10-28  8:40   ` Jesper Dangaard Brouer
@ 2022-10-28 18:46     ` Stanislav Fomichev
  0 siblings, 0 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-28 18:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Fri, Oct 28, 2022 at 1:40 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 27/10/2022 22.00, Stanislav Fomichev wrote:
> > xskxceiver conveniently setups up veth pairs so it seems logical
> > to use veth as an example for some of the metadata handling.
> >
> > We timestamp skb right when we "receive" it, store its
> > pointer in xdp_buff->priv and generate BPF bytecode to
> > reach it from the BPF program.
> >
> > This largely follows the idea of "store some queue context in
> > the xdp_buff/xdp_frame so the metadata can be reached out
> > from the BPF program".
> >
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
> >   1 file changed, 31 insertions(+)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 09682ea3354e..35396dd73de0 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -597,6 +597,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
> >
> >               xdp_convert_frame_to_buff(frame, &xdp);
> >               xdp.rxq = &rq->xdp_rxq;
> > +             xdp.priv = NULL;
>
> So, why doesn't this supported for normal XDP mode?!?
> e.g. Where veth gets XDP redirected an xdp_frame.

I wanted to have something simple for demonstration purposes
(hence the reuse of xskxceiver + veth without redirection).
But also see my cover letter:

Cons:
- forwarding has to be handled explicitly; the BPF programs have to
  agree on the metadata layout (IOW, the forwarding program
  has to be aware of the final AF_XDP consumer metadata layout)

> My main use case (for veth) is to make NIC hardware hints available to
> containers.  Thus, creating a flexible fast-path via XDP-redirect
> directly into containers veth device.  (This is e.g. for replacing the
> inflexible SR-IOV approach with SR-IOV net_devices in the container,
> with a more cloud friendly approach).
>
> How can we extend this approach to handle xdp_frame's from different
> net_device's ?

So for this case, your forwarding program will have to call a bunch
of kfuncs and assemble the metadata.
It can also include some info about the metadata format. In theory, it
can even put in some external btf-id for the struct that describes the
layout, or it can use some TLV format.
And then the final consumer will have to decide what to do with that metadata.

Or do you want the xdp->skb conversion to also be handled transparently?
In that case, the last program will have to convert this to some new
xdp_hints_skb so the kernel can understand it. We might need some
extra helpers to signal that, but it seems doable?
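
With a TLV format, that could be as simple as (sketch; the type values
are made up):

	struct meta_tlv {
		__u16 type;	/* e.g. 1 = rx_timestamp, 2 = rx_hash */
		__u16 len;	/* payload bytes following the header */
	};

and the final consumer walks the TLVs it understands and skips the
rest.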

> >
> >               act = bpf_prog_run_xdp(xdp_prog, &xdp);
> >
> > @@ -820,6 +821,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
> >
> >       orig_data = xdp.data;
> >       orig_data_end = xdp.data_end;
> > +     xdp.priv = skb;
> >
>
> So, enabling SKB based path only.
>
> >       act = bpf_prog_run_xdp(xdp_prog, &xdp);
> >
> > @@ -936,6 +938,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> >                       struct sk_buff *skb = ptr;
> >
> >                       stats->xdp_bytes += skb->len;
> > +                     __net_timestamp(skb);
> >                       skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> >                       if (skb) {
> >                               if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> > @@ -1595,6 +1598,33 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >       }
> >   }
> >
> > +static int veth_unroll_kfunc(struct bpf_prog *prog, struct bpf_insn *insn)
> > +{
> > +     u32 func_id = insn->imm;
> > +
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_HAVE_RX_TIMESTAMP)) {
> > +             /* return true; */
> > +             insn[0] = BPF_MOV64_IMM(BPF_REG_0, 1);
> > +             return 1;
> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > +             /* r1 = ((struct xdp_buff *)r1)->priv; [skb] */
> > +             insn[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1,
> > +                                   offsetof(struct xdp_buff, priv));
> > +             /* if (r1 == NULL) { */
> > +             insn[1] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1);
> > +             /*      return 0; */
> > +             insn[2] = BPF_MOV64_IMM(BPF_REG_0, 0);
> > +             /* } else { */
> > +             /*      return ((struct sk_buff *)r1)->tstamp; */
> > +             insn[3] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
> > +                                   offsetof(struct sk_buff, tstamp));
>
> Just to be clear, this skb->tstamp is a software timestamp, right?

Yes, see above, this is just to showcase how the bpf/af_xdp side will
look. The 1st patch and the last one are the interesting ones. The
rest is boring plumbing we can ignore for now.




> > +             /* } */
> > +             return 4;
> > +     }
>
> I'm slightly concerned with driver developers maintaining BPF-bytecode
> on a per-driver bases, but I can certainly live with this if BPF
> maintainers can.
>
> > +
> > +     return 0;
> > +}
> > +
> >   static const struct net_device_ops veth_netdev_ops = {
> >       .ndo_init            = veth_dev_init,
> >       .ndo_open            = veth_open,
> > @@ -1614,6 +1644,7 @@ static const struct net_device_ops veth_netdev_ops = {
> >       .ndo_bpf                = veth_xdp,
> >       .ndo_xdp_xmit           = veth_ndo_xdp_xmit,
> >       .ndo_get_peer_dev       = veth_peer_dev,
> > +     .ndo_unroll_kfunc       = veth_unroll_kfunc,
> >   };
> >
> >   #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-28 18:46     ` Stanislav Fomichev
@ 2022-10-28 23:16       ` John Fastabend
  2022-10-29  1:14         ` Jakub Kicinski
  0 siblings, 1 reply; 50+ messages in thread
From: John Fastabend @ 2022-10-28 23:16 UTC (permalink / raw)
  To: Stanislav Fomichev, Jakub Kicinski
  Cc: John Fastabend, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	kpsingh, haoluo, jolsa, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev wrote:
> On Fri, Oct 28, 2022 at 11:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 28 Oct 2022 08:58:18 -0700 John Fastabend wrote:
> > > A bit of extra commentary. By exposing the raw kptr to the rx
> > > descriptor we don't need driver writers to do anything.
> > > And can easily support all the drivers out the gate with simple
> > > one or two line changes. This pushes the interesting parts
> > > into userspace and then BPF writers get to do the work without
> > > bother driver folks and also if its not done today it doesn't
> > > matter because user space can come along and make it work
> > > later. So no scattered kernel dependencies which I really
> > > would like to avoid here. Its actually very painful to have
> > > to support clusters with N kernels and M devices if they
> > > have different features. Doable but annoying and much nicer
> > > if we just say 6.2 has support for kptr rx descriptor reading
> > > and all XDP drivers support it. So timestamp, rxhash work
> > > across the board.
> >
> > IMHO that's a bit of wishful thinking. Driver support is just a small
> > piece, you'll have different HW and FW versions, feature conflicts etc.
> > In the end kernel version is just one variable and there are many others
> > you'll already have to track.

Agree.

> >
> > And it's actually harder to abstract away inter HW generation
> > differences if the user space code has to handle all of it.

I don't see how it's any harder in practice, though?

> 
> I've had the same concern:
> 
> Until we have some userspace library that abstracts all these details,
> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> of data and I need to go through the code and see what particular type
> it represents for my particular device and how the data I need is
> represented there. There are also these "if this is device v1 -> use
> v1 descriptor format; if it's a v2->use this another struct; etc"
> complexities that we'll be pushing onto the users. With kfuncs, we put
> this burden on the driver developers, but I agree that the drawback
> here is that we actually have to wait for the implementations to catch
> up.

I agree with everything there: you will get a blob of data and then
will need to know what field you want to read using BTF. But we
already do this for BPF programs all over the place, so it's not a big
lift for us. All other BPF tracing/observability requires the same
logic. For users of BPF in general, XDP/tc are perhaps the only
places left to write BPF programs without thinking about BTF and
kernel data structures.

But with the proposed kptr, the complexity lives in userspace and can be
fixed, added to, and updated without having to bother with kernel updates.
From my point of view of supporting Cilium, it's a win and much preferred
to having to deal with driver owners on all cloud vendors, distributions,
and so on.

If vendor updates firmware with new fields I get those immediately.

> 
> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> programs might have to take a lot of other state into consideration
> when parsing the descriptors; all those details do seem like they
> belong to the driver code.

I would prefer to avoid being stuck on requiring driver writers to
be involved. With just a kptr, I can support the device and any
firmware version without requiring help.

> 
> Feel free to send it early with just a handful of drivers implemented;
> I'm more interested about bpf/af_xdp/user api story; if we have some
> nice sample/test case that shows how the metadata can be used, that
> might push us closer to the agreement on the best way to proceed.

I'll try to do an Intel and an mlx implementation to get a cross section.
I have a good collection of NICs here, so I should be able to show a
couple of firmware versions. It could be fine, I think, to have the raw
kptr access and then also kfuncs for some things, perhaps.


> 
> 
> 
> > > To find the offset of fields (rxhash, timestamp) you can use
> > > standard BTF relocations we have all this machinery built up
> > > already for all the other structs we read, net_devices, task
> > > structs, inodes, ... so its not a big hurdle at all IMO. We
> > > can add userspace libs if folks really care, but its just a read so
> > > I'm not even sure that is helpful.
> > >
> > > I think its nicer than having kfuncs that need to be written
> > > everywhere. My $.02 although I'll poke around with below
> > > some as well. Feel free to just hang tight until I have some
> > > code at the moment I have intel, mellanox drivers that I
> > > would want to support.
> >
> > I'd prefer if we left the door open for new vendors. Punting descriptor
> > parsing to user space will indeed result in what you just said - major
> > vendors are supported and that's it.

I'm not sure why it would make it harder for new vendors? I think
the opposite: it would be easier, because I don't need vendor support
at all. Thinking it over, it seems there could be room for both.


Thanks!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-28 23:16       ` John Fastabend
@ 2022-10-29  1:14         ` Jakub Kicinski
  2022-10-31 14:10           ` [xdp-hints] " Bezdeka, Florian
  2022-10-31 17:01           ` John Fastabend
  0 siblings, 2 replies; 50+ messages in thread
From: Jakub Kicinski @ 2022-10-29  1:14 UTC (permalink / raw)
  To: John Fastabend
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, kpsingh, haoluo, jolsa, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > And it's actually harder to abstract away inter HW generation
> > > differences if the user space code has to handle all of it.  
> 
> I don't see how its any harder in practice though?

You need to find out what HW/FW/config you're running, right?
And all you have is a pointer to a blob of unknown type.

Take timestamps for example, some NICs support adjusting the PHC 
or doing SW corrections (with different versions of hw/fw/server
platforms being capable of both/one/neither).

Sure you can extract all this info with tracing and careful
inspection via uAPI. But I don't think that's _easier_.
And the vendors can't run the results thru their validation 
(for whatever that's worth).

> > I've had the same concern:
> > 
> > Until we have some userspace library that abstracts all these details,
> > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> > of data and I need to go through the code and see what particular type
> > it represents for my particular device and how the data I need is
> > represented there. There are also these "if this is device v1 -> use
> > v1 descriptor format; if it's a v2->use this another struct; etc"
> > complexities that we'll be pushing onto the users. With kfuncs, we put
> > this burden on the driver developers, but I agree that the drawback
> > here is that we actually have to wait for the implementations to catch
> > up.  
> 
> I agree with everything there, you will get a blob of data and then
> will need to know what field you want to read using BTF. But, we
> already do this for BPF programs all over the place so its not a big
> lift for us. All other BPF tracing/observability requires the same
> logic. I think users of BPF in general perhaps XDP/tc are the only
> place left to write BPF programs without thinking about BTF and
> kernel data structures.
> 
> But, with proposed kptr the complexity lives in userspace and can be
> fixed, added, updated without having to bother with kernel updates, etc.
> From my point of view of supporting Cilium its a win and much preferred
> to having to deal with driver owners on all cloud vendors, distributions,
> and so on.
> 
> If vendor updates firmware with new fields I get those immediately.

Conversely it's a valid concern that those who *do* actually update
their kernel regularly will have more things to worry about.

> > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> > programs might have to take a lot of other state into consideration
> > when parsing the descriptors; all those details do seem like they
> > belong to the driver code.  
> 
> I would prefer to avoid being stuck on requiring driver writers to
> be involved. With just a kptr I can support the device and any
> firwmare versions without requiring help.

1) where are you getting all those HW / FW specs :S
2) maybe *you* can, but you're not exactly not an ex-driver developer :S

> > Feel free to send it early with just a handful of drivers implemented;
> > I'm more interested about bpf/af_xdp/user api story; if we have some
> > nice sample/test case that shows how the metadata can be used, that
> > might push us closer to the agreement on the best way to proceed.  
> 
> I'll try to do a intel and mlx implementation to get a cross section.
> I have a good collection of nics here so should be able to show a
> couple firmware versions. It could be fine I think to have the raw
> kptr access and then also kfuncs for some things perhaps.
> 
> > > I'd prefer if we left the door open for new vendors. Punting descriptor
> > > parsing to user space will indeed result in what you just said - major
> > > vendors are supported and that's it.  
> 
> I'm not sure about why it would make it harder for new vendors? I think
> the opposite, 

TBH I'm only replying to the email because of the above part :)
I thought this would be self-evident, but I guess our perspectives
are different.

Perhaps you look at it from the perspective of SW running on someone
else's cloud, and being able to move to another cloud without having
to worry whether feature X is available in xdp or just skb.
I look at it from the perspective of maintaining a cloud, with people
writing random XDP applications. If I swap a NIC from an incumbent to a
(superior) startup, and cloud users are messing with raw descriptor -
I'd need to go find every XDP program out there and make sure it
understands the new descriptors.

There is a BPF foundation or whatnot now - what about starting a
certification program for cloud providers and making it clear what
features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?

> it would be easier because I don't need vendor support at all.

Can you support the Enfabrica NIC on day 1? :) To an extent, it's just
shifting the responsibility from the HW vendor to the middleware vendor.

> Thinking it over seems there could be room for both.

Are you thinking more or less of Stan's proposal, but with one of
the callbacks being "give me the raw thing"? Probably as a ro dynptr?
Possible, but I don't think we need to hold off Stan's work.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-29  1:14         ` Jakub Kicinski
@ 2022-10-31 14:10           ` Bezdeka, Florian
  2022-10-31 15:28             ` Toke Høiland-Jørgensen
  2022-10-31 17:01           ` John Fastabend
  1 sibling, 1 reply; 50+ messages in thread
From: Bezdeka, Florian @ 2022-10-31 14:10 UTC (permalink / raw)
  To: kuba, john.fastabend
  Cc: alexandr.lobakin, anatoly.burakov, sdf, song, Deric, Nemanja,
	andrii, Kiszka, Jan, magnus.karlsson, willemb, ast, brouer, yhs,
	martin.lau, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo

Hi all,

I have been closely following this discussion for some time now. It seems
we have reached the point where it's getting interesting for me.

On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > > And it's actually harder to abstract away inter HW generation
> > > > differences if the user space code has to handle all of it.  
> > 
> > I don't see how its any harder in practice though?
> 
> You need to find out what HW/FW/config you're running, right?
> And all you have is a pointer to a blob of unknown type.
> 
> Take timestamps for example, some NICs support adjusting the PHC 
> or doing SW corrections (with different versions of hw/fw/server
> platforms being capable of both/one/neither).
> 
> Sure you can extract all this info with tracing and careful
> inspection via uAPI. But I don't think that's _easier_.
> And the vendors can't run the results thru their validation 
> (for whatever that's worth).
> 
> > > I've had the same concern:
> > > 
> > > Until we have some userspace library that abstracts all these details,
> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> > > of data and I need to go through the code and see what particular type
> > > it represents for my particular device and how the data I need is
> > > represented there. There are also these "if this is device v1 -> use
> > > v1 descriptor format; if it's a v2->use this another struct; etc"
> > > complexities that we'll be pushing onto the users. With kfuncs, we put
> > > this burden on the driver developers, but I agree that the drawback
> > > here is that we actually have to wait for the implementations to catch
> > > up.  
> > 
> > I agree with everything there, you will get a blob of data and then
> > will need to know what field you want to read using BTF. But, we
> > already do this for BPF programs all over the place so its not a big
> > lift for us. All other BPF tracing/observability requires the same
> > logic. I think users of BPF in general perhaps XDP/tc are the only
> > place left to write BPF programs without thinking about BTF and
> > kernel data structures.
> > 
> > But, with proposed kptr the complexity lives in userspace and can be
> > fixed, added, updated without having to bother with kernel updates, etc.
> > From my point of view of supporting Cilium its a win and much preferred
> > to having to deal with driver owners on all cloud vendors, distributions,
> > and so on.
> > 
> > If vendor updates firmware with new fields I get those immediately.
> 
> Conversely it's a valid concern that those who *do* actually update
> their kernel regularly will have more things to worry about.
> 
> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> > > programs might have to take a lot of other state into consideration
> > > when parsing the descriptors; all those details do seem like they
> > > belong to the driver code.  
> > 
> > I would prefer to avoid being stuck on requiring driver writers to
> > be involved. With just a kptr I can support the device and any
> > firwmare versions without requiring help.
> 
> 1) where are you getting all those HW / FW specs :S
> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> 
> > > Feel free to send it early with just a handful of drivers implemented;
> > > I'm more interested about bpf/af_xdp/user api story; if we have some
> > > nice sample/test case that shows how the metadata can be used, that
> > > might push us closer to the agreement on the best way to proceed.  
> > 
> > I'll try to do a intel and mlx implementation to get a cross section.
> > I have a good collection of nics here so should be able to show a
> > couple firmware versions. It could be fine I think to have the raw
> > kptr access and then also kfuncs for some things perhaps.
> > 
> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
> > > > parsing to user space will indeed result in what you just said - major
> > > > vendors are supported and that's it.  
> > 
> > I'm not sure about why it would make it harder for new vendors? I think
> > the opposite, 
> 
> TBH I'm only replying to the email because of the above part :)
> I thought this would be self evident, but I guess our perspectives 
> are different.
> 
> Perhaps you look at it from the perspective of SW running on someone
> else's cloud, an being able to move to another cloud, without having 
> to worry if feature X is available in xdp or just skb.
> 
> I look at it from the perspective of maintaining a cloud, with people
> writing random XDP applications. If I swap a NIC from an incumbent to a
> (superior) startup, and cloud users are messing with raw descriptor -
> I'd need to go find every XDP program out there and make sure it
> understands the new descriptors.

Here is another perspective:

As an AF_XDP application developer, I don't want to deal with the
underlying hardware in detail. I'd like to request a feature from the OS
(in this case rx/tx timestamping). If the feature is available, I will
simply use it; if not, I might have to work around it, maybe by falling
back to SW timestamping.

No part of my application (BPF program included) should have to be
optimized/adjusted for all the different HW variants out there.

My application might run on bare metal/cloud/virtual systems. I do
not want to treat these scenarios differently.

I followed the idea of having a library for parsing the driver-specific
meta information. That would mean that this library has to be kept in
sync with the kernel, right? It doesn't help if a newer kernel provides
XDP hints support for more devices/drivers but the library is not
updated. That might be relevant for all the device update strategies
out there.

In addition (and maybe even in conflict with the above), we care about
zero-copy (ZC) support. Our current use case has to deal with a lot of
small packets, so we hope to benefit from that. If XDP hints support
requires a copy of the metadata, maybe to drive a HW-independent
interface, that might be a bottleneck for us.

> 
> There is a BPF foundation or whatnot now - what about starting a
> certification program for cloud providers and making it clear what
> features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?
> 
> > it would be easier because I don't need vendor support at all.
> 
> Can you support the enfabrica NIC on day 1? :) To an extent, its just
> shifting the responsibility from the HW vendor to the middleware vendor.
> 
> > Thinking it over seems there could be room for both.
> 
> Are you thinking more or less Stan's proposal but with one of 
> the callbacks being "give me the raw thing"? Probably as a ro dynptr?
> Possible, but I don't think we need to hold off Stan's work.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-28 18:46       ` Stanislav Fomichev
@ 2022-10-31 14:20         ` Alexander Lobakin
  2022-10-31 14:29           ` Alexander Lobakin
  2022-10-31 17:00           ` Stanislav Fomichev
  0 siblings, 2 replies; 50+ messages in thread
From: Alexander Lobakin @ 2022-10-31 14:20 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, Martin KaFai Lau,
	brouer, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

From: Stanislav Fomichev <sdf@google.com>
Date: Fri, 28 Oct 2022 11:46:14 -0700

> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
> >
> >
> > On 28/10/2022 08.22, Martin KaFai Lau wrote:
> > > On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> > >> Example on how the metadata is prepared from the BPF context
> > >> and consumed by AF_XDP:
> > >>
> > >> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> > >>    if not, I'm assuming verifier will remove this "if (0)" branch
> > >> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> > >>    the program has to bpf_xdp_adjust_meta+memcpy it;
> > >>    maybe returning a pointer is better?
> > >> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> > >>    makes sure timestamp is not empty
> > >> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> > >>
> > >> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > >> Cc: Jakub Kicinski <kuba@kernel.org>
> > >> Cc: Willem de Bruijn <willemb@google.com>
> > >> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > >> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > >> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > >> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > >> Cc: Maryam Tahhan <mtahhan@redhat.com>
> > >> Cc: xdp-hints@xdp-project.net
> > >> Cc: netdev@vger.kernel.org
> > >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > >> ---
> > >>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> > >>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> > >>   2 files changed, 44 insertions(+), 1 deletion(-)

[...]

> > IMHO sizeof() should come from a struct describing data_meta area see:
> >
> > https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> 
> I guess I should've used pointers for the return type instead, something like:
> 
> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> 
> {
>    ...
>     __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>     if (rx_timestamp) {
>         bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
>         __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
>     }
> }
> 
> Does that look better?

I guess it will then be resolved to a direct store, right?
I mean, to smth like

	if (rx_timestamp)
		*(u64 *)data_meta = *rx_timestamp;

, where *rx_timestamp points directly to the Rx descriptor?

> 
> > >> +        if (ret != 0)
> > >> +            return XDP_DROP;
> > >> +
> > >> +        data = (void *)(long)ctx->data;
> > >> +        data_meta = (void *)(long)ctx->data_meta;
> > >> +
> > >> +        if (data_meta + sizeof(__u32) > data)
> > >> +            return XDP_DROP;
> > >> +
> > >> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> > >> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> >
> > So, this approach first stores hints on some other memory location, and
> > then need to copy over information into data_meta area. That isn't good
> > from a performance perspective.
> >
> > My idea is to store it in the final data_meta destination immediately.
> 
> This approach doesn't have to store the hints in the other memory
> location. xdp_buff->priv can point to the real hw descriptor and the
> kfunc can have a bytecode that extracts the data from the hw
> descriptors. For this particular RFC, we can think that 'skb' is that
> hw descriptor for veth driver.

I really do think intermediate stores can be avoided with this
approach.
Oh, and BTW, if we plan to use a particular Hint in the BPF program
only, there's no need to place it in the metadata area at all, is
there? You only want to get it in your code, so just retrieve it to
the stack and that's it. data_meta is only for cases when you want
hints to appear in AF_XDP.

> 
> > Do notice that in my approach, the existing ethtool config setting and
> > socket options (for timestamps) still apply.  Thus, each individual
> > hardware hint are already configurable. Thus we already have a config
> > interface. I do acknowledge, that in-case a feature is disabled it still
> > takes up space in data_meta areas, but importantly it is NOT stored into
> > the area (for performance reasons).
> 
> That should be the case with this rfc as well, isn't it? Worst case
> scenario, that kfunc bytecode can explicitly check ethtool options and
> return false if it's disabled?

(to Jesper)

Once again, the Ethtool idea doesn't work. Let's say roughly 50% of
frames are consumed by XDP and the other 50% go to the skb path and
the stack. In the skb path, I want as much HW data as possible:
checksums, hash and so on. Let's say in the XDP prog I want only the
timestamp. What then? Disable everything but the stamp and kill the
skb path? Enable everything and kill the XDP path?
Stanislav's approach allows you to request only the fields you need
from the BPF prog directly; I don't see any reason for adding one more
layer of "oh no, I won't give you the checksum because it's disabled
via Ethtool".
Maybe I'm getting something wrong, pls explain then :P

> 
> > >> +    }
> > >
> > > Thanks for the patches.  I took a quick look at patch 1 and 2 but
> > > haven't had a chance to look at the implementation details (eg.
> > > KF_UNROLL...etc), yet.
> > >
> >
> > Yes, thanks for the patches, even-though I don't agree with the
> > approach, at-least until my concerns/use-case can be resolved.
> > IMHO the best way to convince people is through code. So, thank you for
> > the effort.  Hopefully we can use some of these ideas and I can also
> > change/adjust my XDP-hints ideas to incorporate some of this :-)
> 
> Thank you for the feedback as well, appreciate it!
> Definitely, looking forward to a v2 from you with some more clarity on
> how those btf ids are handled by the bpf/af_xdp side!
> 
> > > Overall (with the example here) looks promising.  There is a lot of
> > > flexibility on whether the xdp prog needs any hint at all, which hint it
> > > needs, and how to store it.
> > >
> >
> > I do see the advantage that XDP prog only populates metadata it needs.
> > But how can we use/access this in __xdp_build_skb_from_frame() ?
> 
> I don't think __xdp_build_skb_from_frame is automagically solved by
> either proposal?

It's solved in my approach[0], so that __xdp_build_skb_from_frame()
automatically gets the skb fields populated with the metadata if
available. But I always use a fixed generic structure, which can't
compete with your series in terms of flexibility (but solves stuff
like inter-vendor redirects and so on).
So in general I feel like there should be 2 options for metadata for
users:

1) I use one particular vendor and I always compile AF_XDP programs
   from fresh source code. I need to read/write only fields I want
   to. I'd go with kfunc or kptr here (but I don't think BPF progs
   should parse descriptor formats on their own, so your unroll NDO
   approach looks optimal for me for that case);
2) I use multiple vendors, pre-compiled AF_XDP programs or just old
   source code, I use veth and/or cpumap. So it's sorta
   back-forward-left-right-compatibility path. So here we could just
   use a fixed structure.
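
For option 2, the fixed layout could be as simple as this (purely an
illustrative sketch; the field set here is made up and is not the
actual struct from my series):

struct xdp_hints_generic {
	__u64 rx_timestamp;
	__u32 rx_hash;
	__u32 valid_fields;	/* bitmask saying which fields are filled */
};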

> For this proposal, there has to be some expected kernel metadata
> format that bpf programs will prepare and the kernel will understand?
> Think of it like xdp_hints_common from your proposal; the program will
> have to put together xdp_hints_skb into xdp metadata with the parts
> that can be populated into skb by the kernel.
> 
> For your btf ids proposal, it seems there has to be some extra kernel
> code to parse all possible driver btf_if formats and copy the
> metadata?

That's why I define a "generic" struct, so that its consumers
wouldn't have to if-else through a dozen possible IDs :P

[...]

Great stuff from my PoV, I'd probably like to have some helpers for
writing this new NDO, so that small vendors wouldn't be afraid of
implementing it as Jakub mentioned. But still sorta optimal and
elegant for me, I'm not sure I want to post a "demo version" of my
series anymore :D
I feel like this way + one more "everything-compat-fixed" couple
would satisfy most potential users.

Thanks,
Olek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-31 14:20         ` Alexander Lobakin
@ 2022-10-31 14:29           ` Alexander Lobakin
  2022-10-31 17:00           ` Stanislav Fomichev
  1 sibling, 0 replies; 50+ messages in thread
From: Alexander Lobakin @ 2022-10-31 14:29 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, Martin KaFai Lau,
	brouer, ast, daniel, andrii, song, yhs, john.fastabend, kpsingh,
	haoluo, jolsa, Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

From: Alexander Lobakin <alexandr.lobakin@intel.com>
Date: Mon, 31 Oct 2022 15:20:32 +0100

> From: Stanislav Fomichev <sdf@google.com>
> Date: Fri, 28 Oct 2022 11:46:14 -0700
> 
> > On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:

[...]

> > I don't think __xdp_build_skb_from_frame is automagically solved by
> > either proposal?
> 
> It's solved in my approach[0], so that __xdp_build_skb_from_frame()

Yeah sure, forgot to paste the link, why doncha?

[0] https://github.com/alobakin/linux/commit/a43a9d6895fa11f182becf3a7c202eeceb45a16a

> automatically gets the skb fields populated with the metadata if
> available. But I always use a fixed generic structure, which can't
> compete with your series in terms of flexibility (but solves stuff
> like inter-vendor redirects and so on).

[...]

> Thanks,
> Olek

Thanks,
Olek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 14:10           ` [xdp-hints] " Bezdeka, Florian
@ 2022-10-31 15:28             ` Toke Høiland-Jørgensen
  2022-10-31 17:00               ` Stanislav Fomichev
  2022-10-31 19:36               ` Yonghong Song
  0 siblings, 2 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-10-31 15:28 UTC (permalink / raw)
  To: Bezdeka, Florian, kuba, john.fastabend
  Cc: alexandr.lobakin, anatoly.burakov, sdf, song, Deric, Nemanja,
	andrii, Kiszka, Jan, magnus.karlsson, willemb, ast, brouer, yhs,
	martin.lau, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo

"Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:

> Hi all,
>
> I was closely following this discussion for some time now. Seems we
> reached the point where it's getting interesting for me.
>
> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>> > > > And it's actually harder to abstract away inter HW generation
>> > > > differences if the user space code has to handle all of it.  
>> > 
>> > I don't see how its any harder in practice though?
>> 
>> You need to find out what HW/FW/config you're running, right?
>> And all you have is a pointer to a blob of unknown type.
>> 
>> Take timestamps for example, some NICs support adjusting the PHC 
>> or doing SW corrections (with different versions of hw/fw/server
>> platforms being capable of both/one/neither).
>> 
>> Sure you can extract all this info with tracing and careful
>> inspection via uAPI. But I don't think that's _easier_.
>> And the vendors can't run the results thru their validation 
>> (for whatever that's worth).
>> 
>> > > I've had the same concern:
>> > > 
>> > > Until we have some userspace library that abstracts all these details,
>> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>> > > of data and I need to go through the code and see what particular type
>> > > it represents for my particular device and how the data I need is
>> > > represented there. There are also these "if this is device v1 -> use
>> > > v1 descriptor format; if it's a v2->use this another struct; etc"
>> > > complexities that we'll be pushing onto the users. With kfuncs, we put
>> > > this burden on the driver developers, but I agree that the drawback
>> > > here is that we actually have to wait for the implementations to catch
>> > > up.  
>> > 
>> > I agree with everything there, you will get a blob of data and then
>> > will need to know what field you want to read using BTF. But, we
>> > already do this for BPF programs all over the place so its not a big
>> > lift for us. All other BPF tracing/observability requires the same
>> > logic. I think users of BPF in general perhaps XDP/tc are the only
>> > place left to write BPF programs without thinking about BTF and
>> > kernel data structures.
>> > 
>> > But, with proposed kptr the complexity lives in userspace and can be
>> > fixed, added, updated without having to bother with kernel updates, etc.
>> > From my point of view of supporting Cilium its a win and much preferred
>> > to having to deal with driver owners on all cloud vendors, distributions,
>> > and so on.
>> > 
>> > If vendor updates firmware with new fields I get those immediately.
>> 
>> Conversely it's a valid concern that those who *do* actually update
>> their kernel regularly will have more things to worry about.
>> 
>> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>> > > programs might have to take a lot of other state into consideration
>> > > when parsing the descriptors; all those details do seem like they
>> > > belong to the driver code.  
>> > 
>> > I would prefer to avoid being stuck on requiring driver writers to
>> > be involved. With just a kptr I can support the device and any
>> > firwmare versions without requiring help.
>> 
>> 1) where are you getting all those HW / FW specs :S
>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>> 
>> > > Feel free to send it early with just a handful of drivers implemented;
>> > > I'm more interested about bpf/af_xdp/user api story; if we have some
>> > > nice sample/test case that shows how the metadata can be used, that
>> > > might push us closer to the agreement on the best way to proceed.  
>> > 
>> > I'll try to do a intel and mlx implementation to get a cross section.
>> > I have a good collection of nics here so should be able to show a
>> > couple firmware versions. It could be fine I think to have the raw
>> > kptr access and then also kfuncs for some things perhaps.
>> > 
>> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
>> > > > parsing to user space will indeed result in what you just said - major
>> > > > vendors are supported and that's it.  
>> > 
>> > I'm not sure about why it would make it harder for new vendors? I think
>> > the opposite, 
>> 
>> TBH I'm only replying to the email because of the above part :)
>> I thought this would be self evident, but I guess our perspectives 
>> are different.
>> 
>> Perhaps you look at it from the perspective of SW running on someone
>> else's cloud, and being able to move to another cloud, without having
>> to worry if feature X is available in xdp or just skb.
>> 
>> I look at it from the perspective of maintaining a cloud, with people
>> writing random XDP applications. If I swap a NIC from an incumbent to a
>> (superior) startup, and cloud users are messing with raw descriptor -
>> I'd need to go find every XDP program out there and make sure it
>> understands the new descriptors.
>
> Here is another perspective:
>
> As an AF_XDP application developer I don't want to deal with the
> underlying hardware in detail. I'd like to request a feature from the OS
> (in this case rx/tx timestamping). If the feature is available I will
> simply use it, if not I might have to work around it - maybe by falling
> back to SW timestamping.
>
> All parts of my application (BPF program included) should not be
> optimized/adjusted for all the different HW variants out there.

Yes, absolutely agreed. Abstracting away those kinds of hardware
differences is the whole *point* of having an OS/driver model. I.e.,
it's what the kernel is there for! If people want to bypass that and get
direct access to the hardware, they can already do that by using DPDK.

So in other words, 100% agreed that we should not expect the BPF
developers to deal with hardware details as would be required with a
kptr-based interface.

As for the kfunc-based interface, I think it shows some promise.
Exposing a list of function names to retrieve individual metadata items
instead of a struct layout is sorta comparable in terms of developer UI
accessibility etc (IMO).

There are three main drawbacks, AFAICT:

1. It requires driver developers to write and maintain the code that
generates the unrolled BPF bytecode to access the metadata fields, which
is a non-trivial amount of complexity. Maybe this can be abstracted away
with some internal helpers though (like, e.g., a
bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
the required JMP/MOV/LDX instructions?

2. AF_XDP programs won't be able to access the metadata without using a
custom XDP program that calls the kfuncs and puts the data into the
metadata area. We could solve this with some code in libxdp, though; if
this code can be made generic enough (so it just dumps the available
metadata functions from the running kernel at load time), it may be
possible to make it generic enough that it will be forward-compatible
with new versions of the kernel that add new fields, which should
alleviate Florian's concern about keeping things in sync.

3. It will make it harder to consume the metadata when building SKBs. I
think the CPUMAP and veth use cases are also quite important, and that
we want metadata to be available for building SKBs in this path. Maybe
this can be resolved by having a convenient kfunc for this that can be
used for programs doing such redirects. E.g., you could just call
xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
would recursively expand into all the kfunc calls needed to extract the
metadata supported by the SKB path?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 15:28             ` Toke Høiland-Jørgensen
@ 2022-10-31 17:00               ` Stanislav Fomichev
  2022-10-31 22:57                 ` Martin KaFai Lau
  2022-10-31 19:36               ` Yonghong Song
  1 sibling, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-31 17:00 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, martin.lau, kpsingh,
	daniel, bpf, mtahhan, xdp-hints, netdev, jolsa, haoluo

On Mon, Oct 31, 2022 at 8:28 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>
> > Hi all,
> >
> > I was closely following this discussion for some time now. Seems we
> > reached the point where it's getting interesting for me.
> >
> > On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >> > > > And it's actually harder to abstract away inter HW generation
> >> > > > differences if the user space code has to handle all of it.
> >> >
> >> > I don't see how its any harder in practice though?
> >>
> >> You need to find out what HW/FW/config you're running, right?
> >> And all you have is a pointer to a blob of unknown type.
> >>
> >> Take timestamps for example, some NICs support adjusting the PHC
> >> or doing SW corrections (with different versions of hw/fw/server
> >> platforms being capable of both/one/neither).
> >>
> >> Sure you can extract all this info with tracing and careful
> >> inspection via uAPI. But I don't think that's _easier_.
> >> And the vendors can't run the results thru their validation
> >> (for whatever that's worth).
> >>
> >> > > I've had the same concern:
> >> > >
> >> > > Until we have some userspace library that abstracts all these details,
> >> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >> > > of data and I need to go through the code and see what particular type
> >> > > it represents for my particular device and how the data I need is
> >> > > represented there. There are also these "if this is device v1 -> use
> >> > > v1 descriptor format; if it's a v2->use this another struct; etc"
> >> > > complexities that we'll be pushing onto the users. With kfuncs, we put
> >> > > this burden on the driver developers, but I agree that the drawback
> >> > > here is that we actually have to wait for the implementations to catch
> >> > > up.
> >> >
> >> > I agree with everything there, you will get a blob of data and then
> >> > will need to know what field you want to read using BTF. But, we
> >> > already do this for BPF programs all over the place so its not a big
> >> > lift for us. All other BPF tracing/observability requires the same
> >> > logic. I think users of BPF in general perhaps XDP/tc are the only
> >> > place left to write BPF programs without thinking about BTF and
> >> > kernel data structures.
> >> >
> >> > But, with proposed kptr the complexity lives in userspace and can be
> >> > fixed, added, updated without having to bother with kernel updates, etc.
> >> > From my point of view of supporting Cilium its a win and much preferred
> >> > to having to deal with driver owners on all cloud vendors, distributions,
> >> > and so on.
> >> >
> >> > If vendor updates firmware with new fields I get those immediately.
> >>
> >> Conversely it's a valid concern that those who *do* actually update
> >> their kernel regularly will have more things to worry about.
> >>
> >> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >> > > programs might have to take a lot of other state into consideration
> >> > > when parsing the descriptors; all those details do seem like they
> >> > > belong to the driver code.
> >> >
> >> > I would prefer to avoid being stuck on requiring driver writers to
> >> > be involved. With just a kptr I can support the device and any
> >> > firwmare versions without requiring help.
> >>
> >> 1) where are you getting all those HW / FW specs :S
> >> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>
> >> > > Feel free to send it early with just a handful of drivers implemented;
> >> > > I'm more interested about bpf/af_xdp/user api story; if we have some
> >> > > nice sample/test case that shows how the metadata can be used, that
> >> > > might push us closer to the agreement on the best way to proceed.
> >> >
> >> > I'll try to do a intel and mlx implementation to get a cross section.
> >> > I have a good collection of nics here so should be able to show a
> >> > couple firmware versions. It could be fine I think to have the raw
> >> > kptr access and then also kfuncs for some things perhaps.
> >> >
> >> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
> >> > > > parsing to user space will indeed result in what you just said - major
> >> > > > vendors are supported and that's it.
> >> >
> >> > I'm not sure about why it would make it harder for new vendors? I think
> >> > the opposite,
> >>
> >> TBH I'm only replying to the email because of the above part :)
> >> I thought this would be self evident, but I guess our perspectives
> >> are different.
> >>
> >> Perhaps you look at it from the perspective of SW running on someone
> >> else's cloud, and being able to move to another cloud, without having
> >> to worry if feature X is available in xdp or just skb.
> >>
> >> I look at it from the perspective of maintaining a cloud, with people
> >> writing random XDP applications. If I swap a NIC from an incumbent to a
> >> (superior) startup, and cloud users are messing with raw descriptor -
> >> I'd need to go find every XDP program out there and make sure it
> >> understands the new descriptors.
> >
> > Here is another perspective:
> >
> > As an AF_XDP application developer I don't want to deal with the
> > underlying hardware in detail. I'd like to request a feature from the OS
> > (in this case rx/tx timestamping). If the feature is available I will
> > simply use it, if not I might have to work around it - maybe by falling
> > back to SW timestamping.
> >
> > All parts of my application (BPF program included) should not be
> > optimized/adjusted for all the different HW variants out there.
>
> Yes, absolutely agreed. Abstracting away those kinds of hardware
> differences is the whole *point* of having an OS/driver model. I.e.,
> it's what the kernel is there for! If people want to bypass that and get
> direct access to the hardware, they can already do that by using DPDK.
>
> So in other words, 100% agreed that we should not expect the BPF
> developers to deal with hardware details as would be required with a
> kptr-based interface.
>
> As for the kfunc-based interface, I think it shows some promise.
> Exposing a list of function names to retrieve individual metadata items
> instead of a struct layout is sorta comparable in terms of developer UI
> accessibility etc (IMO).
>
> There are three main drawbacks, AFAICT:
>
> 1. It requires driver developers to write and maintain the code that
> generates the unrolled BPF bytecode to access the metadata fields, which
> is a non-trivial amount of complexity. Maybe this can be abstracted away
> with some internal helpers though (like, e.g., a
> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> the required JMP/MOV/LDX instructions?

Right, I hope we can have some helpers to abstract the raw instructions.
I might need to try to implement the actual metadata fetching for some
real devices and see how well it works in practice.
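
To make that concrete, such a helper could be a thin wrapper around
the kernel's insn macros. A rough sketch (the register convention here
is a pure assumption; the real unroll API would have to define one):

#include <linux/filter.h>

/* Emit "R0 = *(u64 *)(R1 + offset)", assuming the unroll callback is
 * invoked with R1 pointing at the HW descriptor. Returns the number
 * of instructions written into insn[].
 */
static int bpf_xdp_metadata_copy_u64(struct bpf_insn *insn, s16 offset)
{
	insn[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, offset);
	return 1;
}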

> 2. AF_XDP programs won't be able to access the metadata without using a
> custom XDP program that calls the kfuncs and puts the data into the
> metadata area. We could solve this with some code in libxdp, though; if
> this code can be made generic enough (so it just dumps the available
> metadata functions from the running kernel at load time), it may be
> possible to make it generic enough that it will be forward-compatible
> with new versions of the kernel that add new fields, which should
> alleviate Florian's concern about keeping things in sync.

Good point. I had to convert to a custom program to use the kfuncs :-(
But your suggestion sounds good; maybe libxdp can accept some extra
info about at which offset the user would like to place the metadata
and the library can generate the required bytecode?
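
Something like this, maybe (a completely made-up libxdp API, just to
illustrate the shape):

struct xdp_meta_spec {
	__u32 kind;	/* e.g. XDP_META_RX_TIMESTAMP, hypothetical */
	__u32 offset;	/* where in data_meta the user wants it */
};

struct xdp_meta_spec spec[] = {
	{ XDP_META_RX_TIMESTAMP, 0 },
	{ XDP_META_RX_HASH, 8 },
};

prog_fd = libxdp_gen_meta_prog(ifindex, spec, 2);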

> 3. It will make it harder to consume the metadata when building SKBs. I
> think the CPUMAP and veth use cases are also quite important, and that
> we want metadata to be available for building SKBs in this path. Maybe
> this can be resolved by having a convenient kfunc for this that can be
> used for programs doing such redirects. E.g., you could just call
> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> would recursively expand into all the kfunc calls needed to extract the
> metadata supported by the SKB path?

So this xdp_copy_metadata_for_skb will create a metadata layout that
the kernel will be able to understand when converting back to skb?
IIUC, the xdp program will look something like the following:

if (xdp packet is to be consumed by af_xdp) {
  // do a bunch of bpf_xdp_metadata_<metadata> calls and
  // assemble your own metadata layout
  return bpf_redirect_map(xsk, ...);
} else {
  // the packet is to be consumed by the kernel
  xdp_copy_metadata_for_skb(ctx);
  return bpf_redirect(...);
}

Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
put some magic number in the first byte(s) of the metadata so the
kernel can check whether xdp_copy_metadata_for_skb has been called
previously (or maybe xdp_frame can carry this extra signal, idk).
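
E.g., something along these lines (layout and magic value made up for
illustration; the real thing would need to be an agreed-upon ABI):

#define XDP_SKB_METADATA_MAGIC	0xa1b2c3d4	/* made up */

struct xdp_skb_metadata {
	__u32 magic;		/* checked by the kernel skb path */
	__u32 rx_hash;
	__u64 rx_timestamp;
};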

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-31 14:20         ` Alexander Lobakin
  2022-10-31 14:29           ` Alexander Lobakin
@ 2022-10-31 17:00           ` Stanislav Fomichev
  2022-11-01 13:18             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-31 17:00 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Jesper Dangaard Brouer, Martin KaFai Lau, brouer, ast, daniel,
	andrii, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: Stanislav Fomichev <sdf@google.com>
> Date: Fri, 28 Oct 2022 11:46:14 -0700
>
> > On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> > >
> > >
> > > On 28/10/2022 08.22, Martin KaFai Lau wrote:
> > > > On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> > > >> Example on how the metadata is prepared from the BPF context
> > > >> and consumed by AF_XDP:
> > > >>
> > > >> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> > > >>    if not, I'm assuming verifier will remove this "if (0)" branch
> > > >> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> > > >>    the program has to bpf_xdp_adjust_meta+memcpy it;
> > > >>    maybe returning a pointer is better?
> > > >> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> > > >>    makes sure timestamp is not empty
> > > >> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> > > >>
> > > >> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > > >> Cc: Jakub Kicinski <kuba@kernel.org>
> > > >> Cc: Willem de Bruijn <willemb@google.com>
> > > >> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > > >> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > > >> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > > >> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > > >> Cc: Maryam Tahhan <mtahhan@redhat.com>
> > > >> Cc: xdp-hints@xdp-project.net
> > > >> Cc: netdev@vger.kernel.org
> > > >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > >> ---
> > > >>   .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> > > >>   tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> > > >>   2 files changed, 44 insertions(+), 1 deletion(-)
>
> [...]
>
> > > IMHO sizeof() should come from a struct describing data_meta area see:
> > >
> > > https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> >
> > I guess I should've used pointers for the return type instead, something like:
> >
> > extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >
> > {
> >    ...
> >     __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >     if (rx_timestamp) {
> >         bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
> >         __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
> >     }
> > }
> >
> > Does that look better?
>
> I guess it will then be resolved to a direct store, right?
> I mean, to smth like
>
>         if (rx_timestamp)
>                 *(u64 *)data_meta = *rx_timestamp;
>
> , where *rx_timestamp points directly to the Rx descriptor?

Right. I should've used that form from the beginning, that memcpy is
confusing :-(
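
Spelled out, the cleaner variant would look roughly like this
(untested sketch; it assumes the verifier keeps the kfunc's returned
pointer valid across bpf_xdp_adjust_meta):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx(struct xdp_md *ctx)
{
	__u64 *rx_timestamp, *data_meta;

	rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
	if (!rx_timestamp)
		return XDP_PASS;

	/* reserve 8 bytes in front of the frame */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp)))
		return XDP_PASS;

	data_meta = (void *)(long)ctx->data_meta;
	if ((void *)(data_meta + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	*data_meta = *rx_timestamp;	/* direct store, no memcpy */
	return XDP_PASS;
}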

> >
> > > >> +        if (ret != 0)
> > > >> +            return XDP_DROP;
> > > >> +
> > > >> +        data = (void *)(long)ctx->data;
> > > >> +        data_meta = (void *)(long)ctx->data_meta;
> > > >> +
> > > >> +        if (data_meta + sizeof(__u32) > data)
> > > >> +            return XDP_DROP;
> > > >> +
> > > >> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> > > >> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> > >
> > > So, this approach first stores hints on some other memory location, and
> > > then need to copy over information into data_meta area. That isn't good
> > > from a performance perspective.
> > >
> > > My idea is to store it in the final data_meta destination immediately.
> >
> > This approach doesn't have to store the hints in the other memory
> > location. xdp_buff->priv can point to the real hw descriptor and the
> > kfunc can have a bytecode that extracts the data from the hw
> > descriptors. For this particular RFC, we can think that 'skb' is that
> > hw descriptor for veth driver.
>
> I really do think intermediate stores can be avoided with this
> approach.
> Oh, and BTW, if we plan to use a particular Hint in the BPF program
> only, there's no need to place it in the metadata area at all, is
> there? You only want to get it in your code, so just retrieve it to
> the stack and that's it. data_meta is only for cases when you want
> hints to appear in AF_XDP.

Correct.

> > > Do notice that in my approach, the existing ethtool config setting and
> > > socket options (for timestamps) still apply.  Thus, each individual
> > > hardware hint are already configurable. Thus we already have a config
> > > interface. I do acknowledge, that in-case a feature is disabled it still
> > > takes up space in data_meta areas, but importantly it is NOT stored into
> > > the area (for performance reasons).
> >
> > That should be the case with this rfc as well, isn't it? Worst case
> > scenario, that kfunc bytecode can explicitly check ethtool options and
> > return false if it's disabled?
>
> (to Jesper)
>
> Once again, the Ethtool idea doesn't work. Let's say roughly 50% of
> frames are consumed by XDP and the other 50% go to the skb path and
> the stack. In the skb path, I want as much HW data as possible:
> checksums, hash and so on. Let's say in the XDP prog I want only the
> timestamp. What then? Disable everything but the stamp and kill the
> skb path? Enable everything and kill the XDP path?
> Stanislav's approach allows you to request only the fields you need
> from the BPF prog directly; I don't see any reason for adding one more
> layer of "oh no, I won't give you the checksum because it's disabled
> via Ethtool".
> Maybe I'm getting something wrong, pls explain then :P

Agree, good point.

> > > >> +    }
> > > >
> > > > Thanks for the patches.  I took a quick look at patch 1 and 2 but
> > > > haven't had a chance to look at the implementation details (eg.
> > > > KF_UNROLL...etc), yet.
> > > >
> > >
> > > Yes, thanks for the patches, even-though I don't agree with the
> > > approach, at-least until my concerns/use-case can be resolved.
> > > IMHO the best way to convince people is through code. So, thank you for
> > > the effort.  Hopefully we can use some of these ideas and I can also
> > > change/adjust my XDP-hints ideas to incorporate some of this :-)
> >
> > Thank you for the feedback as well, appreciate it!
> > Definitely, looking forward to a v2 from you with some more clarity on
> > how those btf ids are handled by the bpf/af_xdp side!
> >
> > > > Overall (with the example here) looks promising.  There is a lot of
> > > > flexibility on whether the xdp prog needs any hint at all, which hint it
> > > > needs, and how to store it.
> > > >
> > >
> > > I do see the advantage that XDP prog only populates metadata it needs.
> > > But how can we use/access this in __xdp_build_skb_from_frame() ?
> >
> > I don't think __xdp_build_skb_from_frame is automagically solved by
> > either proposal?
>
> It's solved in my approach[0], so that __xdp_build_skb_from_frame()
> automatically gets the skb fields populated with the metadata if
> available. But I always use a fixed generic structure, which can't
> compete with your series in terms of flexibility (but solves stuff
> like inter-vendor redirects and so on).
> So in general I feel like there should be 2 options for metadata for
> users:
>
> 1) I use one particular vendor and I always compile AF_XDP programs
>    from fresh source code. I need to read/write only fields I want
>    to. I'd go with kfunc or kptr here (but I don't think BPF progs
>    should parse descriptor formats on their own, so your unroll NDO
>    approach looks optimal for me for that case);
> 2) I use multiple vendors, pre-compiled AF_XDP programs or just old
>    source code, I use veth and/or cpumap. So it's sorta
>    back-forward-left-right-compatibility path. So here we could just
>    use a fixed structure.

For (2) I really like Toke's suggestion about some extra helper that
prepares the metadata that the kernel path will later on be able to
understand.
The only downside I see is that it has to be called explicitly, but if
we assume that libxdp can also abstract this detail, that doesn't sound
like a huge issue to me.

> > For this proposal, there has to be some expected kernel metadata
> > format that bpf programs will prepare and the kernel will understand?
> > Think of it like xdp_hints_common from your proposal; the program will
> > have to put together xdp_hints_skb into xdp metadata with the parts
> > that can be populated into skb by the kernel.
> >
> > For your btf ids proposal, it seems there has to be some extra kernel
> > code to parse all possible driver btf_if formats and copy the
> > metadata?
>
> That's why I define a "generic" struct, so that its consumers
> wouldn't have to if-else through a dozen possible IDs :P
>
> [...]
>
> Great stuff from my PoV, I'd probably like to have some helpers for
> writing this new NDO, so that small vendors wouldn't be afraid of
> implementing it as Jakub mentioned. But still sorta optimal and
> elegant for me, I'm not sure I want to post a "demo version" of my
> series anymore :D
> I feel like this way + one more "everything-compat-fixed" couple
> would satisfy most potential users.

Awesome, thanks for the review and the feedback!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-29  1:14         ` Jakub Kicinski
  2022-10-31 14:10           ` [xdp-hints] " Bezdeka, Florian
@ 2022-10-31 17:01           ` John Fastabend
  1 sibling, 0 replies; 50+ messages in thread
From: John Fastabend @ 2022-10-31 17:01 UTC (permalink / raw)
  To: Jakub Kicinski, John Fastabend
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, kpsingh, haoluo, jolsa, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Jakub Kicinski wrote:
> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> > > > And it's actually harder to abstract away inter HW generation
> > > > differences if the user space code has to handle all of it.  
> > 
> > I don't see how its any harder in practice though?
> 
> You need to find out what HW/FW/config you're running, right?
> And all you have is a pointer to a blob of unknown type.

Yep. I guess I'm in the position of already having to do this
somewhat to collect stats from the device. Although it's maybe
a bit more involved here with vendors that are versioning the
descriptors.

Also, nit: it's not an unknown type, we know the full type via BTF.

> 
> Take timestamps for example, some NICs support adjusting the PHC 
> or doing SW corrections (with different versions of hw/fw/server
> platforms being capable of both/one/neither).

It's worse, actually.

Having started to do this timestamping, it is not at all consistent
across NICs, so I think we are stuck having to know HW specifics
here regardless. Also, some NICs will timestamp all RX pkts, some
specific pkts, some require configuration to decide which mode
to run in, and so on. You then end up with a matrix of features
supported by hw/fw/sw and desired state, and I can't see any way
around this.

> 
> Sure you can extract all this info with tracing and careful
> inspection via uAPI. But I don't think that's _easier_.
> And the vendors can't run the results thru their validation 
> (for whatever that's worth).

I think you hit our point-of-view differences below. See, I
don't want to depend on the vendor. I want access to the
fields, otherwise I'm stuck working with vendors on their
time frames. You have the other perspective of supporting the
NIC and the ability to update kernels, whereas I still live with
4.18/4.19 kernels (even 4.14 sometimes). So what we land now
still needs to work in 5 years.

> 
> > > I've had the same concern:
> > > 
> > > Until we have some userspace library that abstracts all these details,
> > > it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> > > of data and I need to go through the code and see what particular type
> > > it represents for my particular device and how the data I need is
> > > represented there. There are also these "if this is device v1 -> use
> > > v1 descriptor format; if it's a v2->use this another struct; etc"
> > > complexities that we'll be pushing onto the users. With kfuncs, we put
> > > this burden on the driver developers, but I agree that the drawback
> > > here is that we actually have to wait for the implementations to catch
> > > up.  
> > 
> > I agree with everything there, you will get a blob of data and then
> > will need to know what field you want to read using BTF. But, we
> > already do this for BPF programs all over the place so its not a big
> > lift for us. All other BPF tracing/observability requires the same
> > logic. I think users of BPF in general perhaps XDP/tc are the only
> > place left to write BPF programs without thinking about BTF and
> > kernel data structures.
> > 
> > But, with proposed kptr the complexity lives in userspace and can be
> > fixed, added, updated without having to bother with kernel updates, etc.
> > From my point of view of supporting Cilium its a win and much preferred
> > to having to deal with driver owners on all cloud vendors, distributions,
> > and so on.
> > 
> > If vendor updates firmware with new fields I get those immediately.
> 
> Conversely it's a valid concern that those who *do* actually update
> their kernel regularly will have more things to worry about.

I'm not sure a kptr_func is any harder to write than a user space
relocation for that func? In tetragon and cilium we've done userspace
rewrites for some time. Happy to generalize that infra into the kernel
repo if that helps. IMO having a libhw.h in the kernel tree ./tools/bpf
directory would work.
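
E.g., a libhw.h accessor could boil down to CO-RE reads against the
descriptor's BTF. A hypothetical sketch (struct and field picked for
illustration only; byte-order handling omitted):

#include <vmlinux.h>
#include <bpf/bpf_core_read.h>

/* Read the Rx timestamp out of a BTF-typed descriptor pointer; CO-RE
 * relocations keep this working across layout revisions as long as
 * the field name survives.
 */
static __always_inline __u64 hw_rx_timestamp(struct mlx5_cqe64 *cqe)
{
	return cqe ? BPF_CORE_READ(cqe, timestamp) : 0;
}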

> 
> > > Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> > > programs might have to take a lot of other state into consideration
> > > when parsing the descriptors; all those details do seem like they
> > > belong to the driver code.  
> > 
> > I would prefer to avoid being stuck on requiring driver writers to
> > be involved. With just a kptr I can support the device and any
> > firwmare versions without requiring help.
> 
> 1) where are you getting all those HW / FW specs :S

Most are public docs; of course vendors have internal docs with more
details, but what can you do. Also, the source code has the structs.

> 2) maybe *you* can but you're not exactly not an ex-driver developer :S

Sure :) but we put a libhw.h file in the kernel tree and test with
selftests (which will be hard without hardware), and then not everyone
needs to be a driver-internals expert.

> 
> > > Feel free to send it early with just a handful of drivers implemented;
> > > I'm more interested about bpf/af_xdp/user api story; if we have some
> > > nice sample/test case that shows how the metadata can be used, that
> > > might push us closer to the agreement on the best way to proceed.  
> > 
> > I'll try to do a intel and mlx implementation to get a cross section.
> > I have a good collection of nics here so should be able to show a
> > couple firmware versions. It could be fine I think to have the raw
> > kptr access and then also kfuncs for some things perhaps.
> > 
> > > > I'd prefer if we left the door open for new vendors. Punting descriptor
> > > > parsing to user space will indeed result in what you just said - major
> > > > vendors are supported and that's it.  
> > 
> > I'm not sure about why it would make it harder for new vendors? I think
> > the opposite, 
> 
> TBH I'm only replying to the email because of the above part :)
> I thought this would be self evident, but I guess our perspectives 
> are different.

Yep.

> 
> Perhaps you look at it from the perspective of SW running on someone
> else's cloud, and being able to move to another cloud, without having
> to worry if feature X is available in xdp or just skb.

Exactly. I have SW running in a data center or cloud for a security
team or ops team, and they usually don't own the platform. Anyway,
the platform team is most likely going to stay on an LTS kernel for
at least a year or two. I maintain the SW, and some 3rd-party
NIC vendor may not even know about my SW (cilium/tetragon).

> 
> I look at it from the perspective of maintaining a cloud, with people
> writing random XDP applications. If I swap a NIC from an incumbent to a
> (superior) startup, and cloud users are messing with raw descriptor -
> I'd need to go find every XDP program out there and make sure it
> understands the new descriptors.

I get it. It's interesting that you wouldn't tell the XDP programmers
to deal with it, which is my case. My $.02 is a userspace lib could
abstract this more easily than a kernel func and also add new features
without rolling new kernels.

> 
> There is a BPF foundation or whatnot now - what about starting a
> certification program for cloud providers and making it clear what
> features must be supported to be compatible with XDP 1.0, XDP 2.0 etc?

Maybe, but we're still stuck on kernel versions.

> 
> > it would be easier because I don't need vendor support at all.
> 
> Can you support the enfabrica NIC on day 1? :) To an extent, it's just
> shifting the responsibility from the HW vendor to the middleware vendor.

Yep. With the important detail that I can run new features on old
kernels even if the vendor didn't add rxhash or timestamp out of the
gate.

Nothing stops the HW vendor from contributing the features to this
in-kernel-source BPF lib, though.

> 
> > Thinking it over seems there could be room for both.
> 
> Are you thinking more or less Stan's proposal but with one of 
> the callbacks being "give me the raw thing"? Probably as a ro dynptr?
> Possible, but I don't think we need to hold off Stan's work.

Yeah, that was my thinking. Both could coexist. OTOH doing it on the
BPF program lib side seems cleaner to me from a kernel-maintenance
and hardware-support standpoint. We've had trouble getting drivers to
support XDP features, so adding more requirements for entry seems
problematic to me when we can avoid it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 15:28             ` Toke Høiland-Jørgensen
  2022-10-31 17:00               ` Stanislav Fomichev
@ 2022-10-31 19:36               ` Yonghong Song
  2022-10-31 22:09                 ` Stanislav Fomichev
  1 sibling, 1 reply; 50+ messages in thread
From: Yonghong Song @ 2022-10-31 19:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba, john.fastabend
  Cc: alexandr.lobakin, anatoly.burakov, sdf, song, Deric, Nemanja,
	andrii, Kiszka, Jan, magnus.karlsson, willemb, ast, brouer, yhs,
	martin.lau, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo



On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> 
>> Hi all,
>>
>> I was closely following this discussion for some time now. Seems we
>> reached the point where it's getting interesting for me.
>>
>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>> differences if the user space code has to handle all of it.
>>>>
>>>> I don't see how its any harder in practice though?
>>>
>>> You need to find out what HW/FW/config you're running, right?
>>> And all you have is a pointer to a blob of unknown type.
>>>
>>> Take timestamps for example, some NICs support adjusting the PHC
>>> or doing SW corrections (with different versions of hw/fw/server
>>> platforms being capable of both/one/neither).
>>>
>>> Sure you can extract all this info with tracing and careful
>>> inspection via uAPI. But I don't think that's _easier_.
>>> And the vendors can't run the results thru their validation
>>> (for whatever that's worth).
>>>
>>>>> I've had the same concern:
>>>>>
>>>>> Until we have some userspace library that abstracts all these details,
>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>> of data and I need to go through the code and see what particular type
>>>>> it represents for my particular device and how the data I need is
>>>>> represented there. There are also these "if this is device v1 -> use
>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>> this burden on the driver developers, but I agree that the drawback
>>>>> here is that we actually have to wait for the implementations to catch
>>>>> up.
>>>>
>>>> I agree with everything there, you will get a blob of data and then
>>>> will need to know what field you want to read using BTF. But, we
>>>> already do this for BPF programs all over the place so its not a big
>>>> lift for us. All other BPF tracing/observability requires the same
>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>> place left to write BPF programs without thinking about BTF and
>>>> kernel data structures.
>>>>
>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>  From my point of view of supporting Cilium its a win and much preferred
>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>> and so on.
>>>>
>>>> If vendor updates firmware with new fields I get those immediately.
>>>
>>> Conversely it's a valid concern that those who *do* actually update
>>> their kernel regularly will have more things to worry about.
>>>
>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>> programs might have to take a lot of other state into consideration
>>>>> when parsing the descriptors; all those details do seem like they
>>>>> belong to the driver code.
>>>>
>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>> be involved. With just a kptr I can support the device and any
>>>> firwmare versions without requiring help.
>>>
>>> 1) where are you getting all those HW / FW specs :S
>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>
>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>> might push us closer to the agreement on the best way to proceed.
>>>>
>>>> I'll try to do a intel and mlx implementation to get a cross section.
>>>> I have a good collection of nics here so should be able to show a
>>>> couple firmware versions. It could be fine I think to have the raw
>>>> kptr access and then also kfuncs for some things perhaps.
>>>>
>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>> vendors are supported and that's it.
>>>>
>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>> the opposite,
>>>
>>> TBH I'm only replying to the email because of the above part :)
>>> I thought this would be self evident, but I guess our perspectives
>>> are different.
>>>
>>> Perhaps you look at it from the perspective of SW running on someone
>>> else's cloud, and being able to move to another cloud, without having
>>> to worry if feature X is available in xdp or just skb.
>>>
>>> I look at it from the perspective of maintaining a cloud, with people
>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>> (superior) startup, and cloud users are messing with raw descriptor -
>>> I'd need to go find every XDP program out there and make sure it
>>> understands the new descriptors.
>>
>> Here is another perspective:
>>
>> As an AF_XDP application developer I don't want to deal with the
>> underlying hardware in detail. I'd like to request a feature from the OS
>> (in this case rx/tx timestamping). If the feature is available I will
>> simply use it, if not I might have to work around it - maybe by falling
>> back to SW timestamping.
>>
>> All parts of my application (BPF program included) should not be
>> optimized/adjusted for all the different HW variants out there.
> 
> Yes, absolutely agreed. Abstracting away those kinds of hardware
> differences is the whole *point* of having an OS/driver model. I.e.,
> it's what the kernel is there for! If people want to bypass that and get
> direct access to the hardware, they can already do that by using DPDK.
> 
> So in other words, 100% agreed that we should not expect the BPF
> developers to deal with hardware details as would be required with a
> kptr-based interface.
> 
> As for the kfunc-based interface, I think it shows some promise.
> Exposing a list of function names to retrieve individual metadata items
> instead of a struct layout is sorta comparable in terms of developer UI
> accessibility etc (IMO).

Looks like there are quite a few use cases for hw_timestamp.
Do you think we could add it to the uapi, e.g. to struct xdp_md?

The following is the current xdp_md:
struct xdp_md {
         __u32 data;
         __u32 data_end;
         __u32 data_meta;
         /* Below access go through struct xdp_rxq_info */
         __u32 ingress_ifindex; /* rxq->dev->ifindex */
         __u32 rx_queue_index;  /* rxq->queue_index  */

         __u32 egress_ifindex;  /* txq->dev->ifindex */
};

We could add __u64 hw_timestamp to xdp_md so the user
can just do xdp_md->hw_timestamp to get the value.
xdp_md->hw_timestamp == 0 means the hw_timestamp is not
available.
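
The program side would then be trivial, something like (sketch,
assuming the field gets added to the uapi):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int prog(struct xdp_md *ctx)
{
	__u64 ts = ctx->hw_timestamp;	/* 0 if not available */

	if (ts) {
		/* use the HW timestamp */
	}
	return XDP_PASS;
}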

Inside the kernel, the ctx rewriter can generate code
to call a driver-specific function to retrieve the data.

The kfunc approach could then be reserved for *less* common use cases?

> 
> There are three main drawbacks, AFAICT:
> 
> 1. It requires driver developers to write and maintain the code that
> generates the unrolled BPF bytecode to access the metadata fields, which
> is a non-trivial amount of complexity. Maybe this can be abstracted away
> with some internal helpers though (like, e.g., a
> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> the required JMP/MOV/LDX instructions?
> 
> 2. AF_XDP programs won't be able to access the metadata without using a
> custom XDP program that calls the kfuncs and puts the data into the
> metadata area. We could solve this with some code in libxdp, though; if
> this code can be made generic enough (so it just dumps the available
> metadata functions from the running kernel at load time), it may be
> possible to make it generic enough that it will be forward-compatible
> with new versions of the kernel that add new fields, which should
> alleviate Florian's concern about keeping things in sync.
> 
> 3. It will make it harder to consume the metadata when building SKBs. I
> think the CPUMAP and veth use cases are also quite important, and that
> we want metadata to be available for building SKBs in this path. Maybe
> this can be resolved by having a convenient kfunc for this that can be
> used for programs doing such redirects. E.g., you could just call
> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> would recursively expand into all the kfunc calls needed to extract the
> metadata supported by the SKB path?
> 
> -Toke
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 19:36               ` Yonghong Song
@ 2022-10-31 22:09                 ` Stanislav Fomichev
  2022-10-31 22:38                   ` Yonghong Song
  2022-11-01 17:31                   ` Martin KaFai Lau
  0 siblings, 2 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-31 22:09 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast,
	brouer, yhs, martin.lau, kpsingh, daniel, bpf, mtahhan,
	xdp-hints, netdev, jolsa, haoluo

On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>
>
>
> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> > "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> >
> >> Hi all,
> >>
> >> I was closely following this discussion for some time now. Seems we
> >> reached the point where it's getting interesting for me.
> >>
> >> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >>>>>> And it's actually harder to abstract away inter HW generation
> >>>>>> differences if the user space code has to handle all of it.
> >>>>
> >>>> I don't see how its any harder in practice though?
> >>>
> >>> You need to find out what HW/FW/config you're running, right?
> >>> And all you have is a pointer to a blob of unknown type.
> >>>
> >>> Take timestamps for example, some NICs support adjusting the PHC
> >>> or doing SW corrections (with different versions of hw/fw/server
> >>> platforms being capable of both/one/neither).
> >>>
> >>> Sure you can extract all this info with tracing and careful
> >>> inspection via uAPI. But I don't think that's _easier_.
> >>> And the vendors can't run the results thru their validation
> >>> (for whatever that's worth).
> >>>
> >>>>> I've had the same concern:
> >>>>>
> >>>>> Until we have some userspace library that abstracts all these details,
> >>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >>>>> of data and I need to go through the code and see what particular type
> >>>>> it represents for my particular device and how the data I need is
> >>>>> represented there. There are also these "if this is device v1 -> use
> >>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
> >>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
> >>>>> this burden on the driver developers, but I agree that the drawback
> >>>>> here is that we actually have to wait for the implementations to catch
> >>>>> up.
> >>>>
> >>>> I agree with everything there, you will get a blob of data and then
> >>>> will need to know what field you want to read using BTF. But, we
> >>>> already do this for BPF programs all over the place so its not a big
> >>>> lift for us. All other BPF tracing/observability requires the same
> >>>> logic. I think users of BPF in general perhaps XDP/tc are the only
> >>>> place left to write BPF programs without thinking about BTF and
> >>>> kernel data structures.
> >>>>
> >>>> But, with proposed kptr the complexity lives in userspace and can be
> >>>> fixed, added, updated without having to bother with kernel updates, etc.
> >>>>  From my point of view of supporting Cilium its a win and much preferred
> >>>> to having to deal with driver owners on all cloud vendors, distributions,
> >>>> and so on.
> >>>>
> >>>> If vendor updates firmware with new fields I get those immediately.
> >>>
> >>> Conversely it's a valid concern that those who *do* actually update
> >>> their kernel regularly will have more things to worry about.
> >>>
> >>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >>>>> programs might have to take a lot of other state into consideration
> >>>>> when parsing the descriptors; all those details do seem like they
> >>>>> belong to the driver code.
> >>>>
> >>>> I would prefer to avoid being stuck on requiring driver writers to
> >>>> be involved. With just a kptr I can support the device and any
> >>>> firwmare versions without requiring help.
> >>>
> >>> 1) where are you getting all those HW / FW specs :S
> >>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>>
> >>>>> Feel free to send it early with just a handful of drivers implemented;
> >>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
> >>>>> nice sample/test case that shows how the metadata can be used, that
> >>>>> might push us closer to the agreement on the best way to proceed.
> >>>>
> >>>> I'll try to do a intel and mlx implementation to get a cross section.
> >>>> I have a good collection of nics here so should be able to show a
> >>>> couple firmware versions. It could be fine I think to have the raw
> >>>> kptr access and then also kfuncs for some things perhaps.
> >>>>
> >>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
> >>>>>> parsing to user space will indeed result in what you just said - major
> >>>>>> vendors are supported and that's it.
> >>>>
> >>>> I'm not sure about why it would make it harder for new vendors? I think
> >>>> the opposite,
> >>>
> >>> TBH I'm only replying to the email because of the above part :)
> >>> I thought this would be self evident, but I guess our perspectives
> >>> are different.
> >>>
> >>> Perhaps you look at it from the perspective of SW running on someone
> >>> else's cloud, an being able to move to another cloud, without having
> >>> to worry if feature X is available in xdp or just skb.
> >>>
> >>> I look at it from the perspective of maintaining a cloud, with people
> >>> writing random XDP applications. If I swap a NIC from an incumbent to a
> >>> (superior) startup, and cloud users are messing with raw descriptor -
> >>> I'd need to go find every XDP program out there and make sure it
> >>> understands the new descriptors.
> >>
> >> Here is another perspective:
> >>
> >> As AF_XDP application developer I don't wan't to deal with the
> >> underlying hardware in detail. I like to request a feature from the OS
> >> (in this case rx/tx timestamping). If the feature is available I will
> >> simply use it, if not I might have to work around it - maybe by falling
> >> back to SW timestamping.
> >>
> >> All parts of my application (BPF program included) should not be
> >> optimized/adjusted for all the different HW variants out there.
> >
> > Yes, absolutely agreed. Abstracting away those kinds of hardware
> > differences is the whole *point* of having an OS/driver model. I.e.,
> > it's what the kernel is there for! If people want to bypass that and get
> > direct access to the hardware, they can already do that by using DPDK.
> >
> > So in other words, 100% agreed that we should not expect the BPF
> > developers to deal with hardware details as would be required with a
> > kptr-based interface.
> >
> > As for the kfunc-based interface, I think it shows some promise.
> > Exposing a list of function names to retrieve individual metadata items
> > instead of a struct layout is sorta comparable in terms of developer UI
> > accessibility etc (IMO).
>
> Looks like there are quite some use cases for hw_timestamp.
> Do you think we could add it to the uapi like struct xdp_md?
>
> The following is the current xdp_md:
> struct xdp_md {
>          __u32 data;
>          __u32 data_end;
>          __u32 data_meta;
>          /* Below access go through struct xdp_rxq_info */
>          __u32 ingress_ifindex; /* rxq->dev->ifindex */
>          __u32 rx_queue_index;  /* rxq->queue_index  */
>
>          __u32 egress_ifindex;  /* txq->dev->ifindex */
> };
>
> We could add  __u64 hw_timestamp to the xdp_md so user
> can just do xdp_md->hw_timestamp to get the value.
> xdp_md->hw_timestamp == 0 means hw_timestamp is not
> available.
>
> Inside the kernel, the ctx rewriter can generate code
> to call driver specific function to retrieve the data.

If the driver generates the code to retrieve the data, how's that
different from the kfunc approach?
The only difference I see is that it would be a stronger UAPI than
the kfuncs?

> The kfunc approach can be used to *less* common use cases?

What's the advantage of having two approaches when one can cover
common and uncommon cases?
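
To put the two side by side (sketch only; an xdp_md hw_timestamp field
doesn't exist today, and the kfunc is the copy-returning variant from
this RFC):

        extern __u64 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;

        /* (a) uapi approach: hypothetical ctx field; the ctx rewriter
         * would turn this load into driver-generated instructions
         */
        __u64 uapi_ts = ctx->hw_timestamp;

        /* (b) kfunc approach (this RFC): an explicit call that gets
         * unrolled into the same driver-generated instructions at
         * load time
         */
        __u64 kfunc_ts = bpf_xdp_metadata_rx_timestamp(ctx);

The generated code should be identical either way; the difference is
only whether the field name becomes part of the uapi contract.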

> > There are three main drawbacks, AFAICT:
> >
> > 1. It requires driver developers to write and maintain the code that
> > generates the unrolled BPF bytecode to access the metadata fields, which
> > is a non-trivial amount of complexity. Maybe this can be abstracted away
> > with some internal helpers though (like, e.g., a
> > bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> > the required JMP/MOV/LDX instructions?
> >
> > 2. AF_XDP programs won't be able to access the metadata without using a
> > custom XDP program that calls the kfuncs and puts the data into the
> > metadata area. We could solve this with some code in libxdp, though; if
> > this code can be made generic enough (so it just dumps the available
> > metadata functions from the running kernel at load time), it may be
> > possible to make it generic enough that it will be forward-compatible
> > with new versions of the kernel that add new fields, which should
> > alleviate Florian's concern about keeping things in sync.
> >
> > 3. It will make it harder to consume the metadata when building SKBs. I
> > think the CPUMAP and veth use cases are also quite important, and that
> > we want metadata to be available for building SKBs in this path. Maybe
> > this can be resolved by having a convenient kfunc for this that can be
> > used for programs doing such redirects. E.g., you could just call
> > xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> > would recursively expand into all the kfunc calls needed to extract the
> > metadata supported by the SKB path?
> >
> > -Toke
> >

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 22:09                 ` Stanislav Fomichev
@ 2022-10-31 22:38                   ` Yonghong Song
  2022-10-31 22:55                     ` Stanislav Fomichev
  2022-11-01 17:31                   ` Martin KaFai Lau
  1 sibling, 1 reply; 50+ messages in thread
From: Yonghong Song @ 2022-10-31 22:38 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast,
	brouer, yhs, martin.lau, kpsingh, daniel, bpf, mtahhan,
	xdp-hints, netdev, jolsa, haoluo



On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>
>>
>>
>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>>>
>>>> Hi all,
>>>>
>>>> I was closely following this discussion for some time now. Seems we
>>>> reached the point where it's getting interesting for me.
>>>>
>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>>>> differences if the user space code has to handle all of it.
>>>>>>
>>>>>> I don't see how its any harder in practice though?
>>>>>
>>>>> You need to find out what HW/FW/config you're running, right?
>>>>> And all you have is a pointer to a blob of unknown type.
>>>>>
>>>>> Take timestamps for example, some NICs support adjusting the PHC
>>>>> or doing SW corrections (with different versions of hw/fw/server
>>>>> platforms being capable of both/one/neither).
>>>>>
>>>>> Sure you can extract all this info with tracing and careful
>>>>> inspection via uAPI. But I don't think that's _easier_.
>>>>> And the vendors can't run the results thru their validation
>>>>> (for whatever that's worth).
>>>>>
>>>>>>> I've had the same concern:
>>>>>>>
>>>>>>> Until we have some userspace library that abstracts all these details,
>>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>>>> of data and I need to go through the code and see what particular type
>>>>>>> it represents for my particular device and how the data I need is
>>>>>>> represented there. There are also these "if this is device v1 -> use
>>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>>>> this burden on the driver developers, but I agree that the drawback
>>>>>>> here is that we actually have to wait for the implementations to catch
>>>>>>> up.
>>>>>>
>>>>>> I agree with everything there, you will get a blob of data and then
>>>>>> will need to know what field you want to read using BTF. But, we
>>>>>> already do this for BPF programs all over the place so its not a big
>>>>>> lift for us. All other BPF tracing/observability requires the same
>>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>>>> place left to write BPF programs without thinking about BTF and
>>>>>> kernel data structures.
>>>>>>
>>>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>>>   From my point of view of supporting Cilium its a win and much preferred
>>>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>>>> and so on.
>>>>>>
>>>>>> If vendor updates firmware with new fields I get those immediately.
>>>>>
>>>>> Conversely it's a valid concern that those who *do* actually update
>>>>> their kernel regularly will have more things to worry about.
>>>>>
>>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>>>> programs might have to take a lot of other state into consideration
>>>>>>> when parsing the descriptors; all those details do seem like they
>>>>>>> belong to the driver code.
>>>>>>
>>>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>>>> be involved. With just a kptr I can support the device and any
>>>>>> firwmare versions without requiring help.
>>>>>
>>>>> 1) where are you getting all those HW / FW specs :S
>>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>>>
>>>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>>>> might push us closer to the agreement on the best way to proceed.
>>>>>>
>>>>>> I'll try to do a intel and mlx implementation to get a cross section.
>>>>>> I have a good collection of nics here so should be able to show a
>>>>>> couple firmware versions. It could be fine I think to have the raw
>>>>>> kptr access and then also kfuncs for some things perhaps.
>>>>>>
>>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>>>> vendors are supported and that's it.
>>>>>>
>>>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>>>> the opposite,
>>>>>
>>>>> TBH I'm only replying to the email because of the above part :)
>>>>> I thought this would be self evident, but I guess our perspectives
>>>>> are different.
>>>>>
>>>>> Perhaps you look at it from the perspective of SW running on someone
>>>>> else's cloud, an being able to move to another cloud, without having
>>>>> to worry if feature X is available in xdp or just skb.
>>>>>
>>>>> I look at it from the perspective of maintaining a cloud, with people
>>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>>>> (superior) startup, and cloud users are messing with raw descriptor -
>>>>> I'd need to go find every XDP program out there and make sure it
>>>>> understands the new descriptors.
>>>>
>>>> Here is another perspective:
>>>>
>>>> As AF_XDP application developer I don't wan't to deal with the
>>>> underlying hardware in detail. I like to request a feature from the OS
>>>> (in this case rx/tx timestamping). If the feature is available I will
>>>> simply use it, if not I might have to work around it - maybe by falling
>>>> back to SW timestamping.
>>>>
>>>> All parts of my application (BPF program included) should not be
>>>> optimized/adjusted for all the different HW variants out there.
>>>
>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>> it's what the kernel is there for! If people want to bypass that and get
>>> direct access to the hardware, they can already do that by using DPDK.
>>>
>>> So in other words, 100% agreed that we should not expect the BPF
>>> developers to deal with hardware details as would be required with a
>>> kptr-based interface.
>>>
>>> As for the kfunc-based interface, I think it shows some promise.
>>> Exposing a list of function names to retrieve individual metadata items
>>> instead of a struct layout is sorta comparable in terms of developer UI
>>> accessibility etc (IMO).
>>
>> Looks like there are quite some use cases for hw_timestamp.
>> Do you think we could add it to the uapi like struct xdp_md?
>>
>> The following is the current xdp_md:
>> struct xdp_md {
>>           __u32 data;
>>           __u32 data_end;
>>           __u32 data_meta;
>>           /* Below access go through struct xdp_rxq_info */
>>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>           __u32 rx_queue_index;  /* rxq->queue_index  */
>>
>>           __u32 egress_ifindex;  /* txq->dev->ifindex */
>> };
>>
>> We could add  __u64 hw_timestamp to the xdp_md so user
>> can just do xdp_md->hw_timestamp to get the value.
>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>> available.
>>
>> Inside the kernel, the ctx rewriter can generate code
>> to call driver specific function to retrieve the data.
> 
> If the driver generates the code to retrieve the data, how's that
> different from the kfunc approach?
> The only difference I see is that it would be a more strong UAPI than
> the kfuncs?

Right. It is a strong uapi.

> 
>> The kfunc approach can be used to *less* common use cases?
> 
> What's the advantage of having two approaches when one can cover
> common and uncommon cases?

Beyond hw_timestamp, do we have any other fields ready to support?

If it ends up with lots of fields to be accessed by the bpf program,
and the bpf program actually intends to access these fields,
using a strong uapi might be a good thing, as it can make the code
much more streamlined.
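
E.g., something like below (purely illustrative; only hw_timestamp has
come up so far, the other two fields are made-up placeholders):

        struct xdp_md {
                __u32 data;
                __u32 data_end;
                __u32 data_meta;
                /* Below access go through struct xdp_rxq_info */
                __u32 ingress_ifindex; /* rxq->dev->ifindex */
                __u32 rx_queue_index;  /* rxq->queue_index  */

                __u32 egress_ifindex;  /* txq->dev->ifindex */

                /* hypothetical additions, 0 == "not available" */
                __u64 hw_timestamp;
                __u32 hw_rx_hash;
                __u32 hw_csum_status;
        };

and the bpf program side stays trivial:

        if (ctx->hw_timestamp) {
                /* use it */
        }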

> 
>>> There are three main drawbacks, AFAICT:
>>>
>>> 1. It requires driver developers to write and maintain the code that
>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>> with some internal helpers though (like, e.g., a
>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>> the required JMP/MOV/LDX instructions?
>>>
>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>> custom XDP program that calls the kfuncs and puts the data into the
>>> metadata area. We could solve this with some code in libxdp, though; if
>>> this code can be made generic enough (so it just dumps the available
>>> metadata functions from the running kernel at load time), it may be
>>> possible to make it generic enough that it will be forward-compatible
>>> with new versions of the kernel that add new fields, which should
>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>> think the CPUMAP and veth use cases are also quite important, and that
>>> we want metadata to be available for building SKBs in this path. Maybe
>>> this can be resolved by having a convenient kfunc for this that can be
>>> used for programs doing such redirects. E.g., you could just call
>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>> would recursively expand into all the kfunc calls needed to extract the
>>> metadata supported by the SKB path?
>>>
>>> -Toke
>>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 22:38                   ` Yonghong Song
@ 2022-10-31 22:55                     ` Stanislav Fomichev
  2022-11-01 14:23                       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-10-31 22:55 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast,
	brouer, yhs, martin.lau, kpsingh, daniel, bpf, mtahhan,
	xdp-hints, netdev, jolsa, haoluo

On Mon, Oct 31, 2022 at 3:38 PM Yonghong Song <yhs@meta.com> wrote:
>
>
>
> On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
> >>
> >>
> >>
> >> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> >>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I was closely following this discussion for some time now. Seems we
> >>>> reached the point where it's getting interesting for me.
> >>>>
> >>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >>>>>>>> And it's actually harder to abstract away inter HW generation
> >>>>>>>> differences if the user space code has to handle all of it.
> >>>>>>
> >>>>>> I don't see how its any harder in practice though?
> >>>>>
> >>>>> You need to find out what HW/FW/config you're running, right?
> >>>>> And all you have is a pointer to a blob of unknown type.
> >>>>>
> >>>>> Take timestamps for example, some NICs support adjusting the PHC
> >>>>> or doing SW corrections (with different versions of hw/fw/server
> >>>>> platforms being capable of both/one/neither).
> >>>>>
> >>>>> Sure you can extract all this info with tracing and careful
> >>>>> inspection via uAPI. But I don't think that's _easier_.
> >>>>> And the vendors can't run the results thru their validation
> >>>>> (for whatever that's worth).
> >>>>>
> >>>>>>> I've had the same concern:
> >>>>>>>
> >>>>>>> Until we have some userspace library that abstracts all these details,
> >>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >>>>>>> of data and I need to go through the code and see what particular type
> >>>>>>> it represents for my particular device and how the data I need is
> >>>>>>> represented there. There are also these "if this is device v1 -> use
> >>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
> >>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
> >>>>>>> this burden on the driver developers, but I agree that the drawback
> >>>>>>> here is that we actually have to wait for the implementations to catch
> >>>>>>> up.
> >>>>>>
> >>>>>> I agree with everything there, you will get a blob of data and then
> >>>>>> will need to know what field you want to read using BTF. But, we
> >>>>>> already do this for BPF programs all over the place so its not a big
> >>>>>> lift for us. All other BPF tracing/observability requires the same
> >>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
> >>>>>> place left to write BPF programs without thinking about BTF and
> >>>>>> kernel data structures.
> >>>>>>
> >>>>>> But, with proposed kptr the complexity lives in userspace and can be
> >>>>>> fixed, added, updated without having to bother with kernel updates, etc.
> >>>>>>   From my point of view of supporting Cilium its a win and much preferred
> >>>>>> to having to deal with driver owners on all cloud vendors, distributions,
> >>>>>> and so on.
> >>>>>>
> >>>>>> If vendor updates firmware with new fields I get those immediately.
> >>>>>
> >>>>> Conversely it's a valid concern that those who *do* actually update
> >>>>> their kernel regularly will have more things to worry about.
> >>>>>
> >>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >>>>>>> programs might have to take a lot of other state into consideration
> >>>>>>> when parsing the descriptors; all those details do seem like they
> >>>>>>> belong to the driver code.
> >>>>>>
> >>>>>> I would prefer to avoid being stuck on requiring driver writers to
> >>>>>> be involved. With just a kptr I can support the device and any
> >>>>>> firwmare versions without requiring help.
> >>>>>
> >>>>> 1) where are you getting all those HW / FW specs :S
> >>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>>>>
> >>>>>>> Feel free to send it early with just a handful of drivers implemented;
> >>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
> >>>>>>> nice sample/test case that shows how the metadata can be used, that
> >>>>>>> might push us closer to the agreement on the best way to proceed.
> >>>>>>
> >>>>>> I'll try to do a intel and mlx implementation to get a cross section.
> >>>>>> I have a good collection of nics here so should be able to show a
> >>>>>> couple firmware versions. It could be fine I think to have the raw
> >>>>>> kptr access and then also kfuncs for some things perhaps.
> >>>>>>
> >>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
> >>>>>>>> parsing to user space will indeed result in what you just said - major
> >>>>>>>> vendors are supported and that's it.
> >>>>>>
> >>>>>> I'm not sure about why it would make it harder for new vendors? I think
> >>>>>> the opposite,
> >>>>>
> >>>>> TBH I'm only replying to the email because of the above part :)
> >>>>> I thought this would be self evident, but I guess our perspectives
> >>>>> are different.
> >>>>>
> >>>>> Perhaps you look at it from the perspective of SW running on someone
> >>>>> else's cloud, an being able to move to another cloud, without having
> >>>>> to worry if feature X is available in xdp or just skb.
> >>>>>
> >>>>> I look at it from the perspective of maintaining a cloud, with people
> >>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
> >>>>> (superior) startup, and cloud users are messing with raw descriptor -
> >>>>> I'd need to go find every XDP program out there and make sure it
> >>>>> understands the new descriptors.
> >>>>
> >>>> Here is another perspective:
> >>>>
> >>>> As AF_XDP application developer I don't wan't to deal with the
> >>>> underlying hardware in detail. I like to request a feature from the OS
> >>>> (in this case rx/tx timestamping). If the feature is available I will
> >>>> simply use it, if not I might have to work around it - maybe by falling
> >>>> back to SW timestamping.
> >>>>
> >>>> All parts of my application (BPF program included) should not be
> >>>> optimized/adjusted for all the different HW variants out there.
> >>>
> >>> Yes, absolutely agreed. Abstracting away those kinds of hardware
> >>> differences is the whole *point* of having an OS/driver model. I.e.,
> >>> it's what the kernel is there for! If people want to bypass that and get
> >>> direct access to the hardware, they can already do that by using DPDK.
> >>>
> >>> So in other words, 100% agreed that we should not expect the BPF
> >>> developers to deal with hardware details as would be required with a
> >>> kptr-based interface.
> >>>
> >>> As for the kfunc-based interface, I think it shows some promise.
> >>> Exposing a list of function names to retrieve individual metadata items
> >>> instead of a struct layout is sorta comparable in terms of developer UI
> >>> accessibility etc (IMO).
> >>
> >> Looks like there are quite some use cases for hw_timestamp.
> >> Do you think we could add it to the uapi like struct xdp_md?
> >>
> >> The following is the current xdp_md:
> >> struct xdp_md {
> >>           __u32 data;
> >>           __u32 data_end;
> >>           __u32 data_meta;
> >>           /* Below access go through struct xdp_rxq_info */
> >>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
> >>           __u32 rx_queue_index;  /* rxq->queue_index  */
> >>
> >>           __u32 egress_ifindex;  /* txq->dev->ifindex */
> >> };
> >>
> >> We could add  __u64 hw_timestamp to the xdp_md so user
> >> can just do xdp_md->hw_timestamp to get the value.
> >> xdp_md->hw_timestamp == 0 means hw_timestamp is not
> >> available.
> >>
> >> Inside the kernel, the ctx rewriter can generate code
> >> to call driver specific function to retrieve the data.
> >
> > If the driver generates the code to retrieve the data, how's that
> > different from the kfunc approach?
> > The only difference I see is that it would be a more strong UAPI than
> > the kfuncs?
>
> Right. it is a strong uapi.
>
> >
> >> The kfunc approach can be used to *less* common use cases?
> >
> > What's the advantage of having two approaches when one can cover
> > common and uncommon cases?
>
> Beyond hw_timestamp, do we have any other fields ready to support?
>
> If it ends up with lots of fields to be accessed by the bpf program,
> and bpf program actually intends to access these fields,
> using a strong uapi might be a good thing as it can make code
> much streamlined.

There are a bunch. Alexander's series has a good list:

https://github.com/alobakin/linux/commit/31bfe8035c995fdf4f1e378b3429d24b96846cc8

We can definitely call some of them more "common" than others, but
I'm not sure how strong of a definition that would be.
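
Off the top of my head, it's roughly this kind of thing (an
illustrative sketch, not the exact layout/naming from that series):

        struct xdp_meta_generic_sketch {
                __u64 rx_timestamp;   /* HW rx timestamp */
                __u32 rx_hash;        /* RSS hash */
                __u32 rx_hash_type;   /* L3/L4 hash type */
                __u32 rx_csum_status; /* HW checksum verification result */
                __u16 rx_vlan_tag;
                __u16 rx_vlan_proto;
        };

Timestamp and hash are probably the most "common" ones; it gets murky
pretty quickly after that.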

> >
> >>> There are three main drawbacks, AFAICT:
> >>>
> >>> 1. It requires driver developers to write and maintain the code that
> >>> generates the unrolled BPF bytecode to access the metadata fields, which
> >>> is a non-trivial amount of complexity. Maybe this can be abstracted away
> >>> with some internal helpers though (like, e.g., a
> >>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> >>> the required JMP/MOV/LDX instructions?
> >>>
> >>> 2. AF_XDP programs won't be able to access the metadata without using a
> >>> custom XDP program that calls the kfuncs and puts the data into the
> >>> metadata area. We could solve this with some code in libxdp, though; if
> >>> this code can be made generic enough (so it just dumps the available
> >>> metadata functions from the running kernel at load time), it may be
> >>> possible to make it generic enough that it will be forward-compatible
> >>> with new versions of the kernel that add new fields, which should
> >>> alleviate Florian's concern about keeping things in sync.
> >>>
> >>> 3. It will make it harder to consume the metadata when building SKBs. I
> >>> think the CPUMAP and veth use cases are also quite important, and that
> >>> we want metadata to be available for building SKBs in this path. Maybe
> >>> this can be resolved by having a convenient kfunc for this that can be
> >>> used for programs doing such redirects. E.g., you could just call
> >>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>> would recursively expand into all the kfunc calls needed to extract the
> >>> metadata supported by the SKB path?
> >>>
> >>> -Toke
> >>>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 17:00               ` Stanislav Fomichev
@ 2022-10-31 22:57                 ` Martin KaFai Lau
  2022-11-01  1:59                   ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Martin KaFai Lau @ 2022-10-31 22:57 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo,
	Toke Høiland-Jørgensen

On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>> 2. AF_XDP programs won't be able to access the metadata without using a
>> custom XDP program that calls the kfuncs and puts the data into the
>> metadata area. We could solve this with some code in libxdp, though; if
>> this code can be made generic enough (so it just dumps the available
>> metadata functions from the running kernel at load time), it may be
>> possible to make it generic enough that it will be forward-compatible
>> with new versions of the kernel that add new fields, which should
>> alleviate Florian's concern about keeping things in sync.
> 
> Good point. I had to convert to a custom program to use the kfuncs :-(
> But your suggestion sounds good; maybe libxdp can accept some extra
> info about at which offset the user would like to place the metadata
> and the library can generate the required bytecode?
> 
>> 3. It will make it harder to consume the metadata when building SKBs. I
>> think the CPUMAP and veth use cases are also quite important, and that
>> we want metadata to be available for building SKBs in this path. Maybe
>> this can be resolved by having a convenient kfunc for this that can be
>> used for programs doing such redirects. E.g., you could just call
>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>> would recursively expand into all the kfunc calls needed to extract the
>> metadata supported by the SKB path?
> 
> So this xdp_copy_metadata_for_skb will create a metadata layout that

Can xdp_copy_metadata_for_skb be written as a bpf prog itself?
I'm not sure where the best point to specify this prog is, though.  Somehow
during bpf_xdp_redirect_map?
Or does this prog belong to the target cpumap, so that the xdp prog
redirecting to this cpumap has to write the meta layout in a way that the
cpumap is expecting?
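
fwiw, plumbing-wise cpumap entries can already carry an XDP prog
(bpf_cpumap_val.bpf_prog), so hypothetically the userspace side could
look like:

        __u32 cpu = 2;
        struct bpf_cpumap_val val = {
                .qsize = 192,
                /* prog fixing up the meta layout for the skb path */
                .bpf_prog.fd = meta_prog_fd, /* hypothetical prog fd */
        };

        bpf_map_update_elem(cpu_map_fd, &cpu, &val, 0);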


> the kernel will be able to understand when converting back to skb?
> IIUC, the xdp program will look something like the following:
> 
> if (xdp packet is to be consumed by af_xdp) {
>    // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
> own metadata layout
>    return bpf_redirect_map(xsk, ...);
> } else {
>    // if the packet is to be consumed by the kernel
>    xdp_copy_metadata_for_skb(ctx);
>    return bpf_redirect(...);
> }
> 
> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
> put some magic number in the first byte(s) of the metadata so the
> kernel can check whether xdp_copy_metadata_for_skb has been called
> previously (or maybe xdp_frame can carry this extra signal, idk).


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 22:57                 ` Martin KaFai Lau
@ 2022-11-01  1:59                   ` Stanislav Fomichev
  2022-11-01 12:52                     ` Toke Høiland-Jørgensen
  2022-11-01 17:05                     ` Martin KaFai Lau
  0 siblings, 2 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-11-01  1:59 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo,
	Toke Høiland-Jørgensen

On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
> >> 2. AF_XDP programs won't be able to access the metadata without using a
> >> custom XDP program that calls the kfuncs and puts the data into the
> >> metadata area. We could solve this with some code in libxdp, though; if
> >> this code can be made generic enough (so it just dumps the available
> >> metadata functions from the running kernel at load time), it may be
> >> possible to make it generic enough that it will be forward-compatible
> >> with new versions of the kernel that add new fields, which should
> >> alleviate Florian's concern about keeping things in sync.
> >
> > Good point. I had to convert to a custom program to use the kfuncs :-(
> > But your suggestion sounds good; maybe libxdp can accept some extra
> > info about at which offset the user would like to place the metadata
> > and the library can generate the required bytecode?
> >
> >> 3. It will make it harder to consume the metadata when building SKBs. I
> >> think the CPUMAP and veth use cases are also quite important, and that
> >> we want metadata to be available for building SKBs in this path. Maybe
> >> this can be resolved by having a convenient kfunc for this that can be
> >> used for programs doing such redirects. E.g., you could just call
> >> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >> would recursively expand into all the kfunc calls needed to extract the
> >> metadata supported by the SKB path?
> >
> > So this xdp_copy_metadata_for_skb will create a metadata layout that
>
> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
> Not sure where is the best point to specify this prog though.  Somehow during
> bpf_xdp_redirect_map?
> or this prog belongs to the target cpumap and the xdp prog redirecting to this
> cpumap has to write the meta layout in a way that the cpumap is expecting?

We're probably interested in triggering it from the places where xdp
frames can eventually be converted into skbs?
So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
anything that's not XDP_DROP / AF_XDP redirect).
We can probably make it magically work, and can generate
kernel-digestible metadata whenever data == data_meta, but the
question is: should we?
(need to make sure we won't regress any existing cases that are not
relying on the metadata)

> > the kernel will be able to understand when converting back to skb?
> > IIUC, the xdp program will look something like the following:
> >
> > if (xdp packet is to be consumed by af_xdp) {
> >    // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
> > own metadata layout
> >    return bpf_redirect_map(xsk, ...);
> > } else {
> >    // if the packet is to be consumed by the kernel
> >    xdp_copy_metadata_for_skb(ctx);
> >    return bpf_redirect(...);
> > }
> >
> > Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
> > put some magic number in the first byte(s) of the metadata so the
> > kernel can check whether xdp_copy_metadata_for_skb has been called
> > previously (or maybe xdp_frame can carry this extra signal, idk).

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01  1:59                   ` Stanislav Fomichev
@ 2022-11-01 12:52                     ` Toke Høiland-Jørgensen
  2022-11-01 13:43                       ` David Ahern
  2022-11-01 17:05                     ` Martin KaFai Lau
  1 sibling, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-01 12:52 UTC (permalink / raw)
  To: Stanislav Fomichev, Martin KaFai Lau
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo

Stanislav Fomichev <sdf@google.com> writes:

> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>> >> 2. AF_XDP programs won't be able to access the metadata without using a
>> >> custom XDP program that calls the kfuncs and puts the data into the
>> >> metadata area. We could solve this with some code in libxdp, though; if
>> >> this code can be made generic enough (so it just dumps the available
>> >> metadata functions from the running kernel at load time), it may be
>> >> possible to make it generic enough that it will be forward-compatible
>> >> with new versions of the kernel that add new fields, which should
>> >> alleviate Florian's concern about keeping things in sync.
>> >
>> > Good point. I had to convert to a custom program to use the kfuncs :-(
>> > But your suggestion sounds good; maybe libxdp can accept some extra
>> > info about at which offset the user would like to place the metadata
>> > and the library can generate the required bytecode?
>> >
>> >> 3. It will make it harder to consume the metadata when building SKBs. I
>> >> think the CPUMAP and veth use cases are also quite important, and that
>> >> we want metadata to be available for building SKBs in this path. Maybe
>> >> this can be resolved by having a convenient kfunc for this that can be
>> >> used for programs doing such redirects. E.g., you could just call
>> >> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>> >> would recursively expand into all the kfunc calls needed to extract the
>> >> metadata supported by the SKB path?
>> >
>> > So this xdp_copy_metadata_for_skb will create a metadata layout that
>>
>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>> Not sure where is the best point to specify this prog though.  Somehow during
>> bpf_xdp_redirect_map?
>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>> cpumap has to write the meta layout in a way that the cpumap is expecting?
>
> We're probably interested in triggering it from the places where xdp
> frames can eventually be converted into skbs?
> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> anything that's not XDP_DROP / AF_XDP redirect).
> We can probably make it magically work, and can generate
> kernel-digestible metadata whenever data == data_meta, but the
> question - should we?
> (need to make sure we won't regress any existing cases that are not
> relying on the metadata)

So I was thinking about whether we could have the kernel do this
automatically, and concluded that this was probably not feasible in
general, which is why I suggested the explicit helper. My reasoning was
as follows:

For straight XDP_PASS in the driver we don't actually need to do
anything today, as the driver itself will build the SKB and read any
metadata it needs from the HW descriptor[0].

This leaves packets that are redirected (either to a veth or a cpumap so
we build SKBs from them later); here the problem is that we buffer the
packets (for performance reasons) so that the redirect doesn't actually
happen until after the driver exits the NAPI loop. At which point we
don't have access to the HW descriptors anymore, so we can't actually
read the metadata.

This means that if we want to execute the metadata gathering
automatically, we'd have to do it in xdp_do_redirect(). Which means that
we'll have to figure out, at that point, whether the XDP frame is likely
to be converted to an SKB. This will add at least one branch (and
probably more) that will be in-path for every redirected frame.
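
Just to illustrate, the automatic variant would have to be something
like this (sketch only; xdp_frame_likely_becomes_skb() is made up, and
it's exactly the kind of check I'd rather avoid):

        /* hypothetical addition to xdp_do_redirect() */
        if (xdp_frame_likely_becomes_skb(map) &&
            xdp->data_meta == xdp->data) /* prog didn't write its own layout */
                xdp_copy_metadata_for_skb(xdp); /* must run now, while
                                                 * the HW descriptor is
                                                 * still valid */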

Hence, making it up to the XDP program itself to decide whether it will
need the metadata for SKB conversion seems like a better choice, as long
as we make it easy for the XDP program to do this. Instead of a helper,
this could also simply be a new flag to the bpf_redirect{,_map}()
helpers (either opt-in or opt-out depending on the overhead), which
would be even simpler?

I.e.,

return bpf_redirect_map(&cpumap, 0, BPF_F_PREPARE_SKB_METADATA);

-Toke


[0] As an aside, in the future drivers may want to take advantage of the
XDP-specific metadata reading also when building SKBs (so it doesn't
have to implement it in both BPF and C code). For this, we could expose
a new internal helper function that the drivers could call to simply
execute the XDP-to-skb metadata helpers the same way the stack/helper
does.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-10-31 17:00           ` Stanislav Fomichev
@ 2022-11-01 13:18             ` Jesper Dangaard Brouer
  2022-11-01 20:12               ` Stanislav Fomichev
  2022-11-01 22:23               ` [xdp-hints] " Toke Høiland-Jørgensen
  0 siblings, 2 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-01 13:18 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexander Lobakin
  Cc: brouer, Jesper Dangaard Brouer, Martin KaFai Lau, ast, daniel,
	andrii, song, yhs, john.fastabend, kpsingh, haoluo, jolsa,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf,
	John Fastabend



On 31/10/2022 18.00, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
>>
>> From: Stanislav Fomichev <sdf@google.com>
>> Date: Fri, 28 Oct 2022 11:46:14 -0700
>>
>>> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
>>> <jbrouer@redhat.com> wrote:
>>>>
>>>>
>>>> On 28/10/2022 08.22, Martin KaFai Lau wrote:
>>>>> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
>>>>>> Example on how the metadata is prepared from the BPF context
>>>>>> and consumed by AF_XDP:
>>>>>>
>>>>>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
>>>>>>     if not, I'm assuming verifier will remove this "if (0)" branch
>>>>>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
>>>>>>     the program has to bpf_xdp_adjust_meta+memcpy it;
>>>>>>     maybe returning a pointer is better?
>>>>>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
>>>>>>     makes sure timestamp is not empty
>>>>>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
>>>>>>
>>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>>> Cc: xdp-hints@xdp-project.net
>>>>>> Cc: netdev@vger.kernel.org
>>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>>> ---
>>>>>>    .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
>>>>>>    tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
>>>>>>    2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> [...]
>>
>>>> IMHO sizeof() should come from a struct describing data_meta area see:
>>>>
>>>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
>>>
>>> I guess I should've used pointers for the return type instead, something like:
>>>
>>> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
>>>
>>> {
>>>     ...
>>>      __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>>>      if (rx_timestamp) {
>>>          bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
>>>          __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
>>>      }
>>> }
>>>
>>> Does that look better?
>>
>> I guess it will then be resolved to a direct store, right?
>> I mean, to smth like
>>
>>          if (rx_timestamp)
>>                  *(u32 *)data_meta = *rx_timestamp;
>>
>> , where *rx_timestamp points directly to the Rx descriptor?
> 
> Right. I should've used that form from the beginning, that memcpy is
> confusing :-(
> 
>>>
>>>>>> +        if (ret != 0)
>>>>>> +            return XDP_DROP;
>>>>>> +
>>>>>> +        data = (void *)(long)ctx->data;
>>>>>> +        data_meta = (void *)(long)ctx->data_meta;
>>>>>> +
>>>>>> +        if (data_meta + sizeof(__u32) > data)
>>>>>> +            return XDP_DROP;
>>>>>> +
>>>>>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>>>>>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
>>>>
>>>> So, this approach first stores hints on some other memory location, and
>>>> then need to copy over information into data_meta area. That isn't good
>>>> from a performance perspective.
>>>>
>>>> My idea is to store it in the final data_meta destination immediately.
>>>
>>> This approach doesn't have to store the hints in the other memory
>>> location. xdp_buff->priv can point to the real hw descriptor and the
>>> kfunc can have a bytecode that extracts the data from the hw
>>> descriptors. For this particular RFC, we can think that 'skb' is that
>>> hw descriptor for veth driver.

Once you point xdp_buff->priv to the real hw descriptor, then we also
need some additional data/pointers to the NIC hardware info + HW
setup state. You will hit some of the same challenges as John, like the
hardware/firmware revisions and chip models that Jakub pointed out.
Because your approach stays within the driver code, I guess it will be a
bit easier code-wise. Maybe we can store the data/pointers needed for
this in xdp_rxq_info (xdp->rxq).

I would need to see some code that juggles this HW NIC state from the
kfunc expansion to be convinced this is the right approach.
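
E.g. something like this per-rxq state (names made up) that the
unrolled kfunc code would have to consult:

        struct drv_xdp_rx_state {   /* hypothetical, hangs off xdp->rxq */
                const union drv_rx_desc *desc; /* current HW descriptor */
                u16 fw_rev;        /* firmware revision quirks */
                bool hw_tstamp_on; /* PHC/timestamp config state */
                bool rx_csum_on;   /* RX-checksum offload state */
        };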


>>
>> I really do think intermediate stores can be avoided with this
>> approach.
>> Oh, and BTW, if we plan to use a particular Hint in the BPF program
>> only, there's no need to place it in the metadata area at all, is
>> there? You only want to get it in your code, so just retrieve it to
>> the stack and that's it. data_meta is only for cases when you want
>> hints to appear in AF_XDP.
> 
> Correct.

It is not *only* AF_XDP that needs data stored in data_meta.

The stores to data_meta are also needed for veth and cpumap, because the
HW descriptor is "out-of-scope" and thus no longer available.


> 
>>>> Do notice that in my approach, the existing ethtool config setting and
>>>> socket options (for timestamps) still apply.  Thus, each individual
>>>> hardware hint are already configurable. Thus we already have a config
>>>> interface. I do acknowledge, that in-case a feature is disabled it still
>>>> takes up space in data_meta areas, but importantly it is NOT stored into
>>>> the area (for performance reasons).
>>>
>>> That should be the case with this rfc as well, isn't it? Worst case
>>> scenario, that kfunc bytecode can explicitly check ethtool options and
>>> return false if it's disabled?
>>
>> (to Jesper)
>>
>> Once again, Ethtool idea doesn't work. Let's say you have roughly
>> 50% of frames to be consumed by XDP, other 50% go to skb path and
>> the stack. In skb, I want as many HW data as possible: checksums,
>> hash and so on. Let's say in XDP prog I want only timestamp. What's
>> then? Disable everything but stamp and kill skb path? Enable
>> everything and kill XDP path?
>> Stanislav's approach allows you to request only fields you need from
>> the BPF prog directly, I don't see any reasons for adding one more
>> layer "oh no I won't give you checksum because it's disabled via
>> Ethtool".
>> Maybe I get something wrong, pls explain then :P
> 
> Agree, good point.

Stanislav's (and John's) proposal is definitely focused on addressing
something different from my patchset.

I optimized the XDP-hints population (for i40e) down to 6 nanosec (on a
3.6 GHz CPU = 21 cycles).  Plus, I added an ethtool switch to turn it
off for those XDP users that cannot live with this overhead, hoping
this would be fast enough that we didn't have to add this layer.
(If XDP returns XDP_PASS then this decoded info can be used for the SKB
creation. Thus, this is essentially just moving the RX-desc decoding a
bit earlier in the driver.)

One of my use-cases is getting RX-checksum support into xdp_frames and
transferring this to SKB creation time.  I have done a number of
measurements[1] to find out how much performance we can gain for UDP
packets (1500 bytes) with/without RX-checksum.  The initial result showed
I saved 91 nanosec, but that was while avoiding touching the data.  Doing
full userspace UDP delivery with a copy (or copy+checksum) showed the
real saving was 54 nanosec.  In this context, the 6 nanosec was very
small.  Thus, I chose not to pursue a BPF layer for individual fields.

  [1]
https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org

Sure, it is super cool if we can create this BPF layer that
programmatically selects individual fields from the descriptor, and maybe
we ALSO need that.
Could this layer still be added after my patchset(?), as one could
disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
only the fields needed by the specific XDP-prog use-case.
Could they also co-exist(?): kfuncs/kptr could extend the
xdp_hints_rx_common struct (in the data_meta area) with more advanced
offload-hints and then update the BTF-ID (yes, BPF can already resolve
its own BTF-IDs from BPF-prog code).
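
For reference, a simplified sketch of the common struct I have in mind
(modeled on the bpf-examples code linked earlier; the real one carries
more fields):

        struct xdp_hints_rx_common {
                __u32 rx_hash;
                __u64 rx_timestamp;
                __u32 btf_id; /* must be the last member, stored just in
                               * front of the packet data, so consumers
                               * can identify the layout */
        } __attribute__((aligned(4)));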

Great to see all the discussions and different opinions :-)
--Jesper


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01 12:52                     ` Toke Høiland-Jørgensen
@ 2022-11-01 13:43                       ` David Ahern
  2022-11-01 14:20                         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: David Ahern @ 2022-11-01 13:43 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev, Martin KaFai Lau
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo

On 11/1/22 6:52 AM, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>>> this code can be made generic enough (so it just dumps the available
>>>>> metadata functions from the running kernel at load time), it may be
>>>>> possible to make it generic enough that it will be forward-compatible
>>>>> with new versions of the kernel that add new fields, which should
>>>>> alleviate Florian's concern about keeping things in sync.
>>>>
>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>> info about at which offset the user would like to place the metadata
>>>> and the library can generate the required bytecode?
>>>>
>>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>> used for programs doing such redirects. E.g., you could just call
>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>> would recursively expand into all the kfunc calls needed to extract the
>>>>> metadata supported by the SKB path?
>>>>
>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>
>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>> Not sure where is the best point to specify this prog though.  Somehow during
>>> bpf_xdp_redirect_map?
>>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>>> cpumap has to write the meta layout in a way that the cpumap is expecting?
>>
>> We're probably interested in triggering it from the places where xdp
>> frames can eventually be converted into skbs?
>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>> anything that's not XDP_DROP / AF_XDP redirect).
>> We can probably make it magically work, and can generate
>> kernel-digestible metadata whenever data == data_meta, but the
>> question - should we?
>> (need to make sure we won't regress any existing cases that are not
>> relying on the metadata)
> 
> So I was thinking about whether we could have the kernel do this
> automatically, and concluded that this was probably not feasible in
> general, which is why I suggested the explicit helper. My reasoning was
> as follows:
> 
> For straight XDP_PASS in the driver we don't actually need to do
> anything today, as the driver itself will build the SKB and read any
> metadata it needs from the HW descriptor[0].

The program can pop encap headers, mpls tags, ... and thus affect the
metadata in the descriptor (besides the timestamp).

> 
> This leaves packets that are redirected (either to a veth or a cpumap so
> we build SKBs from them later); here the problem is that we buffer the
> packets (for performance reasons) so that the redirect doesn't actually
> happen until after the driver exits the NAPI loop. At which point we
> don't have access to the HW descriptors anymore, so we can't actually
> read the metadata.
> 
> This means that if we want to execute the metadata gathering
> automatically, we'd have to do it in xdp_do_redirect(). Which means that
> we'll have to figure out, at that point, whether the XDP frame is likely
> to be converted to an SKB. This will add at least one branch (and
> probably more) that will be in-path for every redirected frame.

or forwarded to a tun device as an xdp frame and wanting to pass
metadata into a VM which may construct an skb in the guest. This case is
arguably aligned with the redirect from vendor1 to vendor2.

This thread (and others) seems to be focused on the Rx path, but the Tx
path is equally important, with similar needs.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01 13:43                       ` David Ahern
@ 2022-11-01 14:20                         ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-01 14:20 UTC (permalink / raw)
  To: David Ahern, Stanislav Fomichev, Martin KaFai Lau
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo

David Ahern <dsahern@gmail.com> writes:

> On 11/1/22 6:52 AM, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>> metadata functions from the running kernel at load time), it may be
>>>>>> possible to make it generic enough that it will be forward-compatible
>>>>>> with new versions of the kernel that add new fields, which should
>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>
>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>> info about at which offset the user would like to place the metadata
>>>>> and the library can generate the required bytecode?
>>>>>
>>>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>> would recursively expand into all the kfunc calls needed to extract the
>>>>>> metadata supported by the SKB path?
>>>>>
>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>
>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>> Not sure where is the best point to specify this prog though.  Somehow during
>>>> bpf_xdp_redirect_map?
>>>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>>>> cpumap has to write the meta layout in a way that the cpumap is expecting?
>>>
>>> We're probably interested in triggering it from the places where xdp
>>> frames can eventually be converted into skbs?
>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>> anything that's not XDP_DROP / AF_XDP redirect).
>>> We can probably make it magically work, and can generate
>>> kernel-digestible metadata whenever data == data_meta, but the
>>> question - should we?
>>> (need to make sure we won't regress any existing cases that are not
>>> relying on the metadata)
>> 
>> So I was thinking about whether we could have the kernel do this
>> automatically, and concluded that this was probably not feasible in
>> general, which is why I suggested the explicit helper. My reasoning was
>> as follows:
>> 
>> For straight XDP_PASS in the driver we don't actually need to do
>> anything today, as the driver itself will build the SKB and read any
>> metadata it needs from the HW descriptor[0].
>
> The program can pop encap headers, mpls tags, ... and thus affect the
> metadata in the descriptor (besides the timestamp).

Hmm, right, good point. How does XDP_PASS deal with that today, though?

I guess this is an argument for making the "read HW metadata into SKB
format" thing be a kfunc/helper rather than a flag to bpf_redirect(),
then. Because then we can allow the XDP program to override/modify the
metadata afterwards, either by defining it as:

int xdp_copy_metadata_for_skb(struct xdp_md *ctx, struct xdp_skb_meta *override, int flags)

where the XDP program can fill in 'override' with new data that takes
precedence over the stuff from the HW (like a modified checksum or
offset or something).

Or we can just have xdp_copy_metadata_for_skb() write into the regular XDP
metadata area, and let the XDP program modify it afterwards. I feel like
the override argument would be easier to use, though.
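
To make the override variant concrete, a redirecting program could look
roughly like this (a minimal sketch: struct xdp_skb_meta's fields, the
XDP_SKB_META_CSUM flag and the kfunc itself are all hypothetical at
this point):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical override struct and flag, for illustration only. */
struct xdp_skb_meta {
	__u16 csum_start;
	__u16 csum_offset;
};
#define XDP_SKB_META_CSUM (1U << 0)

extern int xdp_copy_metadata_for_skb(struct xdp_md *ctx,
				     struct xdp_skb_meta *override,
				     int flags) __ksym;

SEC("xdp")
int redirect_with_meta(struct xdp_md *ctx)
{
	struct xdp_skb_meta override = {
		/* the program popped a tag, so the HW checksum location
		 * is stale; feed in corrected values */
		.csum_start  = 14,
		.csum_offset = 16,
	};

	if (xdp_copy_metadata_for_skb(ctx, &override, XDP_SKB_META_CSUM))
		return XDP_DROP;
	return bpf_redirect(2 /* target ifindex */, 0);
}

Everything not explicitly overridden would then come straight from the
HW descriptor.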

Also, having it be completely opaque *where* the metadata is stored when
using xdp_copy_metadata_for_skb() lets us be more flexible about it.
E.g., the helper could write the timestamp directly into
skb_shared_info, instead of stuffing it into the metadata area where it
then has to be copied out later.

>> This leaves packets that are redirected (either to a veth or a cpumap so
>> we build SKBs from them later); here the problem is that we buffer the
>> packets (for performance reasons) so that the redirect doesn't actually
>> happen until after the driver exits the NAPI loop. At which point we
>> don't have access to the HW descriptors anymore, so we can't actually
>> read the metadata.
>> 
>> This means that if we want to execute the metadata gathering
>> automatically, we'd have to do it in xdp_do_redirect(). Which means that
>> we'll have to figure out, at that point, whether the XDP frame is likely
>> to be converted to an SKB. This will add at least one branch (and
>> probably more) that will be in-path for every redirected frame.
>
> or forwarded to a tun device as an xdp frame and wanting to pass
> metadata into a VM which may construct an skb in the guest. This case is
> arguably aligned with the redirect from vendor1 to vendor2.
>
> This thread (and others) seem to be focused on the Rx path, but the Tx
> path is equally important with similar needs.

You're right, of course. Thinking a bit out loud here, but I actually
think the kfunc approach makes the TX side easier:

We already have the ability to execute a second "TX" XDP program inside
the devmaps. At which point that program is also tied to a particular
interface. So we could duplicate the RX-side kfunc trick, and expose a
set of *writer* kfuncs for metadata. So that an XDP program in the
devmap can simply do:

if (bpf_xdp_metadata_tx_timestamp_supported())
  bpf_xdp_metadata_tx_timestamp(ctx, tsval);

and those two kfuncs will be unrolled by the TX-side driver as well to
store them wherever they need to go to reach the wire.

The one complication here being, of course, that by the time the devmap
XDP program is executed, the driver hasn't seen the frame at all, yet,
so it doesn't have anywhere to store that data. We'd need to reuse the
frame metadata area for this (with some flag indicating that it's
valid), or we'd need a new area the driver could use as scratch space
specific to the xdp_frame (like the skb->cb field, I suppose).
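
As a strawman, such a devmap program could look like this (a sketch
only: both kfuncs are just the proposed names from above, and
SEC("xdp/devmap") is simply what current libbpf uses for devmap
programs):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Proposed writer kfuncs; these don't exist anywhere yet. */
extern int bpf_xdp_metadata_tx_timestamp_supported(void) __ksym;
extern int bpf_xdp_metadata_tx_timestamp(struct xdp_md *ctx, __u64 tsval) __ksym;

SEC("xdp/devmap")
int tx_meta(struct xdp_md *ctx)
{
	if (bpf_xdp_metadata_tx_timestamp_supported()) {
		/* parked in the frame's metadata/scratch area until the
		 * driver picks it up and fills the TX descriptor */
		bpf_xdp_metadata_tx_timestamp(ctx, bpf_ktime_get_ns());
	}

	return XDP_PASS;
}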

-Toke



* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 22:55                     ` Stanislav Fomichev
@ 2022-11-01 14:23                       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-01 14:23 UTC (permalink / raw)
  To: Stanislav Fomichev, Yonghong Song
  Cc: brouer, Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast, yhs,
	martin.lau, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo



On 31/10/2022 23.55, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 3:38 PM Yonghong Song <yhs@meta.com> wrote:
>>
>> On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
>>> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>>>
>>>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>>>> "Bezdeka, Florian"<florian.bezdeka@siemens.com>  writes:
>>>>>>
>>>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
[...]
>>>>>> All parts of my application (BPF program included) should not be
>>>>>> optimized/adjusted for all the different HW variants out there.
>>>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>>>> it's what the kernel is there for! If people want to bypass that and get
>>>>> direct access to the hardware, they can already do that by using DPDK.
>>>>>
>>>>> So in other words, 100% agreed that we should not expect the BPF
>>>>> developers to deal with hardware details as would be required with a
>>>>> kptr-based interface.
>>>>>
>>>>> As for the kfunc-based interface, I think it shows some promise.
>>>>> Exposing a list of function names to retrieve individual metadata items
>>>>> instead of a struct layout is sorta comparable in terms of developer UI
>>>>> accessibility etc (IMO).
>>>>
>>>> Looks like there are quite some use cases for hw_timestamp.
>>>> Do you think we could add it to the uapi like struct xdp_md?
>>>>
>>>> The following is the current xdp_md:
>>>> struct xdp_md {
>>>>            __u32 data;
>>>>            __u32 data_end;
>>>>            __u32 data_meta;
>>>>            /* Below access go through struct xdp_rxq_info */
>>>>            __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>>>            __u32 rx_queue_index;  /* rxq->queue_index  */
>>>>
>>>>            __u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> };
>>>>
>>>> We could add  __u64 hw_timestamp to the xdp_md so user
>>>> can just do xdp_md->hw_timestamp to get the value.
>>>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>>>> available.
>>>>
>>>> Inside the kernel, the ctx rewriter can generate code
>>>> to call driver specific function to retrieve the data.
>>> If the driver generates the code to retrieve the data, how's that
>>> different from the kfunc approach?
>>> The only difference I see is that it would be a stronger UAPI than
>>> the kfuncs?
>> Right. It is a strong uapi.
>>
>>>> The kfunc approach can be used for *less* common use cases?
>>> What's the advantage of having two approaches when one can cover
>>> common and uncommon cases?
>>
>> Beyond hw_timestamp, do we have any other fields ready to support?
>>
>> If it ends up with lots of fields to be accessed by the bpf program,
>> and bpf program actually intends to access these fields,
>> using a strong uapi might be a good thing as it can make code
>> much streamlined.
>
> There are a bunch. Alexander's series has a good list:
> 
> https://github.com/alobakin/linux/commit/31bfe8035c995fdf4f1e378b3429d24b96846cc8
> 

Below are the fields I've identified, which are close to what Alexander 
also found.

  struct xdp_hints_common {
	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	u16 rx_queue;
	u16 vlan_tci;
	u32 rx_hash32;
	u32 xdp_hints_flags;
	u64 btf_full_id; /* BTF object + type ID */
  } __attribute__((aligned(4))) __attribute__((packed));

Some of the fields are encoded via flags:

  enum xdp_hints_flags {
	HINT_FLAG_CSUM_TYPE_BIT0  = BIT(0),
	HINT_FLAG_CSUM_TYPE_BIT1  = BIT(1),
	HINT_FLAG_CSUM_TYPE_MASK  = 0x3,

	HINT_FLAG_CSUM_LEVEL_BIT0 = BIT(2),
	HINT_FLAG_CSUM_LEVEL_BIT1 = BIT(3),
	HINT_FLAG_CSUM_LEVEL_MASK = 0xC,
	HINT_FLAG_CSUM_LEVEL_SHIFT = 2,

	HINT_FLAG_RX_HASH_TYPE_BIT0 = BIT(4),
	HINT_FLAG_RX_HASH_TYPE_BIT1 = BIT(5),
	HINT_FLAG_RX_HASH_TYPE_MASK = 0x30,
	HINT_FLAG_RX_HASH_TYPE_SHIFT = 0x4,

	HINT_FLAG_RX_QUEUE = BIT(7),

	HINT_FLAG_VLAN_PRESENT            = BIT(8),
	HINT_FLAG_VLAN_PROTO_ETH_P_8021Q  = BIT(9),
	HINT_FLAG_VLAN_PROTO_ETH_P_8021AD = BIT(10),
	/* Flags from BIT(16) can be used by drivers */
  };

> We can definitely call some of them more "common" than the others, but
> not sure how strong of a definition that would be.

The important fields worth considering as UAPI candidates are:
(1) RX-hash, (2) Hash-type and (3) RX-checksum.
With these three we can avoid calling the flow-dissector, and GRO frame
aggregation works. (This currently hurts xdp_frame-to-SKB performance a
lot in practice.)

*BUT* in its current form above (incl. Alexander's approach/patch) it
would be a mistake to standardize the "(2) Hash-type" as UAPI in this
simplified "reduced" form (which is what the SKB "needs").

There is a huge untapped potential in the Hash-type.  Thanks to
Microsoft, almost all NIC hardware provides a Hash-type that gives us the
L3-protocol (IPv4 or IPv6) and the L4-protocol (UDP or TCP and sometimes
SCTP), plus info on whether extension-headers are present. (Digging in
datasheets, we can often also get the header-size.)

Think about how many cycles an XDP BPF-prog can save on parsing protocol
headers.  I'm also hoping we can leverage this to allow SKBs created
from an xdp_frame to have skb->transport_header and skb->network_header
pre-populated (and skip some of these netstack layers).
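
As an illustration of the kind of cycles this could save (a sketch: the
kfunc and the XDP_HASH_TYPE_* encoding below are made-up names, not an
existing interface):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Made-up hash-type encoding and kfunc, for illustration only. */
#define XDP_HASH_TYPE_L4_MASK	0x3
#define XDP_HASH_TYPE_L4_TCP	0x1
#define XDP_HASH_TYPE_L4_UDP	0x2

extern __u32 bpf_xdp_metadata_rx_hash(struct xdp_md *ctx, __u32 *type) __ksym;

SEC("xdp")
int classify(struct xdp_md *ctx)
{
	__u32 type = 0;
	__u32 hash = bpf_xdp_metadata_rx_hash(ctx, &type);

	if (!hash)
		return XDP_PASS;

	/* Drop a UDP flood without parsing a single packet header:
	 * the NIC already told us what this frame is. */
	if ((type & XDP_HASH_TYPE_L4_MASK) == XDP_HASH_TYPE_L4_UDP)
		return XDP_DROP;

	return XDP_PASS;
}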

--Jesper

p.s. in my patchset, I exposed the "raw" Hash-type bits from the
descriptor in the hope this would evolve.



* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01  1:59                   ` Stanislav Fomichev
  2022-11-01 12:52                     ` Toke Høiland-Jørgensen
@ 2022-11-01 17:05                     ` Martin KaFai Lau
  2022-11-01 20:12                       ` Stanislav Fomichev
  2022-11-02 14:06                       ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 50+ messages in thread
From: Martin KaFai Lau @ 2022-11-01 17:05 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo,
	Toke Høiland-Jørgensen

On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>> this code can be made generic enough (so it just dumps the available
>>>> metadata functions from the running kernel at load time), it may be
>>>> possible to make it generic enough that it will be forward-compatible
>>>> with new versions of the kernel that add new fields, which should
>>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>> info about at which offset the user would like to place the metadata
>>> and the library can generate the required bytecode?
>>>
>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>> this can be resolved by having a convenient kfunc for this that can be
>>>> used for programs doing such redirects. E.g., you could just call
>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>> would recursively expand into all the kfunc calls needed to extract the
>>>> metadata supported by the SKB path?
>>>
>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>
>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>> Not sure where is the best point to specify this prog though.  Somehow during
>> bpf_xdp_redirect_map?
>> or this prog belongs to the target cpumap and the xdp prog redirecting to this
>> cpumap has to write the meta layout in a way that the cpumap is expecting?
> 
> We're probably interested in triggering it from the places where xdp
> frames can eventually be converted into skbs?
> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> anything that's not XDP_DROP / AF_XDP redirect).
> We can probably make it magically work, and can generate
> kernel-digestible metadata whenever data == data_meta, but the
> question - should we?
> (need to make sure we won't regress any existing cases that are not
> relying on the metadata)

Instead of having some kernel-digestible metadata, how about calling another
bpf prog to initialize the skb fields from the meta area after
__xdp_build_skb_from_frame() in the cpumap, so
run_xdp_set_skb_fields_from_metadata() may be a better name.

The xdp_prog@rx sets the metadata and then redirects.  If the xdp_prog@rx can
also specify an xdp prog to initialize the skb fields from the meta area, then
there is no need to have a kfunc to enforce a kernel-digestible layout.  Not
sure what is a good way to specify this xdp_prog though...
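
If the layouts are instead agreed on out of band, the rx prog and the
skb-fields prog would just share a struct, along the lines of the magic
number idea quoted below (purely illustrative, no such struct exists):

/* Layout shared between the redirecting prog and whatever entity
 * (kfunc, or the skb-fields prog suggested above) consumes it. */
struct meta_for_skb {
	__u32 magic;		/* lets the consumer detect a valid layout */
	__u32 rx_hash;
	__u64 rx_timestamp;
};
#define META_FOR_SKB_MAGIC	0xeb9f0001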

>>> the kernel will be able to understand when converting back to skb?
>>> IIUC, the xdp program will look something like the following:
>>>
>>> if (xdp packet is to be consumed by af_xdp) {
>>>     // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
>>> own metadata layout
>>>     return bpf_redirect_map(xsk, ...);
>>> } else {
>>>     // if the packet is to be consumed by the kernel
>>>     xdp_copy_metadata_for_skb(ctx);
>>>     return bpf_redirect(...);
>>> }
>>>
>>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
>>> put some magic number in the first byte(s) of the metadata so the
>>> kernel can check whether xdp_copy_metadata_for_skb has been called
>>> previously (or maybe xdp_frame can carry this extra signal, idk).



* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-10-31 22:09                 ` Stanislav Fomichev
  2022-10-31 22:38                   ` Yonghong Song
@ 2022-11-01 17:31                   ` Martin KaFai Lau
  2022-11-01 20:12                     ` Stanislav Fomichev
  1 sibling, 1 reply; 50+ messages in thread
From: Martin KaFai Lau @ 2022-11-01 17:31 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast,
	brouer, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo, Yonghong Song

On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>
>>
>>
>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>>>
>>>> Hi all,
>>>>
>>>> I was closely following this discussion for some time now. Seems we
>>>> reached the point where it's getting interesting for me.
>>>>
>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>>>> differences if the user space code has to handle all of it.
>>>>>>
>>>>>> I don't see how its any harder in practice though?
>>>>>
>>>>> You need to find out what HW/FW/config you're running, right?
>>>>> And all you have is a pointer to a blob of unknown type.
>>>>>
>>>>> Take timestamps for example, some NICs support adjusting the PHC
>>>>> or doing SW corrections (with different versions of hw/fw/server
>>>>> platforms being capable of both/one/neither).
>>>>>
>>>>> Sure you can extract all this info with tracing and careful
>>>>> inspection via uAPI. But I don't think that's _easier_.
>>>>> And the vendors can't run the results thru their validation
>>>>> (for whatever that's worth).
>>>>>
>>>>>>> I've had the same concern:
>>>>>>>
>>>>>>> Until we have some userspace library that abstracts all these details,
>>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>>>> of data and I need to go through the code and see what particular type
>>>>>>> it represents for my particular device and how the data I need is
>>>>>>> represented there. There are also these "if this is device v1 -> use
>>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>>>> this burden on the driver developers, but I agree that the drawback
>>>>>>> here is that we actually have to wait for the implementations to catch
>>>>>>> up.
>>>>>>
>>>>>> I agree with everything there, you will get a blob of data and then
>>>>>> will need to know what field you want to read using BTF. But, we
>>>>>> already do this for BPF programs all over the place so it's not a big
>>>>>> lift for us. All other BPF tracing/observability requires the same
>>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>>>> place left to write BPF programs without thinking about BTF and
>>>>>> kernel data structures.
>>>>>>
>>>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>>>   From my point of view of supporting Cilium its a win and much preferred
>>>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>>>> and so on.
>>>>>>
>>>>>> If vendor updates firmware with new fields I get those immediately.
>>>>>
>>>>> Conversely it's a valid concern that those who *do* actually update
>>>>> their kernel regularly will have more things to worry about.
>>>>>
>>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>>>> programs might have to take a lot of other state into consideration
>>>>>>> when parsing the descriptors; all those details do seem like they
>>>>>>> belong to the driver code.
>>>>>>
>>>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>>>> be involved. With just a kptr I can support the device and any
>>>>>> firmware versions without requiring help.
>>>>>
>>>>> 1) where are you getting all those HW / FW specs :S
>>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>>>
>>>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>>>> might push us closer to the agreement on the best way to proceed.
>>>>>>
>>>>>> I'll try to do an intel and mlx implementation to get a cross section.
>>>>>> I have a good collection of nics here so should be able to show a
>>>>>> couple firmware versions. It could be fine I think to have the raw
>>>>>> kptr access and then also kfuncs for some things perhaps.
>>>>>>
>>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>>>> vendors are supported and that's it.
>>>>>>
>>>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>>>> the opposite,
>>>>>
>>>>> TBH I'm only replying to the email because of the above part :)
>>>>> I thought this would be self evident, but I guess our perspectives
>>>>> are different.
>>>>>
>>>>> Perhaps you look at it from the perspective of SW running on someone
>>>>> else's cloud, and being able to move to another cloud, without having
>>>>> to worry if feature X is available in xdp or just skb.
>>>>>
>>>>> I look at it from the perspective of maintaining a cloud, with people
>>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>>>> (superior) startup, and cloud users are messing with raw descriptor -
>>>>> I'd need to go find every XDP program out there and make sure it
>>>>> understands the new descriptors.
>>>>
>>>> Here is another perspective:
>>>>
>>>> As AF_XDP application developer I don't want to deal with the
>>>> underlying hardware in detail. I like to request a feature from the OS
>>>> (in this case rx/tx timestamping). If the feature is available I will
>>>> simply use it, if not I might have to work around it - maybe by falling
>>>> back to SW timestamping.
>>>>
>>>> All parts of my application (BPF program included) should not be
>>>> optimized/adjusted for all the different HW variants out there.
>>>
>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>> it's what the kernel is there for! If people want to bypass that and get
>>> direct access to the hardware, they can already do that by using DPDK.
>>>
>>> So in other words, 100% agreed that we should not expect the BPF
>>> developers to deal with hardware details as would be required with a
>>> kptr-based interface.
>>>
>>> As for the kfunc-based interface, I think it shows some promise.
>>> Exposing a list of function names to retrieve individual metadata items
>>> instead of a struct layout is sorta comparable in terms of developer UI
>>> accessibility etc (IMO).
>>
>> Looks like there are quite some use cases for hw_timestamp.
>> Do you think we could add it to the uapi like struct xdp_md?
>>
>> The following is the current xdp_md:
>> struct xdp_md {
>>           __u32 data;
>>           __u32 data_end;
>>           __u32 data_meta;
>>           /* Below access go through struct xdp_rxq_info */
>>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>           __u32 rx_queue_index;  /* rxq->queue_index  */
>>
>>           __u32 egress_ifindex;  /* txq->dev->ifindex */
>> };
>>
>> We could add  __u64 hw_timestamp to the xdp_md so user
>> can just do xdp_md->hw_timestamp to get the value.
>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>> available.
>>
>> Inside the kernel, the ctx rewriter can generate code
>> to call driver specific function to retrieve the data.
> 
> If the driver generates the code to retrieve the data, how's that
> different from the kfunc approach?
> The only difference I see is that it would be a stronger UAPI than
> the kfuncs?

Another thing may be worth considering: some hints for some HW/driver may be
harder (or may not be worth it) to unroll/inline.  For example, I see a driver
doing spin_lock_bh while getting the hwtstamp.  For this case, keeping the
kfunc call and avoiding the unroll/inline may be the right thing to do.

> 
>> The kfunc approach can be used for *less* common use cases?
> 
> What's the advantage of having two approaches when one can cover
> common and uncommon cases?
> 
>>> There are three main drawbacks, AFAICT:
>>>
>>> 1. It requires driver developers to write and maintain the code that
>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>> with some internal helpers though (like, e.g., a
>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>> the required JMP/MOV/LDX instructions?
>>>
>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>> custom XDP program that calls the kfuncs and puts the data into the
>>> metadata area. We could solve this with some code in libxdp, though; if
>>> this code can be made generic enough (so it just dumps the available
>>> metadata functions from the running kernel at load time), it may be
>>> possible to make it generic enough that it will be forward-compatible
>>> with new versions of the kernel that add new fields, which should
>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>> think the CPUMAP and veth use cases are also quite important, and that
>>> we want metadata to be available for building SKBs in this path. Maybe
>>> this can be resolved by having a convenient kfunc for this that can be
>>> used for programs doing such redirects. E.g., you could just call
>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>> would recursively expand into all the kfunc calls needed to extract the
>>> metadata supported by the SKB path?
>>>
>>> -Toke
>>>



* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01 17:31                   ` Martin KaFai Lau
@ 2022-11-01 20:12                     ` Stanislav Fomichev
  2022-11-01 21:17                       ` Martin KaFai Lau
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-11-01 20:12 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast,
	brouer, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo, Yonghong Song

On Tue, Nov 1, 2022 at 10:31 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
> >>
> >>
> >>
> >> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
> >>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I was closely following this discussion for some time now. Seems we
> >>>> reached the point where it's getting interesting for me.
> >>>>
> >>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
> >>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
> >>>>>>>> And it's actually harder to abstract away inter HW generation
> >>>>>>>> differences if the user space code has to handle all of it.
> >>>>>>
> >>>>>> I don't see how its any harder in practice though?
> >>>>>
> >>>>> You need to find out what HW/FW/config you're running, right?
> >>>>> And all you have is a pointer to a blob of unknown type.
> >>>>>
> >>>>> Take timestamps for example, some NICs support adjusting the PHC
> >>>>> or doing SW corrections (with different versions of hw/fw/server
> >>>>> platforms being capable of both/one/neither).
> >>>>>
> >>>>> Sure you can extract all this info with tracing and careful
> >>>>> inspection via uAPI. But I don't think that's _easier_.
> >>>>> And the vendors can't run the results thru their validation
> >>>>> (for whatever that's worth).
> >>>>>
> >>>>>>> I've had the same concern:
> >>>>>>>
> >>>>>>> Until we have some userspace library that abstracts all these details,
> >>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
> >>>>>>> of data and I need to go through the code and see what particular type
> >>>>>>> it represents for my particular device and how the data I need is
> >>>>>>> represented there. There are also these "if this is device v1 -> use
> >>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
> >>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
> >>>>>>> this burden on the driver developers, but I agree that the drawback
> >>>>>>> here is that we actually have to wait for the implementations to catch
> >>>>>>> up.
> >>>>>>
> >>>>>> I agree with everything there, you will get a blob of data and then
> >>>>>> will need to know what field you want to read using BTF. But, we
> >>>>>> already do this for BPF programs all over the place so it's not a big
> >>>>>> lift for us. All other BPF tracing/observability requires the same
> >>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
> >>>>>> place left to write BPF programs without thinking about BTF and
> >>>>>> kernel data structures.
> >>>>>>
> >>>>>> But, with proposed kptr the complexity lives in userspace and can be
> >>>>>> fixed, added, updated without having to bother with kernel updates, etc.
> >>>>>>   From my point of view of supporting Cilium its a win and much preferred
> >>>>>> to having to deal with driver owners on all cloud vendors, distributions,
> >>>>>> and so on.
> >>>>>>
> >>>>>> If vendor updates firmware with new fields I get those immediately.
> >>>>>
> >>>>> Conversely it's a valid concern that those who *do* actually update
> >>>>> their kernel regularly will have more things to worry about.
> >>>>>
> >>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
> >>>>>>> programs might have to take a lot of other state into consideration
> >>>>>>> when parsing the descriptors; all those details do seem like they
> >>>>>>> belong to the driver code.
> >>>>>>
> >>>>>> I would prefer to avoid being stuck on requiring driver writers to
> >>>>>> be involved. With just a kptr I can support the device and any
> >>>>>> firmware versions without requiring help.
> >>>>>
> >>>>> 1) where are you getting all those HW / FW specs :S
> >>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
> >>>>>
> >>>>>>> Feel free to send it early with just a handful of drivers implemented;
> >>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
> >>>>>>> nice sample/test case that shows how the metadata can be used, that
> >>>>>>> might push us closer to the agreement on the best way to proceed.
> >>>>>>
> >>>>>> I'll try to do an intel and mlx implementation to get a cross section.
> >>>>>> I have a good collection of nics here so should be able to show a
> >>>>>> couple firmware versions. It could be fine I think to have the raw
> >>>>>> kptr access and then also kfuncs for some things perhaps.
> >>>>>>
> >>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
> >>>>>>>> parsing to user space will indeed result in what you just said - major
> >>>>>>>> vendors are supported and that's it.
> >>>>>>
> >>>>>> I'm not sure about why it would make it harder for new vendors? I think
> >>>>>> the opposite,
> >>>>>
> >>>>> TBH I'm only replying to the email because of the above part :)
> >>>>> I thought this would be self evident, but I guess our perspectives
> >>>>> are different.
> >>>>>
> >>>>> Perhaps you look at it from the perspective of SW running on someone
> >>>>> else's cloud, and being able to move to another cloud, without having
> >>>>> to worry if feature X is available in xdp or just skb.
> >>>>>
> >>>>> I look at it from the perspective of maintaining a cloud, with people
> >>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
> >>>>> (superior) startup, and cloud users are messing with raw descriptor -
> >>>>> I'd need to go find every XDP program out there and make sure it
> >>>>> understands the new descriptors.
> >>>>
> >>>> Here is another perspective:
> >>>>
> >>>> As AF_XDP application developer I don't want to deal with the
> >>>> underlying hardware in detail. I like to request a feature from the OS
> >>>> (in this case rx/tx timestamping). If the feature is available I will
> >>>> simply use it, if not I might have to work around it - maybe by falling
> >>>> back to SW timestamping.
> >>>>
> >>>> All parts of my application (BPF program included) should not be
> >>>> optimized/adjusted for all the different HW variants out there.
> >>>
> >>> Yes, absolutely agreed. Abstracting away those kinds of hardware
> >>> differences is the whole *point* of having an OS/driver model. I.e.,
> >>> it's what the kernel is there for! If people want to bypass that and get
> >>> direct access to the hardware, they can already do that by using DPDK.
> >>>
> >>> So in other words, 100% agreed that we should not expect the BPF
> >>> developers to deal with hardware details as would be required with a
> >>> kptr-based interface.
> >>>
> >>> As for the kfunc-based interface, I think it shows some promise.
> >>> Exposing a list of function names to retrieve individual metadata items
> >>> instead of a struct layout is sorta comparable in terms of developer UI
> >>> accessibility etc (IMO).
> >>
> >> Looks like there are quite some use cases for hw_timestamp.
> >> Do you think we could add it to the uapi like struct xdp_md?
> >>
> >> The following is the current xdp_md:
> >> struct xdp_md {
> >>           __u32 data;
> >>           __u32 data_end;
> >>           __u32 data_meta;
> >>           /* Below access go through struct xdp_rxq_info */
> >>           __u32 ingress_ifindex; /* rxq->dev->ifindex */
> >>           __u32 rx_queue_index;  /* rxq->queue_index  */
> >>
> >>           __u32 egress_ifindex;  /* txq->dev->ifindex */
> >> };
> >>
> >> We could add  __u64 hw_timestamp to the xdp_md so user
> >> can just do xdp_md->hw_timestamp to get the value.
> >> xdp_md->hw_timestamp == 0 means hw_timestamp is not
> >> available.
> >>
> >> Inside the kernel, the ctx rewriter can generate code
> >> to call driver specific function to retrieve the data.
> >
> > If the driver generates the code to retrieve the data, how's that
> > different from the kfunc approach?
> > The only difference I see is that it would be a stronger UAPI than
> > the kfuncs?
>
> Another thing may be worth considering: some hints for some HW/driver may be
> harder (or may not be worth it) to unroll/inline.  For example, I see a driver
> doing spin_lock_bh while getting the hwtstamp.  For this case, keeping the
> kfunc call and avoiding the unroll/inline may be the right thing to do.

Yeah, I'm trying to look at the drivers right now and doing
spinlocks/seqlocks might complicate the story...
But it seems like in the worst case, the unrolled bytecode can always
resort to calling a kernel function?
(we might need to have some scratch area to preserve r1-r5 and we
can't touch r6-r9 because we are not in a real call, but seems doable;
I'll try to illustrate with a bunch of examples)


> >> The kfunc approach can be used for *less* common use cases?
> >
> > What's the advantage of having two approaches when one can cover
> > common and uncommon cases?
> >
> >>> There are three main drawbacks, AFAICT:
> >>>
> >>> 1. It requires driver developers to write and maintain the code that
> >>> generates the unrolled BPF bytecode to access the metadata fields, which
> >>> is a non-trivial amount of complexity. Maybe this can be abstracted away
> >>> with some internal helpers though (like, e.g., a
> >>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
> >>> the required JMP/MOV/LDX instructions?
> >>>
> >>> 2. AF_XDP programs won't be able to access the metadata without using a
> >>> custom XDP program that calls the kfuncs and puts the data into the
> >>> metadata area. We could solve this with some code in libxdp, though; if
> >>> this code can be made generic enough (so it just dumps the available
> >>> metadata functions from the running kernel at load time), it may be
> >>> possible to make it generic enough that it will be forward-compatible
> >>> with new versions of the kernel that add new fields, which should
> >>> alleviate Florian's concern about keeping things in sync.
> >>>
> >>> 3. It will make it harder to consume the metadata when building SKBs. I
> >>> think the CPUMAP and veth use cases are also quite important, and that
> >>> we want metadata to be available for building SKBs in this path. Maybe
> >>> this can be resolved by having a convenient kfunc for this that can be
> >>> used for programs doing such redirects. E.g., you could just call
> >>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>> would recursively expand into all the kfunc calls needed to extract the
> >>> metadata supported by the SKB path?
> >>>
> >>> -Toke
> >>>
>


* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01 17:05                     ` Martin KaFai Lau
@ 2022-11-01 20:12                       ` Stanislav Fomichev
  2022-11-02 14:06                       ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-11-01 20:12 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, brouer, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo,
	Toke Høiland-Jørgensen

On Tue, Nov 1, 2022 at 10:05 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
> >>>> 2. AF_XDP programs won't be able to access the metadata without using a
> >>>> custom XDP program that calls the kfuncs and puts the data into the
> >>>> metadata area. We could solve this with some code in libxdp, though; if
> >>>> this code can be made generic enough (so it just dumps the available
> >>>> metadata functions from the running kernel at load time), it may be
> >>>> possible to make it generic enough that it will be forward-compatible
> >>>> with new versions of the kernel that add new fields, which should
> >>>> alleviate Florian's concern about keeping things in sync.
> >>>
> >>> Good point. I had to convert to a custom program to use the kfuncs :-(
> >>> But your suggestion sounds good; maybe libxdp can accept some extra
> >>> info about at which offset the user would like to place the metadata
> >>> and the library can generate the required bytecode?
> >>>
> >>>> 3. It will make it harder to consume the metadata when building SKBs. I
> >>>> think the CPUMAP and veth use cases are also quite important, and that
> >>>> we want metadata to be available for building SKBs in this path. Maybe
> >>>> this can be resolved by having a convenient kfunc for this that can be
> >>>> used for programs doing such redirects. E.g., you could just call
> >>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>>> would recursively expand into all the kfunc calls needed to extract the
> >>>> metadata supported by the SKB path?
> >>>
> >>> So this xdp_copy_metadata_for_skb will create a metadata layout that
> >>
> >> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
> >> Not sure where is the best point to specify this prog though.  Somehow during
> >> bpf_xdp_redirect_map?
> >> or this prog belongs to the target cpumap and the xdp prog redirecting to this
> >> cpumap has to write the meta layout in a way that the cpumap is expecting?
> >
> > We're probably interested in triggering it from the places where xdp
> > frames can eventually be converted into skbs?
> > So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> > anything that's not XDP_DROP / AF_XDP redirect).
> > We can probably make it magically work, and can generate
> > kernel-digestible metadata whenever data == data_meta, but the
> > question - should we?
> > (need to make sure we won't regress any existing cases that are not
> > relying on the metadata)
>
> Instead of having some kernel-digestible metadata, how about calling another
> bpf prog to initialize the skb fields from the meta area after
> __xdp_build_skb_from_frame() in the cpumap, so
> run_xdp_set_skb_fields_from_metadata() may be a better name.
>
> The xdp_prog@rx sets the metadata and then redirects.  If the xdp_prog@rx can
> also specify an xdp prog to initialize the skb fields from the meta area, then
> there is no need to have a kfunc to enforce a kernel-digestible layout.  Not
> sure what is a good way to specify this xdp_prog though...

Also not sure whether doing it at this point would be too late or not?
Need to take a closer look at all __xdp_build_skb_from_frame call sites...
Also see Toke/Dave discussing potentially having helpers to override
some of that metadata. In this case, having more control on the user
side makes sense.

I'll probably start with an explicit helper for now, just to
see if the overall approach is workable. Maybe we can have a follow up
discussion about doing it more transparently.

> >>> the kernel will be able to understand when converting back to skb?
> >>> IIUC, the xdp program will look something like the following:
> >>>
> >>> if (xdp packet is to be consumed by af_xdp) {
> >>>     // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble your
> >>> own metadata layout
> >>>     return bpf_redirect_map(xsk, ...);
> >>> } else {
> >>>     // if the packet is to be consumed by the kernel
> >>>     xdp_copy_metadata_for_skb(ctx);
> >>>     return bpf_redirect(...);
> >>> }
> >>>
> >>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
> >>> put some magic number in the first byte(s) of the metadata so the
> >>> kernel can check whether xdp_copy_metadata_for_skb has been called
> >>> previously (or maybe xdp_frame can carry this extra signal, idk).
>


* Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-11-01 13:18             ` Jesper Dangaard Brouer
@ 2022-11-01 20:12               ` Stanislav Fomichev
  2022-11-01 22:23               ` [xdp-hints] " Toke Høiland-Jørgensen
  1 sibling, 0 replies; 50+ messages in thread
From: Stanislav Fomichev @ 2022-11-01 20:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Lobakin, brouer, Martin KaFai Lau, ast, daniel, andrii,
	song, yhs, kpsingh, haoluo, jolsa, Jakub Kicinski,
	Willem de Bruijn, Anatoly Burakov, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf, John Fastabend

On Tue, Nov 1, 2022 at 6:18 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 31/10/2022 18.00, Stanislav Fomichev wrote:
> > On Mon, Oct 31, 2022 at 7:22 AM Alexander Lobakin
> > <alexandr.lobakin@intel.com> wrote:
> >>
> >> From: Stanislav Fomichev <sdf@google.com>
> >> Date: Fri, 28 Oct 2022 11:46:14 -0700
> >>
> >>> On Fri, Oct 28, 2022 at 3:37 AM Jesper Dangaard Brouer
> >>> <jbrouer@redhat.com> wrote:
> >>>>
> >>>>
> >>>> On 28/10/2022 08.22, Martin KaFai Lau wrote:
> >>>>> On 10/27/22 1:00 PM, Stanislav Fomichev wrote:
> >>>>>> Example on how the metadata is prepared from the BPF context
> >>>>>> and consumed by AF_XDP:
> >>>>>>
> >>>>>> - bpf_xdp_metadata_have_rx_timestamp to test whether it's supported;
> >>>>>>     if not, I'm assuming verifier will remove this "if (0)" branch
> >>>>>> - bpf_xdp_metadata_rx_timestamp returns a _copy_ of metadata;
> >>>>>>     the program has to bpf_xdp_adjust_meta+memcpy it;
> >>>>>>     maybe returning a pointer is better?
> >>>>>> - af_xdp consumer grabs it from data-<expected_metadata_offset> and
> >>>>>>     makes sure timestamp is not empty
> >>>>>> - when loading the program, we pass BPF_F_XDP_HAS_METADATA+prog_ifindex
> >>>>>>
> >>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> >>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
> >>>>>> Cc: Willem de Bruijn <willemb@google.com>
> >>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> >>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> >>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
> >>>>>> Cc: xdp-hints@xdp-project.net
> >>>>>> Cc: netdev@vger.kernel.org
> >>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >>>>>> ---
> >>>>>>    .../testing/selftests/bpf/progs/xskxceiver.c  | 22 ++++++++++++++++++
> >>>>>>    tools/testing/selftests/bpf/xskxceiver.c      | 23 ++++++++++++++++++-
> >>>>>>    2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> [...]
> >>
> >>>> IMHO sizeof() should come from a struct describing the data_meta area, see:
> >>>>
> >>>> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L62
> >>>
> >>> I guess I should've used pointers for the return type instead, something like:
> >>>
> >>> extern __u64 *bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
> >>>
> >>> {
> >>>     ...
> >>>      __u64 *rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >>>      if (rx_timestamp) {
> >>>          bpf_xdp_adjust_meta(ctx, -(int)sizeof(*rx_timestamp));
> >>>          __builtin_memcpy(data_meta, rx_timestamp, sizeof(*rx_timestamp));
> >>>      }
> >>> }
> >>>
> >>> Does that look better?
> >>
> >> I guess it will then be resolved to a direct store, right?
> >> I mean, to smth like
> >>
> >>          if (rx_timestamp)
> >>                  *(u32 *)data_meta = *rx_timestamp;
> >>
> >> , where *rx_timestamp points directly to the Rx descriptor?
> >
> > Right. I should've used that form from the beginning, that memcpy is
> > confusing :-(
> >
> >>>
> >>>>>> +        if (ret != 0)
> >>>>>> +            return XDP_DROP;
> >>>>>> +
> >>>>>> +        data = (void *)(long)ctx->data;
> >>>>>> +        data_meta = (void *)(long)ctx->data_meta;
> >>>>>> +
> >>>>>> +        if (data_meta + sizeof(__u32) > data)
> >>>>>> +            return XDP_DROP;
> >>>>>> +
> >>>>>> +        rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> >>>>>> +        __builtin_memcpy(data_meta, &rx_timestamp, sizeof(__u32));
> >>>>
> >>>> So, this approach first stores hints in some other memory location, and
> >>>> then needs to copy over information into the data_meta area. That isn't good
> >>>> from a performance perspective.
> >>>>
> >>>> My idea is to store it in the final data_meta destination immediately.
> >>>
> >>> This approach doesn't have to store the hints in the other memory
> >>> location. xdp_buff->priv can point to the real hw descriptor and the
> >>> kfunc can have a bytecode that extracts the data from the hw
> >>> descriptors. For this particular RFC, we can think that 'skb' is that
> >>> hw descriptor for veth driver.
>
> Once you point xdp_buff->priv to the real hw descriptor, then we also
> need to have some additional data/pointers to NIC hardware info + HW
> setup state. You will hit some of the same challenges as John, like
> hardware/firmware revisions and chip models, that Jakub pointed out.
> Because your approach stays with the driver code, I guess it will be a
> bit easier code-wise. Maybe we can store the data/pointers needed for
> this in xdp_rxq_info (xdp->rxq).
>
> I would need to see some code juggling this HW NIC state from the
> kfunc expansion to be convinced this is the right approach.

I've replied to Martin in another thread, but that's a good point. We
might need to do locks while parsing the descriptors and converting to
kernel time, so maybe assuming that everything can be unrolled won't
stand 100%. OTOH, it seems like unrolled bytecode can (with some
quirks) always call into some driver-specific C function...

I'm trying to convert a couple of drivers (without really testing the
implementation) and see whether there are any other big issues.


> >>
> >> I really do think intermediate stores can be avoided with this
> >> approach.
> >> Oh, and BTW, if we plan to use a particular Hint in the BPF program
> >> only, there's no need to place it in the metadata area at all, is
> >> there? You only want to get it in your code, so just retrieve it to
> >> the stack and that's it. data_meta is only for cases when you want
> >> hints to appear in AF_XDP.
> >
> > Correct.
>
> It is not *only* AF_XDP that needs data stored in data_meta.
>
> The stores to data_meta are also needed for veth and cpumap, because the HW
> descriptor is "out-of-scope" and thus no longer available.
>
>
> >
> >>>> Do notice that in my approach, the existing ethtool config setting and
> >>>> socket options (for timestamps) still apply.  Thus, each individual
> >>>> hardware hint are already configurable. Thus we already have a config
> >>>> interface. I do acknowledge, that in-case a feature is disabled it still
> >>>> takes up space in data_meta areas, but importantly it is NOT stored into
> >>>> the area (for performance reasons).
> >>>
> >>> That should be the case with this rfc as well, isn't it? Worst case
> >>> scenario, that kfunc bytecode can explicitly check ethtool options and
> >>> return false if it's disabled?
> >>
> >> (to Jesper)
> >>
> >> Once again, the Ethtool idea doesn't work. Let's say you have roughly
> >> 50% of frames to be consumed by XDP, the other 50% go to the skb path and
> >> the stack. In skb, I want as much HW data as possible: checksums,
> >> hash and so on. Let's say in the XDP prog I want only the timestamp. What's
> >> then? Disable everything but stamp and kill skb path? Enable
> >> everything and kill XDP path?
> >> Stanislav's approach allows you to request only fields you need from
> >> the BPF prog directly, I don't see any reasons for adding one more
> >> layer "oh no I won't give you checksum because it's disabled via
> >> Ethtool".
> >> Maybe I get something wrong, pls explain then :P
> >
> > Agree, good point.
>
> Stanislav's (and John's) proposal is definitely focused on and addressing
> something other than my patchset.
>
> I optimized the XDP-hints population (for i40e) down to 6 nanosec (on a
> 3.6 GHz CPU = 21 cycles).  Plus, I added an ethtool switch to turn it
> off for those XDP users that cannot live with this overhead.  Hoping
> this would be fast enough such that we didn't have to add this layer.
> (If XDP returns XDP_PASS then this decoded info can be used for the SKB
> creation. Thus, this is essentially just moving the RX-desc decoding a bit
> earlier in the driver.)
>
> One of my use-cases is getting RX-checksum support into xdp_frames and
> transferring this to SKB creation time.  I have done a number of
> measurements[1] to find out how much performance we can gain for UDP
> packets (1500 bytes) with/without RX-checksum.  The initial result showed I
> saved 91 nanosec, but that was while avoiding touching the data.  Doing full
> userspace UDP delivery with a copy (or copy+checksum) showed the real
> saving was 54 nanosec.  In this context, the 6 nanosec was very small.
> Thus, I chose not to pursue a BPF layer for individual fields.
>
>   [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
>
> Sure, it is super cool if we can create this BPF layer that programmatically
> selects individual fields from the descriptor, and maybe we ALSO need that.
> Could this layer still be added after my patchset(?), as one could
> disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
> only the fields needed by the specific XDP-prog use-case.
> Could they also co-exist(?): kfuncs/kptr could extend the
> xdp_hints_rx_common struct (in data_meta area) with more advanced
> offload-hints and then update the BTF-ID (yes, BPF can already resolve
> its own BTF-IDs from BPF-prog code).
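
Right, resolving the BTF-ID from the prog itself already works today
via libbpf's bpf_core_type_id_local(); roughly (a sketch: the extended
struct below is illustrative, and it reuses the xdp_hints_common struct
you posted earlier in this thread):

#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Illustrative extension of the common hints struct from your set. */
struct xdp_hints_rx_extended {
	__u64 rx_timestamp;
	struct xdp_hints_common common;	/* btf_full_id lives in here */
};

static __always_inline void tag_hints(struct xdp_hints_rx_extended *hints)
{
	/* type-ID half only; the BTF object-ID half is left out here */
	hints->common.btf_full_id =
		bpf_core_type_id_local(struct xdp_hints_rx_extended);
}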
>
> Great to see all the discussions and different opinions :-)
> --Jesper
>


* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01 20:12                     ` Stanislav Fomichev
@ 2022-11-01 21:17                       ` Martin KaFai Lau
  0 siblings, 0 replies; 50+ messages in thread
From: Martin KaFai Lau @ 2022-11-01 21:17 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, Bezdeka, Florian, kuba,
	john.fastabend, alexandr.lobakin, anatoly.burakov, song, Deric,
	Nemanja, andrii, Kiszka, Jan, magnus.karlsson, willemb, ast,
	brouer, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints, netdev,
	jolsa, haoluo, Yonghong Song

On 11/1/22 1:12 PM, Stanislav Fomichev wrote:
>>>>> As for the kfunc-based interface, I think it shows some promise.
>>>>> Exposing a list of function names to retrieve individual metadata items
>>>>> instead of a struct layout is sorta comparable in terms of developer UI
>>>>> accessibility etc (IMO).
>>>>
>>>> Looks like there are quite some use cases for hw_timestamp.
>>>> Do you think we could add it to the uapi like struct xdp_md?
>>>>
>>>> The following is the current xdp_md:
>>>> struct xdp_md {
>>>>            __u32 data;
>>>>            __u32 data_end;
>>>>            __u32 data_meta;
>>>>            /* Below access go through struct xdp_rxq_info */
>>>>            __u32 ingress_ifindex; /* rxq->dev->ifindex */
>>>>            __u32 rx_queue_index;  /* rxq->queue_index  */
>>>>
>>>>            __u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> };
>>>>
>>>> We could add  __u64 hw_timestamp to the xdp_md so user
>>>> can just do xdp_md->hw_timestamp to get the value.
>>>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>>>> available.
>>>>
>>>> Inside the kernel, the ctx rewriter can generate code
>>>> to call driver specific function to retrieve the data.
>>>
>>> If the driver generates the code to retrieve the data, how's that
>>> different from the kfunc approach?
>>> The only difference I see is that it would be a stronger UAPI than
>>> the kfuncs?
>>
>> Another thing may be worth considering: some hints for some HW/driver may be
>> harder (or may not be worth it) to unroll/inline.  For example, I see a driver
>> doing spin_lock_bh while getting the hwtstamp.  For this case, keeping the
>> kfunc call and avoiding the unroll/inline may be the right thing to do.
> 
> Yeah, I'm trying to look at the drivers right now and doing
> spinlocks/seqlocks might complicate the story...
> But it seems like in the worst case, the unrolled bytecode can always
> resort to calling a kernel function?

Unroll the common cases and call a kernel function for everything else? That
should be doable.  The bpf prog calling it as a kfunc will have more
flexibility this way.
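
To make that concrete, here is a minimal sketch of a driver unroll
callback (the driver struct, func id and return convention below are
all invented for illustration; only the ndo_unroll_kfunc idea is from
the RFC):

/* Purely illustrative: unroll the rx-timestamp kfunc into a single
 * load when the descriptor allows it; otherwise emit nothing so the
 * regular kfunc call remains as the slow-path fallback.  Assumes the
 * kfunc arg in r1 can be rewritten to point at the rx descriptor.
 */
static int mydrv_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
                              struct bpf_insn *insn)
{
        if (func_id == MYDRV_KFUNC_RX_TIMESTAMP &&
            !mydrv_tstamp_needs_lock()) {
                /* r0 = desc->tstamp */
                *insn = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
                                    offsetof(struct mydrv_rx_desc, tstamp));
                return 1;       /* one instruction emitted */
        }
        return 0;               /* not unrolled; keep the kfunc call */
}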

> (we might need to have some scratch area to preserve r1-r5 and we
> can't touch r6-r9 because we are not in a real call, but seems doable;
> I'll try to illustrate with a bunch of examples)
> 
> 
>>>> The kfunc approach can be used for *less* common use cases?
>>>
>>> What's the advantage of having two approaches when one can cover
>>> common and uncommon cases?
>>>
>>>>> There are three main drawbacks, AFAICT:
>>>>>
>>>>> 1. It requires driver developers to write and maintain the code that
>>>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>>>> with some internal helpers though (like, e.g., a
>>>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>>>> the required JMP/MOV/LDX instructions?)
>>>>>
>>>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>> metadata area. We could solve this with some code in libxdp, though; if
>>>>> this code can be made generic enough (so it just dumps the available
>>>>> metadata functions from the running kernel at load time), it may be
>>>>> possible to make it generic enough that it will be forward-compatible
>>>>> with new versions of the kernel that add new fields, which should
>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>
>>>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>> used for programs doing such redirects. E.g., you could just call
>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>> would recursively expand into all the kfunc calls needed to extract the
>>>>> metadata supported by the SKB path?
>>>>>
>>>>> -Toke
>>>>>
>>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver
  2022-11-01 13:18             ` Jesper Dangaard Brouer
  2022-11-01 20:12               ` Stanislav Fomichev
@ 2022-11-01 22:23               ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-01 22:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Stanislav Fomichev, Alexander Lobakin
  Cc: brouer, Jesper Dangaard Brouer, Martin KaFai Lau, ast, daniel,
	andrii, song, yhs, John Fastabend, kpsingh, haoluo, jolsa,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

>>>>> So, this approach first stores hints in some other memory location, and
>>>>> then needs to copy that information into the data_meta area. That isn't
>>>>> good from a performance perspective.
>>>>>
>>>>> My idea is to store it in the final data_meta destination immediately.
>>>>
>>>> This approach doesn't have to store the hints in the other memory
>>>> location. xdp_buff->priv can point to the real hw descriptor and the
>>>> kfunc can have a bytecode that extracts the data from the hw
>>>> descriptors. For this particular RFC, we can think that 'skb' is that
>>>> hw descriptor for veth driver.
>
> Once you point xdp_buff->priv to the real hw descriptor, then we also
> need some additional data/pointers to NIC hardware info + HW
> setup state. You will hit some of the same challenges as John, like
> hardware/firmware revisions and chip models, that Jakub pointed out.
> Because your approach stays within the driver code, I guess it will be a
> bit easier code-wise. Maybe we can store the data/pointers needed for
> this in xdp_rxq_info (xdp->rxq).
>
> I would need to see some code that juggles this HW NIC state from the
> kfunc expansion to be convinced this is the right approach.

+1 on needing to see this working for the actual metadata we want to
support, but I think the kfunc approach otherwise shows promise; see
below.

[...]

> Sure it is super cool if we can create this BPF layer that programmatically
> selects individual fields from the descriptor, and maybe we ALSO need that.
> Could this layer still be added after my patchset(?), as one could
> disable the XDP-hints (via ethtool) and then use kfuncs/kptr to extract
> only the fields needed by the specific XDP-prog use-case.
> Could they also co-exist(?), kfuncs/kptr could extend the
> xdp_hints_rx_common struct (in data_meta area) with more advanced
> offload-hints and then update the BTF-ID (yes, BPF can already resolve
> its own BTF-IDs from BPF-prog code).

I actually think the two approaches are more similar than they appear
from a user-facing API perspective. Or at least they should be.

What I mean is that, with the BTF-ID approach, we still expect people to
write code like this (from Stanislav's example in the other xdp_hints thread[0]):

if (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
populated at runtime by libbpf? */) {
  // do something with rx_timestamp
  // also, handle xdp_hints_ixgbe and then xdp_hints_common ?
} else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
  // do something else
  // plus explicitly handle xdp_hints_common here?
} else {
  // handle xdp_hints_common
}

whereas with kfuncs (from this thread) this becomes:

if (xdp_metadata_rx_timestamp_exists(ctx))
  timestamp = xdp_metadata_rx_timestamp(ctx);


We can hide the former behind CO-RE macros to make it look like the
latter. But because we're just exposing the BTF IDs, people can in fact
just write code like the example above (directly checking the BTF IDs),
and that will work fine, but has a risk of leading to a proliferation of
device-specific XDP programs. Whereas with kfuncs we keep all this stuff
internal to the kernel (inside the kfuncs), making it much easier to
change it later.
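
(For illustration, such a CO-RE wrapper could look roughly like the
below; bpf_core_type_id_kernel() is the libbpf macro that resolves a
kernel type's BTF ID at load time, and the struct/variable names are
taken from the example above:)

/* Hypothetical convenience macro hiding the raw BTF-ID comparison: */
#define XDP_HINTS_IS(btf_id, type) \
        ((btf_id) == bpf_core_type_id_kernel(type))

if (XDP_HINTS_IS(ctx_hints_btf_id, struct xdp_hints_ixgbe_timestamp)) {
        /* do something with rx_timestamp */
}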

Quoting yourself from the other thread[1]:

> In this patchset I'm trying to balance the different users. And via BTF
> I'm trying hard not to create more UAPI (e.g. more fixed fields avail in
> xdp_md that we cannot get rid of). And trying to add driver flexibility
> on-top of the common struct.  This flexibility seems to be stalling the
> patchset as we haven't found the perfect way to express this (yet) given
> BTF layout is per driver.

With kfuncs we kinda sidestep this issue because the kernel can handle
the per-driver specialisation by the unrolling trick. The drawback being
that programs will be tied to a particular device if they are using
metadata, but I think that's an acceptable trade-off.

-Toke

[0] https://lore.kernel.org/r/CAKH8qBuYVk7QwVOSYrhMNnaKFKGd7M9bopDyNp6-SnN6hSeTDQ@mail.gmail.com
[1] https://lore.kernel.org/r/ad360933-953a-7a99-5057-4d452a9a6005@redhat.com


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-01 17:05                     ` Martin KaFai Lau
  2022-11-01 20:12                       ` Stanislav Fomichev
@ 2022-11-02 14:06                       ` Jesper Dangaard Brouer
  2022-11-02 22:01                         ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-02 14:06 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: brouer, Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo,
	Toke Hoiland Jorgensen



On 01/11/2022 18.05, Martin KaFai Lau wrote:
> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau 
>> <martin.lau@linux.dev> wrote:
>>>
>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>> 2. AF_XDP programs won't be able to access the metadata without 
>>>>> using a
>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>> metadata area. We could solve this with some code in libxdp, 
>>>>> though; if
>>>>> this code can be made generic enough (so it just dumps the available
>>>>> metadata functions from the running kernel at load time), it may be
>>>>> possible to make it generic enough that it will be forward-compatible
>>>>> with new versions of the kernel that add new fields, which should
>>>>> alleviate Florian's concern about keeping things in sync.
>>>>
>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>> info about at which offset the user would like to place the metadata
>>>> and the library can generate the required bytecode?
>>>>
>>>>> 3. It will make it harder to consume the metadata when building 
>>>>> SKBs. I
>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>> used for programs doing such redirects. E.g., you could just call
>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>> would recursively expand into all the kfunc calls needed to extract 
>>>>> the
>>>>> metadata supported by the SKB path?
>>>>
>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>
>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>> Not sure where is the best point to specify this prog though.  
>>> Somehow during
>>> bpf_xdp_redirect_map?
>>> or this prog belongs to the target cpumap and the xdp prog 
>>> redirecting to this
>>> cpumap has to write the meta layout in a way that the cpumap is 
>>> expecting?
>>
>> We're probably interested in triggering it from the places where xdp
>> frames can eventually be converted into skbs?
>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>> anything that's not XDP_DROP / AF_XDP redirect).
>> We can probably make it magically work, and can generate
>> kernel-digestible metadata whenever data == data_meta, but the
>> question - should we?
>> (need to make sure we won't regress any existing cases that are not
>> relying on the metadata)
> 
> Instead of having some kernel-digestible meta data, how about calling 
> another bpf prog to initialize the skb fields from the meta area after 
> __xdp_build_skb_from_frame() in the cpumap, so 
> run_xdp_set_skb_fields_from_metadata() may be a better name.
> 

I very much like this idea of calling another bpf prog to initialize the
SKB fields from the meta area. (As a reminder, the data needs to come from
the meta area, because at this point the hardware RX-desc is out-of-scope).
I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.

We could invoke this BPF-prog inside __xdp_build_skb_from_frame().

We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
context inputs. Right?  (Not sure if this is acceptable with the BPF
maintainers' new rules)

> The xdp_prog@rx sets the meta data and then redirects.  If the
> xdp_prog@rx can also specify a xdp prog to initialize the skb fields 
> from the meta area, then there is no need to have a kfunc to enforce a 
> kernel-digestible layout.  Not sure what is a good way to specify this 
> xdp_prog though...

The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
__xdp_build_skb_from_frame() is that it needs to know how to decode the
meta area for every device driver or XDP-prog populating it (as veth
and cpumap can get redirected packets from multiple device drivers).
Sure, using a common function/helper/macro like
xdp_copy_metadata_for_skb() could help reduce this multiplexing, but we
want to have maximum flexibility to extend this without having to update
the kernel, right?

Fortunately __xdp_build_skb_from_frame() has a net_device parameter
that points to the device the frame was received on (xdp_frame->dev_rx).
Thus, we could extend net_device and add this BPF-prog on a per
net_device basis.  This could function as a driver BPF-prog callback
that populates the SKB fields from the XDP meta data.
Is this a good or bad idea?
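
Rough sketch of what I imagine (the xdp2skb_prog member and the ctx
struct are invented here; nothing like this exists yet):

/* Hypothetical: run a per-netdev BPF-prog after the generic SKB
 * fields have been filled in by __xdp_build_skb_from_frame().
 */
static void xdp2skb_run_prog(struct net_device *dev,
                             struct xdp_frame *xdpf, struct sk_buff *skb)
{
        struct xdp2skb_ctx ctx = {          /* invented context type */
                .skb      = skb,
                .meta     = xdpf->data - xdpf->metasize,
                .meta_len = xdpf->metasize,
        };

        if (dev->xdp2skb_prog)              /* invented net_device member */
                bpf_prog_run(dev->xdp2skb_prog, &ctx);
}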


>>>> the kernel will be able to understand when converting back to skb?
>>>> IIUC, the xdp program will look something like the following:
>>>>
>>>> if (xdp packet is to be consumed by af_xdp) {
>>>>     // do a bunch of bpf_xdp_metadata_<metadata> calls and assemble 
>>>> your
>>>> own metadata layout
>>>>     return bpf_redirect_map(xsk, ...);
>>>> } else {
>>>>     // if the packet is to be consumed by the kernel
>>>>     xdp_copy_metadata_for_skb(ctx);
>>>>     return bpf_redirect(...);
>>>> }
>>>>
>>>> Sounds like a great suggestion! xdp_copy_metadata_for_skb can maybe
>>>> put some magic number in the first byte(s) of the metadata so the
>>>> kernel can check whether xdp_copy_metadata_for_skb has been called
>>>> previously (or maybe xdp_frame can carry this extra signal, idk).

I'm in favor of adding a flag bit to xdp_frame to signal this.
In __xdp_build_skb_from_frame() we could check this flag signal and then
invoke the BPF-prog type BPF_PROG_TYPE_XDP2SKB.
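
Something along these lines (the flag name, bit position and the
consuming function are invented here):

/* Set by xdp_copy_metadata_for_skb() (sketch only): */
#define XDP_FRAME_F_SKB_METADATA        BIT(3)

/* in __xdp_build_skb_from_frame(): */
if (xdpf->flags & XDP_FRAME_F_SKB_METADATA)
        xdp_populate_skb_fields_from_metadata(skb, xdpf); /* invented */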

--Jesper


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-02 14:06                       ` Jesper Dangaard Brouer
@ 2022-11-02 22:01                         ` Toke Høiland-Jørgensen
  2022-11-02 23:10                           ` Stanislav Fomichev
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-02 22:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Martin KaFai Lau, Stanislav Fomichev
  Cc: brouer, Bezdeka, Florian, kuba, john.fastabend, alexandr.lobakin,
	anatoly.burakov, song, Deric, Nemanja, andrii, Kiszka, Jan,
	magnus.karlsson, willemb, ast, yhs, kpsingh, daniel, bpf,
	mtahhan, xdp-hints, netdev, jolsa, haoluo

Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> On 01/11/2022 18.05, Martin KaFai Lau wrote:
>> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau 
>>> <martin.lau@linux.dev> wrote:
>>>>
>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>> 2. AF_XDP programs won't be able to access the metadata without 
>>>>>> using a
>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>> metadata area. We could solve this with some code in libxdp, 
>>>>>> though; if
>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>> metadata functions from the running kernel at load time), it may be
>>>>>> possible to make it generic enough that it will be forward-compatible
>>>>>> with new versions of the kernel that add new fields, which should
>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>
>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>> info about at which offset the user would like to place the metadata
>>>>> and the library can generate the required bytecode?
>>>>>
>>>>>> 3. It will make it harder to consume the metadata when building 
>>>>>> SKBs. I
>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>> would recursively expand into all the kfunc calls needed to extract 
>>>>>> the
>>>>>> metadata supported by the SKB path?
>>>>>
>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>
>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>> Not sure where is the best point to specify this prog though.  
>>>> Somehow during
>>>> bpf_xdp_redirect_map?
>>>> or this prog belongs to the target cpumap and the xdp prog 
>>>> redirecting to this
>>>> cpumap has to write the meta layout in a way that the cpumap is 
>>>> expecting?
>>>
>>> We're probably interested in triggering it from the places where xdp
>>> frames can eventually be converted into skbs?
>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>> anything that's not XDP_DROP / AF_XDP redirect).
>>> We can probably make it magically work, and can generate
>>> kernel-digestible metadata whenever data == data_meta, but the
>>> question - should we?
>>> (need to make sure we won't regress any existing cases that are not
>>> relying on the metadata)
>> 
>> Instead of having some kernel-digestible meta data, how about calling 
>> another bpf prog to initialize the skb fields from the meta area after 
>> __xdp_build_skb_from_frame() in the cpumap, so 
>> run_xdp_set_skb_fields_from_metadata() may be a better name.
>> 
>
> I very much like this idea of calling another bpf prog to initialize the
> SKB fields from the meta area. (As a reminder, the data needs to come from
> the meta area, because at this point the hardware RX-desc is out-of-scope).
> I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
>
> We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
>
> We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
> run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
> context inputs. Right?  (Not sure if this is acceptable with the BPF
> maintainers' new rules)
>
>> The xdp_prog@rx sets the meta data and then redirects.  If the
>> xdp_prog@rx can also specify a xdp prog to initialize the skb fields 
>> from the meta area, then there is no need to have a kfunc to enforce a 
>> kernel-digestible layout.  Not sure what is a good way to specify this 
>> xdp_prog though...
>
> The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
> __xdp_build_skb_from_frame() is that it needs to know how to decode the
> meta area for every device driver or XDP-prog populating it (as veth
> and cpumap can get redirected packets from multiple device drivers).

If we have the helper to copy the data "out of" the drivers, why do we
need a second BPF program to copy data to the SKB?

I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
each of the kfuncs needed for the metadata used by SKBs, all of which
get unrolled. The helper takes the output of these metadata-extracting
kfuncs and stores it "somewhere". This "somewhere" could well be the
metadata area; but in any case, since it's hidden away inside a helper
(or kfunc) from the calling XDP program's PoV, the helper can just stash
all the data in a fixed format, which __xdp_build_skb_from_frame() can
then just read statically. We could even make this format match the
field layout of struct sk_buff, so all we have to do is memcpy a
contiguous chunk of memory when building the SKB.
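
(A sketch of that consuming side, with placeholder struct and field
names, just to show that no extra BPF program is needed:)

/* All names here are placeholders; the point is that the layout is
 * fixed and known to the kernel.
 */
static void xdp_metadata_set_skb_fields(struct sk_buff *skb,
                                        const struct xdp_skb_meta *meta)
{
        skb->tstamp = meta->rx_timestamp;
        skb_set_hash(skb, meta->rx_hash, PKT_HASH_TYPE_L4);
}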

> Sure, using a common function/helper/macro like
> xdp_copy_metadata_for_skb() could help reduce this multiplexing, but
> we want to have maximum flexibility to extend this without having to
> update the kernel, right?

The extension mechanism is in which kfuncs are available to XDP programs
to extract metadata. The kernel then just becomes another consumer of
those kfuncs, by way of the xdp_copy_metadata_for_skb(); but there could
also be other kfuncs added that are not used for skbs (even
vendor-specific ones if we want to allow that).

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-02 22:01                         ` Toke Høiland-Jørgensen
@ 2022-11-02 23:10                           ` Stanislav Fomichev
  2022-11-03  0:09                             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Stanislav Fomichev @ 2022-11-02 23:10 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, Martin KaFai Lau, brouer, Bezdeka,
	Florian, kuba, john.fastabend, alexandr.lobakin, anatoly.burakov,
	song, Deric, Nemanja, andrii, Kiszka, Jan, magnus.karlsson,
	willemb, ast, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints,
	netdev, jolsa, haoluo

On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>
> > On 01/11/2022 18.05, Martin KaFai Lau wrote:
> >> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
> >>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau
> >>> <martin.lau@linux.dev> wrote:
> >>>>
> >>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
> >>>>>> 2. AF_XDP programs won't be able to access the metadata without
> >>>>>> using a
> >>>>>> custom XDP program that calls the kfuncs and puts the data into the
> >>>>>> metadata area. We could solve this with some code in libxdp,
> >>>>>> though; if
> >>>>>> this code can be made generic enough (so it just dumps the available
> >>>>>> metadata functions from the running kernel at load time), it may be
> >>>>>> possible to make it generic enough that it will be forward-compatible
> >>>>>> with new versions of the kernel that add new fields, which should
> >>>>>> alleviate Florian's concern about keeping things in sync.
> >>>>>
> >>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
> >>>>> But your suggestion sounds good; maybe libxdp can accept some extra
> >>>>> info about at which offset the user would like to place the metadata
> >>>>> and the library can generate the required bytecode?
> >>>>>
> >>>>>> 3. It will make it harder to consume the metadata when building
> >>>>>> SKBs. I
> >>>>>> think the CPUMAP and veth use cases are also quite important, and that
> >>>>>> we want metadata to be available for building SKBs in this path. Maybe
> >>>>>> this can be resolved by having a convenient kfunc for this that can be
> >>>>>> used for programs doing such redirects. E.g., you could just call
> >>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
> >>>>>> would recursively expand into all the kfunc calls needed to extract
> >>>>>> the
> >>>>>> metadata supported by the SKB path?
> >>>>>
> >>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
> >>>>
> >>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
> >>>> Not sure where is the best point to specify this prog though.
> >>>> Somehow during
> >>>> bpf_xdp_redirect_map?
> >>>> or this prog belongs to the target cpumap and the xdp prog
> >>>> redirecting to this
> >>>> cpumap has to write the meta layout in a way that the cpumap is
> >>>> expecting?
> >>>
> >>> We're probably interested in triggering it from the places where xdp
> >>> frames can eventually be converted into skbs?
> >>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
> >>> anything that's not XDP_DROP / AF_XDP redirect).
> >>> We can probably make it magically work, and can generate
> >>> kernel-digestible metadata whenever data == data_meta, but the
> >>> question - should we?
> >>> (need to make sure we won't regress any existing cases that are not
> >>> relying on the metadata)
> >>
> >> Instead of having some kernel-digestible meta data, how about calling
> >> another bpf prog to initialize the skb fields from the meta area after
> >> __xdp_build_skb_from_frame() in the cpumap, so
> >> run_xdp_set_skb_fields_from_metadata() may be a better name.
> >>
> >
> > I very much like this idea of calling another bpf prog to initialize the
> > SKB fields from the meta area. (As a reminder, the data needs to come from
> > the meta area, because at this point the hardware RX-desc is out-of-scope).
> > I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
> >
> > We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
> >
> > We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
> > run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
> > context inputs. Right?  (Not sure if this is acceptable with the BPF
> > maintainers' new rules)
> >
> >> The xdp_prog@rx sets the meta data and then redirects.  If the
> >> xdp_prog@rx can also specify a xdp prog to initialize the skb fields
> >> from the meta area, then there is no need to have a kfunc to enforce a
> >> kernel-digestible layout.  Not sure what is a good way to specify this
> >> xdp_prog though...
> >
> > The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
> > __xdp_build_skb_from_frame() is that it needs to know how to decode the
> > meta area for every device driver or XDP-prog populating it (as veth
> > and cpumap can get redirected packets from multiple device drivers).
>
> If we have the helper to copy the data "out of" the drivers, why do we
> need a second BPF program to copy data to the SKB?
>
> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
> each of the kfuncs needed for the metadata used by SKBs, all of which
> get unrolled. The helper takes the output of these metadata-extracting
> kfuncs and stores it "somewhere". This "somewhere" could well be the
> metadata area; but in any case, since it's hidden away inside a helper
> (or kfunc) from the calling XDP program's PoV, the helper can just stash
> all the data in a fixed format, which __xdp_build_skb_from_frame() can
> then just read statically. We could even make this format match the
> field layout of struct sk_buff, so all we have to do is memcpy a
> contiguous chunk of memory when building the SKB.

+1

I'm currently doing exactly what you're suggesting (minus matching skb layout):

struct xdp_to_skb_metadata {
  u32 magic; // randomized at boot
  ... skb-consumable-metadata in fixed format
} __randomize_layout;

bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
-sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
to fill in the actual data.

Then, at __xdp_build_skb_from_frame time, I have regular kernel
C code that parses that 'struct xdp_to_skb_metadata'.
(To be precise, I'm trying to parse the metadata from
skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not
100% sure that's the right place).
(I also randomize the layout and magic to make sure userspace doesn't
depend on it, because nothing stops this packet from being routed into
an xsk socket.)
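
In (pseudo-)BPF-C, the helper currently expands to roughly the
following (the field set and kfunc names are still very much in flux,
so treat this as a sketch):

static __always_inline int xdp_copy_metadata_for_skb(struct xdp_md *ctx)
{
        struct xdp_to_skb_metadata *meta;

        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                return -1;

        meta = (void *)(long)ctx->data_meta;
        if ((void *)(meta + 1) > (void *)(long)ctx->data)
                return -1;

        meta->magic = xdp_to_skb_metadata_magic;  /* boot-time random */
        meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
        meta->rx_hash = bpf_xdp_metadata_rx_hash(ctx);
        return 0;
}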

> > Sure, using a common function/helper/macro like
> > xdp_copy_metadata_for_skb() could help reduce this multiplexing, but
> > we want to have maximum flexibility to extend this without having to
> > update the kernel, right?
>
> The extension mechanism is in which kfuncs are available to XDP programs
> to extract metadata. The kernel then just becomes another consumer of
> those kfuncs, by way of the xdp_copy_metadata_for_skb(); but there could
> also be other kfuncs added that are not used for skbs (even
> vendor-specific ones if we want to allow that).
>
> -Toke
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-02 23:10                           ` Stanislav Fomichev
@ 2022-11-03  0:09                             ` Toke Høiland-Jørgensen
  2022-11-03 12:01                               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-03  0:09 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Jesper Dangaard Brouer, Martin KaFai Lau, brouer, Bezdeka,
	Florian, kuba, john.fastabend, alexandr.lobakin, anatoly.burakov,
	song, Deric, Nemanja, andrii, Kiszka, Jan, magnus.karlsson,
	willemb, ast, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints,
	netdev, jolsa, haoluo

Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>>
>> > On 01/11/2022 18.05, Martin KaFai Lau wrote:
>> >> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>> >>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau
>> >>> <martin.lau@linux.dev> wrote:
>> >>>>
>> >>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>> >>>>>> 2. AF_XDP programs won't be able to access the metadata without
>> >>>>>> using a
>> >>>>>> custom XDP program that calls the kfuncs and puts the data into the
>> >>>>>> metadata area. We could solve this with some code in libxdp,
>> >>>>>> though; if
>> >>>>>> this code can be made generic enough (so it just dumps the available
>> >>>>>> metadata functions from the running kernel at load time), it may be
>> >>>>>> possible to make it generic enough that it will be forward-compatible
>> >>>>>> with new versions of the kernel that add new fields, which should
>> >>>>>> alleviate Florian's concern about keeping things in sync.
>> >>>>>
>> >>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>> >>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>> >>>>> info about at which offset the user would like to place the metadata
>> >>>>> and the library can generate the required bytecode?
>> >>>>>
>> >>>>>> 3. It will make it harder to consume the metadata when building
>> >>>>>> SKBs. I
>> >>>>>> think the CPUMAP and veth use cases are also quite important, and that
>> >>>>>> we want metadata to be available for building SKBs in this path. Maybe
>> >>>>>> this can be resolved by having a convenient kfunc for this that can be
>> >>>>>> used for programs doing such redirects. E.g., you could just call
>> >>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>> >>>>>> would recursively expand into all the kfunc calls needed to extract
>> >>>>>> the
>> >>>>>> metadata supported by the SKB path?
>> >>>>>
>> >>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>> >>>>
>> >>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>> >>>> Not sure where is the best point to specify this prog though.
>> >>>> Somehow during
>> >>>> bpf_xdp_redirect_map?
>> >>>> or this prog belongs to the target cpumap and the xdp prog
>> >>>> redirecting to this
>> >>>> cpumap has to write the meta layout in a way that the cpumap is
>> >>>> expecting?
>> >>>
>> >>> We're probably interested in triggering it from the places where xdp
>> >>> frames can eventually be converted into skbs?
>> >>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>> >>> anything that's not XDP_DROP / AF_XDP redirect).
>> >>> We can probably make it magically work, and can generate
>> >>> kernel-digestible metadata whenever data == data_meta, but the
>> >>> question - should we?
>> >>> (need to make sure we won't regress any existing cases that are not
>> >>> relying on the metadata)
>> >>
>> >> Instead of having some kernel-digestible meta data, how about calling
>> >> another bpf prog to initialize the skb fields from the meta area after
>> >> __xdp_build_skb_from_frame() in the cpumap, so
>> >> run_xdp_set_skb_fields_from_metadata() may be a better name.
>> >>
>> >
>> > I very much like this idea of calling another bpf prog to initialize the
>> > SKB fields from the meta area. (As a reminder, the data needs to come from
>> > the meta area, because at this point the hardware RX-desc is out-of-scope).
>> > I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
>> >
>> > We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
>> >
>> > We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
>> > run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
>> > context inputs. Right?  (Not sure if this is acceptable with the BPF
>> > maintainers' new rules)
>> >
>> >> The xdp_prog@rx sets the meta data and then redirects.  If the
>> >> xdp_prog@rx can also specify a xdp prog to initialize the skb fields
>> >> from the meta area, then there is no need to have a kfunc to enforce a
>> >> kernel-digestible layout.  Not sure what is a good way to specify this
>> >> xdp_prog though...
>> >
>> > The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
>> > __xdp_build_skb_from_frame() is that it needs to know how to decode the
>> > meta area for every device driver or XDP-prog populating it (as veth
>> > and cpumap can get redirected packets from multiple device drivers).
>>
>> If we have the helper to copy the data "out of" the drivers, why do we
>> need a second BPF program to copy data to the SKB?
>>
>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
>> each of the kfuncs needed for the metadata used by SKBs, all of which
>> get unrolled. The helper takes the output of these metadata-extracting
>> kfuncs and stores it "somewhere". This "somewhere" could well be the
>> metadata area; but in any case, since it's hidden away inside a helper
>> (or kfunc) from the calling XDP program's PoV, the helper can just stash
>> all the data in a fixed format, which __xdp_build_skb_from_frame() can
>> then just read statically. We could even make this format match the
>> field layout of struct sk_buff, so all we have to do is memcpy a
>> contiguous chunk of memory when building the SKB.
>
> +1
>
> I'm currently doing exactly what you're suggesting (minus matching skb layout):
>
> struct xdp_to_skb_metadata {
>   u32 magic; // randomized at boot
>   ... skb-consumable-metadata in fixed format
> } __randomize_layout;
>
> bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
> -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
> to fill in the actual data.
>
> Then, at __xdp_build_skb_from_frame time, I have regular kernel
> C code that parses that 'struct xdp_to_skb_metadata'.
> (To be precise, I'm trying to parse the metadata from
> skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not
> 100% sure that's the right place).
> (I also randomize the layout and magic to make sure userspace doesn't
> depend on it, because nothing stops this packet from being routed into
> an xsk socket.)

Ah, nice trick with __randomize_layout - I agree we need to do something
to prevent userspace from inadvertently starting to rely on this, and
this seems like a great solution!

Look forward to seeing what the whole thing looks like in a more
complete form :)

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-03  0:09                             ` Toke Høiland-Jørgensen
@ 2022-11-03 12:01                               ` Jesper Dangaard Brouer
  2022-11-03 12:48                                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-03 12:01 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev
  Cc: brouer, Jesper Dangaard Brouer, Martin KaFai Lau, Bezdeka,
	Florian, kuba, john.fastabend, alexandr.lobakin, anatoly.burakov,
	song, Deric, Nemanja, andrii, Kiszka, Jan, magnus.karlsson,
	willemb, ast, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints,
	netdev, jolsa, haoluo


On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
>> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>>>
>>>> On 01/11/2022 18.05, Martin KaFai Lau wrote:
>>>>> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>>>>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau
>>>>>> <martin.lau@linux.dev> wrote:
>>>>>>>
>>>>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>>>>> 2. AF_XDP programs won't be able to access the metadata without
>>>>>>>>> using a
>>>>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>>>>> metadata area. We could solve this with some code in libxdp,
>>>>>>>>> though; if
>>>>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>>>>> metadata functions from the running kernel at load time), it may be
>>>>>>>>> possible to make it generic enough that it will be forward-compatible
>>>>>>>>> with new versions of the kernel that add new fields, which should
>>>>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>>>>
>>>>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>>>>> info about at which offset the user would like to place the metadata
>>>>>>>> and the library can generate the required bytecode?
>>>>>>>>
>>>>>>>>> 3. It will make it harder to consume the metadata when building
>>>>>>>>> SKBs. I
>>>>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>>>>> would recursively expand into all the kfunc calls needed to extract
>>>>>>>>> the
>>>>>>>>> metadata supported by the SKB path?
>>>>>>>>
>>>>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>>>>
>>>>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>>>>> Not sure where is the best point to specify this prog though.
>>>>>>> Somehow during
>>>>>>> bpf_xdp_redirect_map?
>>>>>>> or this prog belongs to the target cpumap and the xdp prog
>>>>>>> redirecting to this
>>>>>>> cpumap has to write the meta layout in a way that the cpumap is
>>>>>>> expecting?
>>>>>>
>>>>>> We're probably interested in triggering it from the places where xdp
>>>>>> frames can eventually be converted into skbs?
>>>>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>>>>> anything that's not XDP_DROP / AF_XDP redirect).
>>>>>> We can probably make it magically work, and can generate
>>>>>> kernel-digestible metadata whenever data == data_meta, but the
>>>>>> question - should we?
>>>>>> (need to make sure we won't regress any existing cases that are not
>>>>>> relying on the metadata)
>>>>>
>>>>> Instead of having some kernel-digestible meta data, how about calling
>>>>> another bpf prog to initialize the skb fields from the meta area after
>>>>> __xdp_build_skb_from_frame() in the cpumap, so
>>>>> run_xdp_set_skb_fields_from_metadata() may be a better name.
>>>>>
>>>>
>>>> I very much like this idea of calling another bpf prog to initialize the
>>>> SKB fields from the meta area. (As a reminder, the data needs to come from
>>>> the meta area, because at this point the hardware RX-desc is out-of-scope).
>>>> I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
>>>>
>>>> We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
>>>>
>>>> We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
>>>> run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
>>>> context inputs. Right?  (Not sure if this is acceptable with the BPF
>>>> maintainers' new rules)
>>>>
>>>>> The xdp_prog@rx sets the meta data and then redirects.  If the
>>>>> xdp_prog@rx can also specify a xdp prog to initialize the skb fields
>>>>> from the meta area, then there is no need to have a kfunc to enforce a
>>>>> kernel-digestible layout.  Not sure what is a good way to specify this
>>>>> xdp_prog though...
>>>>
>>>> The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
>>>> __xdp_build_skb_from_frame() is that it needs to know how to decode the
>>>> meta area for every device driver or XDP-prog populating it (as veth
>>>> and cpumap can get redirected packets from multiple device drivers).
>>>
>>> If we have the helper to copy the data "out of" the drivers, why do we
>>> need a second BPF program to copy data to the SKB?
>>>

IMHO the second BPF program to populate the SKB is needed to add
flexibility and extensibility.

My end-goal here is to speed up packet parsing.
This BPF-prog should (in time) be able to update skb->transport_header
and skb->network_header.  As I mentioned before, the HW RX-hash already
tells us the L3 and L4 protocols and, in most cases, the header length.
Even without HW-hints, the XDP-prog has likely parsed the packet once.
This parse information is lost today, and redone by the netstack. What
about storing this header parse info in the meta data and re-using it
in this new XDP2SKB hook?

The reason for suggesting this BPF-prog to be a callback, associated
with the net_device, was that hardware is going to differ in which HW
hints it supports.  Thus, we can avoid a generic C-function that needs to
check for all the possible hints, and instead have a BPF-prog that only
contains the code that is relevant for this net_device.


>>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
>>> each of the kfuncs needed for the metadata used by SKBs, all of which
>>> get unrolled. The helper takes the output of these metadata-extracting
>>> kfuncs and stores it "somewhere". This "somewhere" could well be the
>>> metadata area; but in any case, since it's hidden away inside a helper
>>> (or kfunc) from the calling XDP program's PoV, the helper can just stash
>>> all the data in a fixed format, which __xdp_build_skb_from_frame() can
>>> then just read statically. We could even make this format match the
>>> field layout of struct sk_buff, so all we have to do is memcpy a
>>> contiguous chunk of memory when building the SKB.
>>
>> +1

Sorry, I think this "hiding" layout trick is going in the wrong direction.

Imagine the use-case of cpumap redirect.  The physical device XDP-prog
calls xdp_copy_metadata_for_skb() to extract info from the RX-desc, then
it redirects into the cpumap.  On the remote CPU, the xdp_frame is picked
up, and then I want to run another XDP-prog that wants to look at these
HW-hints, and then likely call XDP_PASS, which creates the SKB, also
using these HW-hints.  I take it that would not be possible when using
the xdp_copy_metadata_for_skb() helper?

>>
>> I'm currently doing exactly what you're suggesting (minus matching skb layout):
>>
>> struct xdp_to_skb_metadata {
>>    u32 magic; // randomized at boot
>>    ... skb-consumable-metadata in fixed format
>> } __randomize_layout;
>>
>> bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
>> -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
>> to fill in the actual data.
>>
>> Then, at __xdp_build_skb_from_frame time, I have regular kernel
>> C code that parses that 'struct xdp_to_skb_metadata'.
>> (To be precise, I'm trying to parse the metadata from
>> skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not
>> 100% sure that's the right place).
>> (I also randomize the layout and magic to make sure userspace doesn't
>> depend on it, because nothing stops this packet from being routed into
>> an xsk socket.)
> 
> Ah, nice trick with __randomize_layout - I agree we need to do something
> to prevent userspace from inadvertently starting to rely on this, and
> this seems like a great solution!

Sorry, I disagree with where this is going.  Why do we all of a sudden
want to prevent userspace (e.g. AF_XDP) from using this data?!?

The whole exercise started with wanting to provide AF_XDP with these
HW-hints. The hints a standard AF_XDP user wants are likely very similar
to what the SKB user wants.  Why does the AF_XDP user need to open-code this?

The BTF-ID scheme precisely allows us to expose this layout to
userspace, and at the same time have the freedom to change this in kernel
space, as userspace must decode the BTF-layout before reading this.
I was hoping xdp_copy_metadata_for_skb() could simply use the BTF-ID
scheme, with the BTF-ID of struct xdp_hints_rx_common; is that too much to
ask for?  You can just consider the BTF-ID as the magic number, as it
will be more-or-less random per kernel build (and module load order).
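
From the BPF-prog side, checking that magic could then look like the
below (where exactly the BTF-ID member sits in the meta area is
illustrative here; bpf_core_type_id_kernel() is resolved at load time
by libbpf):

struct xdp_hints_rx_common *hints = (void *)(long)ctx->data_meta;

if ((void *)(hints + 1) <= (void *)(long)ctx->data &&
    hints->btf_id == bpf_core_type_id_kernel(struct xdp_hints_rx_common)) {
        /* layout is the common one, safe to read the fields */
}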

> Look forward to seeing what the whole thing looks like in a more
> complete form :)

I'm sort of on-board with the kfuncs and unroll-tricks, if I can see
some driver code that handles the issue of exposing the HW setup state
needed to decode the RX-desc format.

I sense that I myself haven't been good enough at explaining/conveying the
BTF-ID scheme.  Next week, I will code some examples that demo how
BTF-IDs can be used from BPF-progs, even as a communication channel
between different BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog ->
TC-BPF).

--Jesper


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-03 12:01                               ` Jesper Dangaard Brouer
@ 2022-11-03 12:48                                 ` Toke Høiland-Jørgensen
  2022-11-03 15:25                                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 50+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-03 12:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Stanislav Fomichev
  Cc: brouer, Jesper Dangaard Brouer, Martin KaFai Lau, Bezdeka,
	Florian, kuba, john.fastabend, alexandr.lobakin, anatoly.burakov,
	song, Deric, Nemanja, andrii, Kiszka, Jan, magnus.karlsson,
	willemb, ast, yhs, kpsingh, daniel, bpf, mtahhan, xdp-hints,
	netdev, jolsa, haoluo

Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>>> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>>>>
>>>>> On 01/11/2022 18.05, Martin KaFai Lau wrote:
>>>>>> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>>>>>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau
>>>>>>> <martin.lau@linux.dev> wrote:
>>>>>>>>
>>>>>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>>>>>> 2. AF_XDP programs won't be able to access the metadata without
>>>>>>>>>> using a
>>>>>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>>>>>> metadata area. We could solve this with some code in libxdp,
>>>>>>>>>> though; if
>>>>>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>>>>>> metadata functions from the running kernel at load time), it may be
>>>>>>>>>> possible to make it generic enough that it will be forward-compatible
>>>>>>>>>> with new versions of the kernel that add new fields, which should
>>>>>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>>>>>
>>>>>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>>>>>> info about at which offset the user would like to place the metadata
>>>>>>>>> and the library can generate the required bytecode?
>>>>>>>>>
>>>>>>>>>> 3. It will make it harder to consume the metadata when building
>>>>>>>>>> SKBs. I
>>>>>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>>>>>> would recursively expand into all the kfunc calls needed to extract
>>>>>>>>>> the
>>>>>>>>>> metadata supported by the SKB path?
>>>>>>>>>
>>>>>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>>>>>
>>>>>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>>>>>> Not sure where is the best point to specify this prog though.
>>>>>>>> Somehow during
>>>>>>>> bpf_xdp_redirect_map?
>>>>>>>> or this prog belongs to the target cpumap and the xdp prog
>>>>>>>> redirecting to this
>>>>>>>> cpumap has to write the meta layout in a way that the cpumap is
>>>>>>>> expecting?
>>>>>>>
>>>>>>> We're probably interested in triggering it from the places where xdp
>>>>>>> frames can eventually be converted into skbs?
>>>>>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>>>>>> anything that's not XDP_DROP / AF_XDP redirect).
>>>>>>> We can probably make it magically work, and can generate
>>>>>>> kernel-digestible metadata whenever data == data_meta, but the
>>>>>>> question - should we?
>>>>>>> (need to make sure we won't regress any existing cases that are not
>>>>>>> relying on the metadata)
>>>>>>
>>>>>> Instead of having some kernel-digestible meta data, how about calling
>>>>>> another bpf prog to initialize the skb fields from the meta area after
>>>>>> __xdp_build_skb_from_frame() in the cpumap, so
>>>>>> run_xdp_set_skb_fields_from_metadata() may be a better name.
>>>>>>
>>>>>
>>>>> I very much like this idea of calling another bpf prog to initialize the
>>>>> SKB fields from the meta area. (As a reminder, the data needs to come from
>>>>> the meta area, because at this point the hardware RX-desc is out-of-scope).
>>>>> I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
>>>>>
>>>>> We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
>>>>>
>>>>> We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
>>>>> run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
>>>>> context inputs. Right?  (Not sure if this is acceptable with the BPF
>>>>> maintainers' new rules)
>>>>>
>>>>>> The xdp_prog@rx sets the meta data and then redirects.  If the
>>>>>> xdp_prog@rx can also specify a xdp prog to initialize the skb fields
>>>>>> from the meta area, then there is no need to have a kfunc to enforce a
>>>>>> kernel-digestible layout.  Not sure what is a good way to specify this
>>>>>> xdp_prog though...
>>>>>
>>>>> The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
>>>>> __xdp_build_skb_from_frame() is that it needs to know how to decode the
>>>>> meta area for every device driver or XDP-prog populating it (as veth
>>>>> and cpumap can get redirected packets from multiple device drivers).
>>>>
>>>> If we have the helper to copy the data "out of" the drivers, why do we
>>>> need a second BPF program to copy data to the SKB?
>>>>
>
> IMHO the second BPF program to populate the SKB is needed to add
> flexibility and extensibility.
>
> My end-goal here is to speed up packet parsing.
> This BPF-prog should (in time) be able to update skb->transport_header
> and skb->network_header.  As I mentioned before, the HW RX-hash already
> tells us the L3 and L4 protocols and, in most cases, the header length.
> Even without HW-hints, the XDP-prog has likely parsed the packet once.
> This parse information is lost today, and redone by the netstack. What
> about storing this header parse info in the meta data and re-using it
> in this new XDP2SKB hook?
>
> The reason for suggesting this BPF-prog to be a callback, associated
> with the net_device, was that hardware is going to differ in which HW
> hints it supports.  Thus, we can avoid a generic C-function that needs to
> check for all the possible hints, and instead have a BPF-prog that only
> contains the code that is relevant for this net_device.

But that's exactly what the xdp_copy_metadata_for_skb() is! It's a
dynamic "BPF program" (generated as unrolled kfunc calls) just running
in the helper and stashing the results in an intermediate struct in the
metadata area. And once it's done that, we don't need *another* dynamic
BPF program to read it back out and populate the SKB, because the
intermediate format it's been stashed into is under the control of the
kernel (we just need a flag to indicate that it's there).

>>>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
>>>> each of the kfuncs needed for the metadata used by SKBs, all of which
>>>> get unrolled. The helper takes the output of these metadata-extracting
>>>> kfuncs and stores it "somewhere". This "somewhere" could well be the
>>>> metadata area; but in any case, since it's hidden away inside a helper
>>>> (or kfunc) from the calling XDP program's PoV, the helper can just stash
>>>> all the data in a fixed format, which __xdp_build_skb_from_frame() can
>>>> then just read statically. We could even make this format match the
>>>> field layout of struct sk_buff, so all we have to do is memcpy a
>>>> contiguous chunk of memory when building the SKB.
>>>
>>> +1
>
> Sorry, I think this "hiding" layout trick is going in the wrong direction.
>
> Imagine the use-case of cpumap redirect.  The physical device XDP-prog
> calls xdp_copy_metadata_for_skb() to extract info from the RX-desc, then
> it redirects into the cpumap.  On the remote CPU, the xdp_frame is picked
> up, and then I want to run another XDP-prog that wants to look at these
> HW-hints, and then likely call XDP_PASS, which creates the SKB, also
> using these HW-hints.  I take it that would not be possible when using
> the xdp_copy_metadata_for_skb() helper?

You're right that it should be possible to read the values back out
again later. That is totally possible with this scheme, though; the
'xdp_to_skb_metadata' is going to be in the vmlinux BTF, so an XDP
program can just read that. We can explicitly support it by using the
BTF ID as the "magic value" as you suggest, which would be fine by me.

I still think we should be using the __randomize_layout trick, though,
precisely so that BPF consumers are forced to use BTF relocations to
read it; otherwise we risk the struct layout ossifying into UAPI because
people are just going to assume it's static...
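
(I.e., a consumer would declare the struct with preserve_access_index,
or pull it from vmlinux.h, and read it like below; the field name is
assumed:)

/* CO-RE relocations fix up the (randomized) field offsets at load
 * time against the running kernel's BTF.
 */
struct xdp_to_skb_metadata *meta = (void *)(long)ctx->data_meta;
__u64 timestamp = 0;

if ((void *)(meta + 1) <= (void *)(long)ctx->data)
        timestamp = meta->rx_timestamp;   /* offset relocated via CO-RE */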

>>> I'm currently doing exactly what you're suggesting (minus matching skb layout):
>>>
>>> struct xdp_to_skb_metadata {
>>>    u32 magic; // randomized at boot
>>>    ... skb-consumable-metadata in fixed format
>>> } __randomize_layout;
>>>
>>> bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
>>> -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
>>> to fill in the actual data.
>>>
>>> Then, at __xdp_build_skb_from_frame time, I have regular kernel
>>> C code that parses that 'struct xdp_to_skb_metadata'.
>>> (To be precise, I'm trying to parse the metadata from
>>> skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not
>>> 100% sure that's the right place).
>>> (I also randomize the layout and magic to make sure userspace doesn't
>>> depend on it, because nothing stops this packet from being routed into
>>> an xsk socket.)
>> 
>> Ah, nice trick with __randomize_layout - I agree we need to do something
>> to prevent userspace from inadvertently starting to rely on this, and
>> this seems like a great solution!
>
> Sorry, I disagree with where this is going.  Why do we all of a sudden
> want to prevent userspace (e.g. AF_XDP) from using this data?!?

See above: I don't think we should prevent userspace from using it (and
we're not), but we should prevent the struct layout from ossifying.

> The whole exercise started with wanting to provide AF_XDP with these
> HW-hints. The hints a standard AF_XDP user wants are likely very
> similar to what the SKB user wants. Why does the AF_XDP user need to
> open-code this?
>
> The BTF-ID scheme precisely allows us to expose this layout to
> userspace, and at the same time have the freedom to change this in kernel
> space, as userspace must decode the BTF-layout before reading this.
> I was hoping xdp_copy_metadata_for_skb() could simply use the BTF-ID
> scheme, with the BTF-ID of struct xdp_hints_rx_common; is that too much to
> ask for?  You can just consider the BTF-ID as the magic number, as it
> will be more-or-less random per kernel build (and module load order).

As mentioned above, I would be totally fine with just having the
xdp_to_skb_metadata be part of BTF, enabling both XDP programs and
AF_XDP consumers to re-use it.

>> Look forward to seeing what the whole thing looks like in a more
>> complete form :)
>
> I'm sort of on-board with the kfuncs and unroll-tricks, if I can see
> some driver code that handles the issue of exposing the HW setup state
> needed to decode the RX-desc format.
>
> I sense that I myself haven't been good enough at explaining/conveying the
> BTF-ID scheme.  Next week, I will code some examples that demo how
> BTF-IDs can be used from BPF-progs, even as a communication channel
> between different BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog ->
> TC-BPF).

For my part at least, it's not a lack of understanding that makes me
prefer the kfunc approach. Rather, it's the complexity of having to
resolve the multiple BTF IDs, and the risk of ossifying the struct
layouts because people are going to do that wrong. Using kfuncs gives us
much more control of the API, especially if we combine it with struct
randomisation for the bits we do expose.

Translating what we've discussed above into the terms used in your patch
series, this would correspond to *only* having the xdp_metadata_common
struct exposed via BTF, and not bothering with all the other
driver-specific layouts. So an XDP/AF_XDP user that only wants to use
the metadata that's also used by the stack can just call
xdp_copy_metadata_for_skb(), and then read the resulting metadata area
(using BTF). And if someone wants to access metadata that's *not* used
by the stack, they'd have to call additional kfuncs to extract that.

And similarly, if someone wants only a subset of the metadata used by an
SKB, they can just *not* call xdp_copy_metadata_for_skb(), and instead
just call the individual kfuncs to extract just the fields they want.
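
In BPF terms, the two modes could then look something like the sketch
below. The kfunc names and signatures are made up for illustration and
would need to match whatever we end up merging:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical kfuncs, modelled on this RFC */
extern __u64 bpf_xdp_metadata_rx_timestamp(struct xdp_md *ctx) __ksym;
extern void bpf_xdp_copy_metadata_for_skb(struct xdp_md *ctx) __ksym;

SEC("xdp")
int rx_for_stack(struct xdp_md *ctx)
{
        /* wants everything the skb path consumes: one call stashes the
         * whole set in the metadata area, in the kernel's fixed format */
        bpf_xdp_copy_metadata_for_skb(ctx);
        return XDP_PASS;
}

SEC("xdp")
int rx_subset(struct xdp_md *ctx)
{
        void *data, *data_meta;
        __u64 ts;

        /* wants a single field: extract it and stash it manually */
        ts = bpf_xdp_metadata_rx_timestamp(ctx);

        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(ts)))
                return XDP_PASS;

        data = (void *)(long)ctx->data;
        data_meta = (void *)(long)ctx->data_meta;
        if (data_meta + sizeof(ts) > data)
                return XDP_PASS;

        *(__u64 *)data_meta = ts;
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";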

I think this strikes a nice balance between the flexibility of the
kernel to change things, the flexibility of XDP consumers to request
only the data they want, and the ability for the same metadata to be
consumed at different points. WDYT?

-Toke


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
  2022-11-03 12:48                                 ` Toke Høiland-Jørgensen
@ 2022-11-03 15:25                                   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 50+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-03 15:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	Stanislav Fomichev
  Cc: brouer, Martin KaFai Lau, Bezdeka, Florian, kuba, john.fastabend,
	alexandr.lobakin, anatoly.burakov, song, Deric, Nemanja, andrii,
	Kiszka, Jan, magnus.karlsson, willemb, ast, yhs, kpsingh, daniel,
	bpf, mtahhan, xdp-hints, netdev, jolsa, haoluo



On 03/11/2022 13.48, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
> 
>> On 03/11/2022 01.09, Toke Høiland-Jørgensen wrote:
>>> Stanislav Fomichev <sdf@google.com> writes:
>>>
>>>> On Wed, Nov 2, 2022 at 3:02 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>>
>>>>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:
>>>>>
>>>>>> On 01/11/2022 18.05, Martin KaFai Lau wrote:
>>>>>>> On 10/31/22 6:59 PM, Stanislav Fomichev wrote:
>>>>>>>> On Mon, Oct 31, 2022 at 3:57 PM Martin KaFai Lau
>>>>>>>> <martin.lau@linux.dev> wrote:
>>>>>>>>>
>>>>>>>>> On 10/31/22 10:00 AM, Stanislav Fomichev wrote:
>>>>>>>>>>> 2. AF_XDP programs won't be able to access the metadata without
>>>>>>>>>>> using a
>>>>>>>>>>> custom XDP program that calls the kfuncs and puts the data into the
>>>>>>>>>>> metadata area. We could solve this with some code in libxdp,
>>>>>>>>>>> though; if
>>>>>>>>>>> this code can be made generic enough (so it just dumps the available
>>>>>>>>>>> metadata functions from the running kernel at load time), it may
>>>>>>>>>>> be possible to make it forward-compatible
>>>>>>>>>>> with new versions of the kernel that add new fields, which should
>>>>>>>>>>> alleviate Florian's concern about keeping things in sync.
>>>>>>>>>>
>>>>>>>>>> Good point. I had to convert to a custom program to use the kfuncs :-(
>>>>>>>>>> But your suggestion sounds good; maybe libxdp can accept some extra
>>>>>>>>>> info about at which offset the user would like to place the metadata
>>>>>>>>>> and the library can generate the required bytecode?
>>>>>>>>>>
>>>>>>>>>>> 3. It will make it harder to consume the metadata when building
>>>>>>>>>>> SKBs. I
>>>>>>>>>>> think the CPUMAP and veth use cases are also quite important, and that
>>>>>>>>>>> we want metadata to be available for building SKBs in this path. Maybe
>>>>>>>>>>> this can be resolved by having a convenient kfunc for this that can be
>>>>>>>>>>> used for programs doing such redirects. E.g., you could just call
>>>>>>>>>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>>>>>>>>>> would recursively expand into all the kfunc calls needed to extract
>>>>>>>>>>> the
>>>>>>>>>>> metadata supported by the SKB path?
>>>>>>>>>>
>>>>>>>>>> So this xdp_copy_metadata_for_skb will create a metadata layout that
>>>>>>>>>
>>>>>>>>> Can the xdp_copy_metadata_for_skb be written as a bpf prog itself?
>>>>>>>>> Not sure where the best place to specify this prog is, though.
>>>>>>>>> Somehow during bpf_xdp_redirect_map?  Or does this prog belong
>>>>>>>>> to the target cpumap, so that the xdp prog redirecting to this
>>>>>>>>> cpumap has to write the meta layout in a way that the cpumap is
>>>>>>>>> expecting?
>>>>>>>>
>>>>>>>> We're probably interested in triggering it from the places where xdp
>>>>>>>> frames can eventually be converted into skbs?
>>>>>>>> So for plain 'return XDP_PASS' and things like bpf_redirect/etc? (IOW,
>>>>>>>> anything that's not XDP_DROP / AF_XDP redirect).
>>>>>>>> We can probably make it magically work, and can generate
>>>>>>>> kernel-digestible metadata whenever data == data_meta, but the
>>>>>>>> question - should we?
>>>>>>>> (need to make sure we won't regress any existing cases that are not
>>>>>>>> relying on the metadata)
>>>>>>>
>>>>>>> Instead of having some kernel-digestible metadata, how about calling
>>>>>>> another bpf prog to initialize the skb fields from the meta area after
>>>>>>> __xdp_build_skb_from_frame() in the cpumap, so
>>>>>>> run_xdp_set_skb_fields_from_metadata() may be a better name.
>>>>>>>
>>>>>>
>>>>>> I very much like this idea of calling another bpf prog to initialize the
>>>>>> SKB fields from the meta area. (As a reminder, the data needs to come
>>>>>> from the meta area, because at this point the hardware RX-desc is
>>>>>> out-of-scope).
>>>>>> I'm onboard with xdp_copy_metadata_for_skb() populating the meta area.
>>>>>>
>>>>>> We could invoke this BPF-prog inside __xdp_build_skb_from_frame().
>>>>>>
>>>>>> We might need a new BPF_PROG_TYPE_XDP2SKB as this new BPF-prog
>>>>>> run_xdp_set_skb_fields_from_metadata() would need both xdp_buff + SKB as
>>>>>> context inputs. Right?  (Not sure if this is acceptable under the BPF
>>>>>> maintainers' new rules)
>>>>>>
>>>>>>> The xdp_prog@rx sets the metadata and then redirects.  If the
>>>>>>> xdp_prog@rx can also specify an xdp prog to initialize the skb fields
>>>>>>> from the meta area, then there is no need to have a kfunc to enforce a
>>>>>>> kernel-digestible layout.  Not sure what a good way to specify this
>>>>>>> xdp_prog is, though...
>>>>>>
>>>>>> The challenge of running this (BPF_PROG_TYPE_XDP2SKB) BPF-prog inside
>>>>>> __xdp_build_skb_from_frame() is that it needs to know how to decode the
>>>>>> meta area for every device driver or XDP-prog populating this (as veth
>>>>>> and cpumap can get redirected packets from multiple device drivers).
>>>>>
>>>>> If we have the helper to copy the data "out of" the drivers, why do we
>>>>> need a second BPF program to copy data to the SKB?
>>>>>
>>
>> IMHO the second BPF program to populate the SKB is needed to add
>> flexibility and extensibility.
>>
>> My end-goal here is to speed up packet parsing.
>> This BPF-prog should (in time) be able to update skb->transport_header
>> and skb->network_header.  As I mentioned before, the HW RX-hash already
>> tells us the L3 and L4 protocols and, in most cases, the header length.
>> Even without HW-hints, the XDP-prog has likely parsed the packet once.
>> This parse information is lost today and redone by the netstack. What
>> about storing this header parse info in the metadata and re-using it in
>> this new XDP2SKB hook?
>>
>> The reason for suggesting that this BPF-prog be a callback, associated
>> with the net_device, was that hardware differs in which HW hints it
>> supports.  Thus, we can avoid a generic C-function that needs to check
>> for all the possible hints, and instead have a BPF-prog that only
>> contains the code that is relevant for this net_device.
> 
> But that's exactly what the xdp_copy_metadata_for_skb() is! It's a
> dynamic "BPF program" (generated as unrolled kfunc calls) just running
> in the helper and stashing the results in an intermediate struct in the
> metadata area. And once it's done that, we don't need *another* dynamic
> BPF program to read it back out and populate the SKB, because the
> intermediate format it's been stashed into is under the control of the
> kernel (we just need a flag to indicate that it's there).
> 
>>>>> I.e., the XDP program calls xdp_copy_metadata_for_skb(); this invokes
>>>>> each of the kfuncs needed for the metadata used by SKBs, all of which
>>>>> get unrolled. The helper takes the output of these metadata-extracting
>>>>> kfuncs and stores it "somewhere". This "somewhere" could well be the
>>>>> metadata area; but in any case, since it's hidden away inside a helper
>>>>> (or kfunc) from the calling XDP program's PoV, the helper can just stash
>>>>> all the data in a fixed format, which __xdp_build_skb_from_frame() can
>>>>> then just read statically. We could even make this format match the
>>>>> field layout of struct sk_buff, so all we have to do is memcpy a
>>>>> contiguous chunk of memory when building the SKB.
>>>>
>>>> +1
>>
>> Sorry, I think this "hiding" layout trick is going in the wrong direction.
>>
>> Imagine the use-case of cpumap redirect.  The physical device XDP-prog
>> calls xdp_copy_metadata_for_skb() to extract info from RX-desc, then it
>> calls redirect into cpumap.  On remote CPU, the xdp_frame is picked up,
>> and then I want to run another XDP-prog that wants to look at these
>> HW-hints, and then likely call XDP_PASS which creates the SKB, also
>> using these HW-hints.  I take it that would not be possible when using
>> the xdp_copy_metadata_for_skb() helper?
> 
> You're right that it should be possible to read the values back out
> again later. That is totally possible with this scheme, though; the
> 'xdp_to_skb_metadata' is going to be in the vmlinux BTF, so an XDP
> program can just read that. We can explicitly support it by using the
> BTF ID as the "magic value" as you suggest, which would be fine by me.
> 

I'm on-board if, as you suggest, we add the BTF_ID as the "magic value"
(as the last member, due to AF_XDP processing) when
xdp_copy_metadata_for_skb() writes 'xdp_to_skb_metadata' into the
metadata area.  We should simply see this BTF-ID as a 'cookie' or magic
number.
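
On the AF_XDP side, checking for the cookie could then be as simple as
the sketch below (userspace; names are mine, and it assumes the cookie
is the last member, sitting directly in front of the packet data):

#include <string.h>
#include <stdbool.h>
#include <xdp/xsk.h>    /* or <bpf/xsk.h>; for xsk_umem__get_data() */

static bool frame_has_skb_metadata(void *umem_area,
                                   const struct xdp_desc *desc,
                                   __u32 skb_meta_btf_id)
{
        unsigned char *pkt = xsk_umem__get_data(umem_area, desc->addr);
        __u32 magic;

        /* the metadata area ends right where the packet starts */
        memcpy(&magic, pkt - sizeof(magic), sizeof(magic));
        return magic == skb_meta_btf_id;
}

skb_meta_btf_id would be resolved once at startup, e.g. with libbpf's
btf__find_by_name_kind() against vmlinux BTF, and likewise for the
remaining field offsets.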

> I still think we should be using the __randomize_layout trick, though,
> precisely so that BPF consumers are forced to use BTF relocations to
> read it; otherwise we risk the struct layout ossifying into UAPI because
> people are just going to assume it's static...
> 

I'm also on-board with some level of randomization of the struct to
force consumers to use BTF for relocations; e.g. the BTF-ID cookie/magic
should stay at a fixed location.


>>>> I'm currently doing exactly what you're suggesting (minus matching skb layout):
>>>>
>>>> struct xdp_to_skb_metadata {
>>>>     u32 magic; // randomized at boot
>>>>     ... skb-consumable-metadata in fixed format
>>>> } __randomize_layout;
>>>>
>>>> bpf_xdp_copy_metadata_for_skb() does bpf_xdp_adjust_meta(ctx,
>>>> -sizeof(struct xdp_to_skb_metadata)) and then calls a bunch of kfuncs
>>>> to fill in the actual data.
>>>>
>>>> Then, at __xdp_build_skb_from_frame time, I'm having a regular kernel
>>>> C code that parses that 'struct xdp_to_skb_metadata'.
>>>> (To be precise, I'm trying to parse the metadata from
>>>> skb_metadata_set; it's called from __xdp_build_skb_from_frame, but not
>>>> 100% sure that's the right place).
>>>> (I also randomize the layout and magic to make sure userspace doesn't
>>>> depend on it, because nothing stops this packet from being routed into
>>>> an xsk socket..)
>>>
>>> Ah, nice trick with __randomize_layout - I agree we need to do something
>>> to prevent userspace from inadvertently starting to rely on this, and
>>> this seems like a great solution!
>>
>> Sorry, I disagree with where this is going.  Why do we all of a sudden
>> want to prevent userspace (e.g. AF_XDP) from using this data?!?
> 
> See above: I don't think we should prevent userspace from using it (and
> we're not), but we should prevent the struct layout from ossifying.
> 

Okay, then we are in agreement: avoid ossifying the struct layout.

>> The whole exercise started with wanting to provide AF_XDP with these
>> HW-hints. The hints a standard AF_XDP user wants are likely very
>> similar to what the SKB user wants. Why does the AF_XDP user need to
>> open code this?
>>
>> The BTF-ID scheme precisely allows us to expose this layout to
>> userspace, and at the same time have freedom to change this in kernel
>> space, as userspace must decode the BTF-layout before reading this.
>> I was hoping xdp_copy_metadata_for_skb() could simply use the BTF-ID
>> scheme, with the BTF-ID of struct xdp_hints_rx_common; is that too much to
>> ask for?  You can just consider the BTF-ID as the magic number, as it
>> will be more-or-less random per kernel build (and module load order).
> 
> As mentioned above, I would be totally fine with just having the
> xdp_to_skb_metadata be part of BTF, enabling both XDP programs and
> AF_XDP consumers to re-use it.
>

My use-case is that AF_XDP will need to key on this runtime BTF-ID magic 
value anyway to read out the 'xdp_to_skb_metadata' values.  I have an 
XDP-prog running that will RX-timestamp only the time-sync protocol 
packets.  Code-wise, I will simply add my RX-timestamp on top of struct 
'xdp_to_skb_metadata' and then update the BTF-ID magic value.  I don't 
need to add a real BTF-ID, just some magic value that my AF_XDP 
userspace prog knows about.  In my current code[1], I'm playing nice and 
adding the BPF-prog's own local BTF-ID via bpf_core_type_id_local().

[1] https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
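
For reference, the scheme in [1] boils down to something like the
following sketch (struct and field names are mine, and it assumes the
kernel-side 'xdp_to_skb_metadata' from this discussion is visible via
vmlinux.h):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct xdp_hints_timesync {
        struct xdp_to_skb_metadata common; /* what the stack consumes */
        __u64 ptp_rx_timestamp;            /* my extra field */
        __u32 btf_id;                      /* last member: the magic */
};

static __always_inline void tag_timesync(struct xdp_hints_timesync *meta,
                                         __u64 ts)
{
        meta->ptp_rx_timestamp = ts;
        /* local BTF-ID; my AF_XDP prog treats it as an opaque cookie */
        meta->btf_id = bpf_core_type_id_local(struct xdp_hints_timesync);
}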


>>> Look forward to seeing what the whole thing looks like in a more
>>> complete form :)
>>
>> I'm sort of on-board with the kfuncs and unroll-tricks, if I can see
>> some driver code that handles the issue of exposing the HW setup state
>> needed to decode the RX-desc format.
>>
>> I sense that I myself haven't been good enough at explaining/conveying the
>> BTF-ID scheme.  Next week, I will code some examples that demo how
>> BTF-IDs can be used from BPF-progs, even as a communication channel
>> between different BPF-progs (e.g. drv XDP-prog -> cpumap XDP-prog ->
>> TC-BPF).
> 
> For my part at least, it's not a lack of understanding that makes me
> prefer the kfunc approach. Rather, it's the complexity of having to
> resolve the multiple BTF IDs, and the risk of ossifying the struct
> layouts because people are going to do that wrong. Using kfuncs gives us
> much more control of the API, especially if we combine it with struct
> randomisation for the bits we do expose.
> 
> Translating what we've discussed above into the terms used in your patch
> series, this would correspond to *only* having the xdp_metadata_common
> struct exposed via BTF, and not bothering with all the other
> driver-specific layouts. So an XDP/AF_XDP user that only wants to use
> the metadata that's also used by the stack can just call
> xdp_copy_metadata_for_skb(), and then read the resulting metadata area
> (using BTF). And if someone wants to access metadata that's *not* used
> by the stack, they'd have to call additional kfuncs to extract that.
> 

I agree that this patchset will simplify my patchset.  My driver 
specific structs for BTF-IDs will no longer be needed, as it is now up 
to the XDP-prog to explicitly extend the metadata with fields.  This 
should reduce your concern about resolving multiple BTF IDs.

Maybe after this patchset, I would suggest we create some 
"kernel-central" structs that have e.g. RX-timestamp and mark (mlx5 has 
HW support for mark) and protocol types (via RX-hash).  These could be 
used by XDP-progs that explicitly extract these fields and communicate 
the layout by setting the BTF-ID (via calling 
bpf_core_type_id_kernel()), making it easier to consume from chained 
BPF-progs, AF_XDP and even kernel C-code that updates SKB fields, as 
the number of these magic BTF-IDs will be small enough.
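
As a sketch, assuming such a kernel-central struct existed (re-using
the 'xdp_hints_rx_common' name from my patchset here):

#include "vmlinux.h"            /* would carry xdp_hints_rx_common */
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

static __always_inline void tag_common_meta(struct xdp_md *ctx)
{
        struct xdp_hints_rx_common *meta = (void *)(long)ctx->data_meta;

        if ((void *)(meta + 1) > (void *)(long)ctx->data)
                return;

        /* announce the kernel-central layout via its kernel BTF-ID */
        meta->btf_id = bpf_core_type_id_kernel(struct xdp_hints_rx_common);
}

Any consumer, whether a chained BPF-prog, AF_XDP userspace or kernel
C-code, then only needs to recognise this small set of IDs.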


> And similarly, if someone wants only a subset of the metadata used by an
> SKB, they can just *not* call xdp_copy_metadata_for_skb(), and instead
> just call the individual kfuncs to extract just the fields they want.
> 
> I think this strikes a nice balance between the flexibility of the
> kernel to change things, the flexibility of XDP consumers to request
> only the data they want, and the ability for the same metadata to be
> consumed at different points. WDYT?

With the above comments, I think we are closer to an agreement again.

--Jesper


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2022-11-03 15:26 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-27 20:00 [RFC bpf-next 0/5] xdp: hints via kfuncs Stanislav Fomichev
2022-10-27 20:00 ` [RFC bpf-next 1/5] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
2022-10-27 20:00 ` [RFC bpf-next 2/5] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-10-28  8:40   ` Jesper Dangaard Brouer
2022-10-28 18:46     ` Stanislav Fomichev
2022-10-27 20:00 ` [RFC bpf-next 3/5] libbpf: Pass prog_ifindex via bpf_object_open_opts Stanislav Fomichev
2022-10-27 20:05   ` Andrii Nakryiko
2022-10-27 20:10     ` Stanislav Fomichev
2022-10-27 20:00 ` [RFC bpf-next 4/5] selftests/bpf: Convert xskxceiver to use custom program Stanislav Fomichev
2022-10-27 20:00 ` [RFC bpf-next 5/5] selftests/bpf: Test rx_timestamp metadata in xskxceiver Stanislav Fomichev
2022-10-28  6:22   ` Martin KaFai Lau
2022-10-28 10:37     ` Jesper Dangaard Brouer
2022-10-28 18:46       ` Stanislav Fomichev
2022-10-31 14:20         ` Alexander Lobakin
2022-10-31 14:29           ` Alexander Lobakin
2022-10-31 17:00           ` Stanislav Fomichev
2022-11-01 13:18             ` Jesper Dangaard Brouer
2022-11-01 20:12               ` Stanislav Fomichev
2022-11-01 22:23               ` [xdp-hints] " Toke Høiland-Jørgensen
2022-10-28 15:58 ` [RFC bpf-next 0/5] xdp: hints via kfuncs John Fastabend
2022-10-28 18:04   ` Jakub Kicinski
2022-10-28 18:46     ` Stanislav Fomichev
2022-10-28 23:16       ` John Fastabend
2022-10-29  1:14         ` Jakub Kicinski
2022-10-31 14:10           ` [xdp-hints] " Bezdeka, Florian
2022-10-31 15:28             ` Toke Høiland-Jørgensen
2022-10-31 17:00               ` Stanislav Fomichev
2022-10-31 22:57                 ` Martin KaFai Lau
2022-11-01  1:59                   ` Stanislav Fomichev
2022-11-01 12:52                     ` Toke Høiland-Jørgensen
2022-11-01 13:43                       ` David Ahern
2022-11-01 14:20                         ` Toke Høiland-Jørgensen
2022-11-01 17:05                     ` Martin KaFai Lau
2022-11-01 20:12                       ` Stanislav Fomichev
2022-11-02 14:06                       ` Jesper Dangaard Brouer
2022-11-02 22:01                         ` Toke Høiland-Jørgensen
2022-11-02 23:10                           ` Stanislav Fomichev
2022-11-03  0:09                             ` Toke Høiland-Jørgensen
2022-11-03 12:01                               ` Jesper Dangaard Brouer
2022-11-03 12:48                                 ` Toke Høiland-Jørgensen
2022-11-03 15:25                                   ` Jesper Dangaard Brouer
2022-10-31 19:36               ` Yonghong Song
2022-10-31 22:09                 ` Stanislav Fomichev
2022-10-31 22:38                   ` Yonghong Song
2022-10-31 22:55                     ` Stanislav Fomichev
2022-11-01 14:23                       ` Jesper Dangaard Brouer
2022-11-01 17:31                   ` Martin KaFai Lau
2022-11-01 20:12                     ` Stanislav Fomichev
2022-11-01 21:17                       ` Martin KaFai Lau
2022-10-31 17:01           ` John Fastabend

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).