* [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc
@ 2023-11-22 18:20 Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 1/7] bpf: xfrm: " Daniel Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: linux-kselftest, linux-kernel, bpf, netdev, steffen.klassert,
	antony.antony, alexei.starovoitov
  Cc: devel


This patchset adds two kfunc helpers, bpf_xdp_get_xfrm_state() and
bpf_xdp_xfrm_state_release(), that wrap xfrm_state_lookup() and
xfrm_state_put() respectively. The intent is to support software RSS
(via XDP) for the ongoing/upcoming IPsec pcpu work [0]. Recent experiments
performed on (hopefully) reproducible AWS testbeds indicate that
single-tunnel pcpu IPsec can reach line rate on 100G ENA NICs.

Note this patchset only tests/shows generic xfrm_state access. The
"secret sauce" (if you can really even call it that) involves accessing
a soon-to-be-upstreamed pcpu_num field in xfrm_state. Early example is
available here [1].

[0]: https://datatracker.ietf.org/doc/draft-ietf-ipsecme-multi-sa-performance/03/
[1]: https://github.com/danobi/xdp-tools/blob/e89a1c617aba3b50d990f779357d6ce2863ecb27/xdp-bench/xdp_redirect_cpumap.bpf.c#L385-L406

Changes from RFCv2:
* Rebased to ipsec-next
* Fix netns leak

Changes from RFCv1:
* Add Antony's commit tags
* Add KF_ACQUIRE and KF_RELEASE semantics

Daniel Xu (7):
  bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc
  bpf: selftests: test_tunnel: Use ping -6 over ping6
  bpf: selftests: test_tunnel: Mount bpffs if necessary
  bpf: selftests: test_tunnel: Use vmlinux.h declarations
  bpf: selftests: test_tunnel: Disable CO-RE relocations
  bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()

 include/net/xfrm.h                            |   9 ++
 net/xfrm/Makefile                             |   1 +
 net/xfrm/xfrm_policy.c                        |   2 +
 net/xfrm/xfrm_state_bpf.c                     | 127 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/test_tunnel_kern.c    |  98 ++++++++------
 tools/testing/selftests/bpf/test_tunnel.sh    |  43 ++++--
 7 files changed, 227 insertions(+), 54 deletions(-)
 create mode 100644 net/xfrm/xfrm_state_bpf.c

-- 
2.42.1



* [PATCH ipsec-next v1 1/7] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-22 23:26   ` Alexei Starovoitov
  2023-11-25 20:36   ` Yonghong Song
  2023-11-22 18:20 ` [PATCH ipsec-next v1 2/7] bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc Daniel Xu
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: john.fastabend, Herbert Xu, davem, ast, daniel, pabeni, hawk,
	kuba, edumazet, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: linux-kernel, netdev, bpf, devel

This commit adds an unstable kfunc helper to access internal xfrm_state
associated with an SA. This is intended to be used for the upcoming
IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
words: for custom software RSS.

That being said, the function that this kfunc wraps is fairly generic
and used for a lot of xfrm tasks. I'm sure people will find uses
elsewhere over time.

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 include/net/xfrm.h        |   9 ++++
 net/xfrm/Makefile         |   1 +
 net/xfrm/xfrm_policy.c    |   2 +
 net/xfrm/xfrm_state_bpf.c | 111 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 123 insertions(+)
 create mode 100644 net/xfrm/xfrm_state_bpf.c

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index c9bb0f892f55..1d107241b901 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -2190,4 +2190,13 @@ static inline int register_xfrm_interface_bpf(void)
 
 #endif
 
+#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
+int register_xfrm_state_bpf(void);
+#else
+static inline int register_xfrm_state_bpf(void)
+{
+	return 0;
+}
+#endif
+
 #endif	/* _NET_XFRM_H */
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index cd47f88921f5..547cec77ba03 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
 obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
 obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
 obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
+obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index c13dc3ef7910..1b7e75159727 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -4218,6 +4218,8 @@ void __init xfrm_init(void)
 #ifdef CONFIG_XFRM_ESPINTCP
 	espintcp_init();
 #endif
+
+	register_xfrm_state_bpf();
 }
 
 #ifdef CONFIG_AUDITSYSCALL
diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
new file mode 100644
index 000000000000..0c1f2f91125c
--- /dev/null
+++ b/net/xfrm/xfrm_state_bpf.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Unstable XFRM state BPF helpers.
+ *
+ * Note that it is allowed to break compatibility for these functions since the
+ * interface they are exposed through to BPF programs is explicitly unstable.
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <net/xdp.h>
+#include <net/xfrm.h>
+
+/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
+ *
+ * Members:
+ * @error      - Out parameter, set for any errors encountered
+ *		 Values:
+ *		   -EINVAL - netns_id is less than -1
+ *		   -EINVAL - Passed NULL for opts
+ *		   -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
+ *		   -ENONET - No network namespace found for netns_id
+ * @netns_id	- Specify the network namespace for lookup
+ *		 Values:
+ *		   BPF_F_CURRENT_NETNS (-1)
+ *		     Use namespace associated with ctx
+ *		   [0, S32_MAX]
+ *		     Network Namespace ID
+ * @mark	- XFRM mark to match on
+ * @daddr	- Destination address to match on
+ * @spi		- Security parameter index to match on
+ * @proto	- IPsec protocol to match on (e.g. IPPROTO_ESP)
+ * @family	- L3 protocol family to match on
+ */
+struct bpf_xfrm_state_opts {
+	s32 error;
+	s32 netns_id;
+	u32 mark;
+	xfrm_address_t daddr;
+	__be32 spi;
+	u8 proto;
+	u16 family;
+};
+
+enum {
+	BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
+};
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in xfrm_state BTF");
+
+/* bpf_xdp_get_xfrm_state - Get XFRM state
+ *
+ * Parameters:
+ * @ctx 	- Pointer to ctx (xdp_md) in XDP program
+ *		    Cannot be NULL
+ * @opts	- Options for lookup (documented above)
+ *		    Cannot be NULL
+ * @opts__sz	- Length of the bpf_xfrm_state_opts structure
+ *		    Must be BPF_XFRM_STATE_OPTS_SZ
+ */
+__bpf_kfunc struct xfrm_state *
+bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
+{
+	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+	struct net *net = dev_net(xdp->rxq->dev);
+	struct xfrm_state *x;
+
+	if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
+		opts->error = -EINVAL;
+		return NULL;
+	}
+
+	if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
+		opts->error = -EINVAL;
+		return NULL;
+	}
+
+	if (opts->netns_id >= 0) {
+		net = get_net_ns_by_id(net, opts->netns_id);
+		if (unlikely(!net)) {
+			opts->error = -ENONET;
+			return NULL;
+		}
+	}
+
+	x = xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
+			      opts->proto, opts->family);
+
+	if (opts->netns_id >= 0)
+		put_net(net);
+
+	return x;
+}
+
+__diag_pop()
+
+BTF_SET8_START(xfrm_state_kfunc_set)
+BTF_ID_FLAGS(func, bpf_xdp_get_xfrm_state, KF_RET_NULL | KF_ACQUIRE)
+BTF_SET8_END(xfrm_state_kfunc_set)
+
+static const struct btf_kfunc_id_set xfrm_state_xdp_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xfrm_state_kfunc_set,
+};
+
+int __init register_xfrm_state_bpf(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP,
+					 &xfrm_state_xdp_kfunc_set);
+}
-- 
2.42.1
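
A note on the argument checks in bpf_xdp_get_xfrm_state() as posted: the first branch is taken when opts is NULL, yet it still writes opts->error. The following is a minimal userspace sketch, not the kernel code, of a check order that never writes through a NULL or too-small opts. The struct opts_sketch type, the validate_opts() name and the CURRENT_NETNS macro are hypothetical stand-ins introduced only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct bpf_xfrm_state_opts;
 * only the fields needed for validation are modeled.
 */
struct opts_sketch {
	int32_t error;
	int32_t netns_id;
};

#define OPTS_SKETCH_SZ	sizeof(struct opts_sketch)
#define CURRENT_NETNS	(-1)	/* mirrors BPF_F_CURRENT_NETNS */

/* Returns 0 if the options are usable, -1 otherwise. opts->error is
 * written only once opts is known to be non-NULL and large enough to
 * hold the error field.
 */
static int validate_opts(struct opts_sketch *opts, uint32_t opts_sz)
{
	if (!opts || opts_sz < sizeof(opts->error))
		return -1;	/* nowhere safe to report an error */

	if (opts_sz != OPTS_SKETCH_SZ) {
		opts->error = -EINVAL;
		return -1;
	}

	if (opts->netns_id < CURRENT_NETNS) {
		opts->error = -EINVAL;
		return -1;
	}

	opts->error = 0;
	return 0;
}
```

With this ordering a caller that passes NULL gets a plain failure and no error code, which is the only safe outcome since there is no storage to report into.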



* [PATCH ipsec-next v1 2/7] bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 1/7] bpf: xfrm: " Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 3/7] bpf: selftests: test_tunnel: Use ping -6 over ping6 Daniel Xu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: john.fastabend, Herbert Xu, davem, ast, daniel, pabeni, hawk,
	kuba, edumazet, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: netdev, linux-kernel, bpf, devel

This kfunc releases a previously acquired xfrm_state from
bpf_xdp_get_xfrm_state().

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 net/xfrm/xfrm_state_bpf.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
index 0c1f2f91125c..33d1b00fedbd 100644
--- a/net/xfrm/xfrm_state_bpf.c
+++ b/net/xfrm/xfrm_state_bpf.c
@@ -93,10 +93,26 @@ bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32
 	return x;
 }
 
+/* bpf_xdp_xfrm_state_release - Release acquired xfrm_state object
+ *
+ * This must be invoked for referenced PTR_TO_BTF_ID, and the verifier rejects
+ * the program if any references remain in the program in all of the explored
+ * states.
+ *
+ * Parameters:
+ * @x		- Pointer to referenced xfrm_state object, obtained using
+ *		  bpf_xdp_get_xfrm_state.
+ */
+__bpf_kfunc void bpf_xdp_xfrm_state_release(struct xfrm_state *x)
+{
+	xfrm_state_put(x);
+}
+
 __diag_pop()
 
 BTF_SET8_START(xfrm_state_kfunc_set)
 BTF_ID_FLAGS(func, bpf_xdp_get_xfrm_state, KF_RET_NULL | KF_ACQUIRE)
+BTF_ID_FLAGS(func, bpf_xdp_xfrm_state_release, KF_RELEASE)
 BTF_SET8_END(xfrm_state_kfunc_set)
 
 static const struct btf_kfunc_id_set xfrm_state_xdp_kfunc_set = {
-- 
2.42.1
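
The KF_ACQUIRE / KF_RELEASE pairing above means every non-NULL pointer returned by the lookup must be released exactly once on every exit path. The following userspace sketch models that contract with a plain counter; struct state_sketch, state_get(), state_release() and read_replay_window() are hypothetical names, and the real verifier enforces this statically rather than at run time.

```c
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct xfrm_state. */
struct state_sketch {
	int refcnt;
	int has_replay_esn;
	unsigned int replay_window;
};

/* Stand-in for the KF_ACQUIRE | KF_RET_NULL lookup: it may fail, and
 * it takes a reference on success.
 */
static struct state_sketch *state_get(struct state_sketch *s, int found)
{
	if (!found)
		return NULL;
	s->refcnt++;
	return s;
}

/* Stand-in for the KF_RELEASE kfunc: drops the reference taken above. */
static void state_release(struct state_sketch *s)
{
	s->refcnt--;
}

/* Reads replay_window if available and returns -1 otherwise. Every
 * path that obtained a non-NULL pointer releases it exactly once.
 */
static long read_replay_window(struct state_sketch *table, int found)
{
	long window = -1;
	struct state_sketch *x = state_get(table, found);

	if (!x)
		return window;	/* nothing acquired, nothing to release */

	if (x->has_replay_esn)
		window = x->replay_window;

	state_release(x);
	return window;
}
```

After any call to read_replay_window() the reference count is back at its starting value, which is the invariant a BPF program using the two kfuncs has to preserve.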



* [PATCH ipsec-next v1 3/7] bpf: selftests: test_tunnel: Use ping -6 over ping6
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 1/7] bpf: xfrm: " Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 2/7] bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 4/7] bpf: selftests: test_tunnel: Mount bpffs if necessary Daniel Xu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: shuah, daniel, andrii, ast, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: mykolal, martin.lau, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel,
	devel, netdev

The ping6 binary went away over 7 years ago [0].

[0]: https://github.com/iputils/iputils/commit/ebad35fee3de851b809c7b72ccc654a72b6af61d

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/test_tunnel.sh | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index 2dec7dbf29a2..85ba39992461 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -295,13 +295,13 @@ test_ip6gre()
 	add_ip6gretap_tunnel
 	attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# overlay: ipv4 over ipv6
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	ping $PING_ARG 10.1.1.100
 	check_err $?
 	# overlay: ipv6 over ipv6
-	ip netns exec at_ns0 ping6 $PING_ARG fc80::200
+	ip netns exec at_ns0 ping -6 $PING_ARG fc80::200
 	check_err $?
 	cleanup
 
@@ -324,13 +324,13 @@ test_ip6gretap()
 	add_ip6gretap_tunnel
 	attach_bpf $DEV ip6gretap_set_tunnel ip6gretap_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# overlay: ipv4 over ipv6
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	ping $PING_ARG 10.1.1.100
 	check_err $?
 	# overlay: ipv6 over ipv6
-	ip netns exec at_ns0 ping6 $PING_ARG fc80::200
+	ip netns exec at_ns0 ping -6 $PING_ARG fc80::200
 	check_err $?
 	cleanup
 
@@ -376,7 +376,7 @@ test_ip6erspan()
 	config_device
 	add_ip6erspan_tunnel $1
 	attach_bpf $DEV ip4ip6erspan_set_tunnel ip4ip6erspan_get_tunnel
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	check_err $?
 	cleanup
@@ -474,7 +474,7 @@ test_ipip6()
 	ip link set dev veth1 mtu 1500
 	attach_bpf $DEV ipip6_set_tunnel ipip6_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# ip4 over ip6
 	ping $PING_ARG 10.1.1.100
 	check_err $?
@@ -502,11 +502,11 @@ test_ip6ip6()
 	ip link set dev veth1 mtu 1500
 	attach_bpf $DEV ip6ip6_set_tunnel ip6ip6_get_tunnel
 	# underlay
-	ping6 $PING_ARG ::11
+	ping -6 $PING_ARG ::11
 	# ip6 over ip6
-	ping6 $PING_ARG 1::11
+	ping -6 $PING_ARG 1::11
 	check_err $?
-	ip netns exec at_ns0 ping6 $PING_ARG 1::22
+	ip netns exec at_ns0 ping -6 $PING_ARG 1::22
 	check_err $?
 	cleanup
 
-- 
2.42.1



* [PATCH ipsec-next v1 4/7] bpf: selftests: test_tunnel: Mount bpffs if necessary
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (2 preceding siblings ...)
  2023-11-22 18:20 ` [PATCH ipsec-next v1 3/7] bpf: selftests: test_tunnel: Use ping -6 over ping6 Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: shuah, daniel, andrii, ast, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: martin.lau, song, yonghong.song, john.fastabend, kpsingh, sdf,
	haoluo, jolsa, mykolal, bpf, linux-kselftest, linux-kernel,
	devel, netdev

Previously, if bpffs was not already mounted, then the test suite would
fail during object file pinning steps. Fix by mounting bpffs if
necessary.

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/test_tunnel.sh | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index 85ba39992461..dd3c79129e87 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -46,7 +46,8 @@
 # 6) Forward the packet to the overlay tnl dev
 
 BPF_FILE="test_tunnel_kern.bpf.o"
-BPF_PIN_TUNNEL_DIR="/sys/fs/bpf/tc/tunnel"
+BPF_FS="/sys/fs/bpf"
+BPF_PIN_TUNNEL_DIR="${BPF_FS}/tc/tunnel"
 PING_ARG="-c 3 -w 10 -q"
 ret=0
 GREEN='\033[0;92m'
@@ -668,10 +669,20 @@ check_err()
 	fi
 }
 
+mount_bpffs()
+{
+	if ! mount | grep "bpf on /sys/fs/bpf" &>/dev/null; then
+		mount -t bpf bpf "$BPF_FS"
+	fi
+}
+
 bpf_tunnel_test()
 {
 	local errors=0
 
+	echo "Mounting bpffs..."
+	mount_bpffs
+
 	echo "Testing GRE tunnel..."
 	test_gre
 	errors=$(( $errors + $? ))
-- 
2.42.1
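
The grep in mount_bpffs() matches on the source name "bpf", so a bpffs mounted under another source name (for example "none") would not be detected. As a hedged illustration of the alternative, the sketch below matches on mount point and filesystem type instead. It is written in C rather than shell, takes /proc/mounts-style text as a string, assumes every entry has at least three whitespace-separated fields, and uses the hypothetical name bpffs_listed_at().

```c
#include <stdio.h>
#include <string.h>

/* Scans /proc/mounts-style text ("source target fstype options ...",
 * one entry per line) and returns 1 if a filesystem of type "bpf" is
 * listed at dir, 0 otherwise.
 */
static int bpffs_listed_at(const char *mounts, const char *dir)
{
	char src[256], target[256], fstype[64];

	while (*mounts) {
		if (sscanf(mounts, "%255s %255s %63s", src, target, fstype) == 3 &&
		    !strcmp(target, dir) && !strcmp(fstype, "bpf"))
			return 1;

		mounts = strchr(mounts, '\n');
		if (!mounts)
			break;
		mounts++;	/* step past the newline to the next entry */
	}
	return 0;
}
```

In a real check the text would come from reading /proc/mounts; the string argument just keeps the sketch free of file I/O.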



* [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (3 preceding siblings ...)
  2023-11-22 18:20 ` [PATCH ipsec-next v1 4/7] bpf: selftests: test_tunnel: Mount bpffs if necessary Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-26  0:34   ` Yonghong Song
  2023-11-22 18:20 ` [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
  2023-11-22 18:20 ` [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
  6 siblings, 1 reply; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: shuah, daniel, andrii, ast, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: mykolal, martin.lau, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel,
	devel, netdev

vmlinux.h declarations are more ergonomic, especially when working with
kfuncs. The uapi headers are often incomplete for kfunc definitions.

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 .../selftests/bpf/progs/bpf_tracing_net.h     |  1 +
 .../selftests/bpf/progs/test_tunnel_kern.c    | 48 ++++---------------
 2 files changed, 9 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
index 0b793a102791..1bdc680b0e0e 100644
--- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
+++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
@@ -26,6 +26,7 @@
 #define IPV6_AUTOFLOWLABEL	70
 
 #define TC_ACT_UNSPEC		(-1)
+#define TC_ACT_OK		0
 #define TC_ACT_SHOT		2
 
 #define SOL_TCP			6
diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index f66af753bbbb..3065a716544d 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -6,62 +6,30 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
-#include <stddef.h>
-#include <string.h>
-#include <arpa/inet.h>
-#include <linux/bpf.h>
-#include <linux/if_ether.h>
-#include <linux/if_packet.h>
-#include <linux/if_tunnel.h>
-#include <linux/ip.h>
-#include <linux/ipv6.h>
-#include <linux/icmp.h>
-#include <linux/types.h>
-#include <linux/socket.h>
-#include <linux/pkt_cls.h>
-#include <linux/erspan.h>
-#include <linux/udp.h>
+#include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include "bpf_kfuncs.h"
+#include "bpf_tracing_net.h"
 
 #define log_err(__ret) bpf_printk("ERROR line:%d ret:%d\n", __LINE__, __ret)
 
-#define VXLAN_UDP_PORT 4789
+#define VXLAN_UDP_PORT		4789
+#define ETH_P_IP		0x0800
+#define PACKET_HOST		0
+#define TUNNEL_CSUM		bpf_htons(0x01)
+#define TUNNEL_KEY		bpf_htons(0x04)
 
 /* Only IPv4 address assigned to veth1.
  * 172.16.1.200
  */
 #define ASSIGNED_ADDR_VETH1 0xac1001c8
 
-struct geneve_opt {
-	__be16	opt_class;
-	__u8	type;
-	__u8	length:5;
-	__u8	r3:1;
-	__u8	r2:1;
-	__u8	r1:1;
-	__u8	opt_data[8]; /* hard-coded to 8 byte */
-};
-
 struct vxlanhdr {
 	__be32 vx_flags;
 	__be32 vx_vni;
 } __attribute__((packed));
 
-struct vxlan_metadata {
-	__u32     gbp;
-};
-
-struct bpf_fou_encap {
-	__be16 sport;
-	__be16 dport;
-};
-
-enum bpf_fou_encap_type {
-	FOU_BPF_ENCAP_FOU,
-	FOU_BPF_ENCAP_GUE,
-};
-
 int bpf_skb_set_fou_encap(struct __sk_buff *skb_ctx,
 			  struct bpf_fou_encap *encap, int type) __ksym;
 int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
-- 
2.42.1



* [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (4 preceding siblings ...)
  2023-11-22 18:20 ` [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-26  0:51   ` Yonghong Song
  2023-11-22 18:20 ` [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
  6 siblings, 1 reply; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: shuah, daniel, andrii, ast, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: mykolal, martin.lau, song, yonghong.song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel,
	devel, netdev

Switching to vmlinux.h definitions seems to make the verifier very
unhappy with bitfield accesses. The error is:

    ; md.u.md2.dir = direction;
    33: (69) r1 = *(u16 *)(r2 +11)
    misaligned stack access off (0x0; 0x0)+-64+11 size 2

Disabling CO-RE relocations seems to make the error go
away.

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 tools/testing/selftests/bpf/progs/test_tunnel_kern.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 3065a716544d..ec7e04e012ae 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -6,6 +6,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define BPF_NO_PRESERVE_ACCESS_INDEX
 #include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
-- 
2.42.1



* [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
  2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
                   ` (5 preceding siblings ...)
  2023-11-22 18:20 ` [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
@ 2023-11-22 18:20 ` Daniel Xu
  2023-11-22 23:28   ` Alexei Starovoitov
  6 siblings, 1 reply; 33+ messages in thread
From: Daniel Xu @ 2023-11-22 18:20 UTC (permalink / raw)
  To: john.fastabend, davem, ast, daniel, hawk, kuba, andrii, shuah,
	steffen.klassert, antony.antony, alexei.starovoitov
  Cc: martin.lau, song, yonghong.song, kpsingh, sdf, haoluo, jolsa,
	mykolal, bpf, linux-kselftest, linux-kernel, netdev, devel

This commit extends test_tunnel selftest to test the new XDP xfrm state
lookup kfunc.

Co-developed-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
---
 .../selftests/bpf/progs/test_tunnel_kern.c    | 49 +++++++++++++++++++
 tools/testing/selftests/bpf/test_tunnel.sh    | 12 +++--
 2 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index ec7e04e012ae..17bf9ce28460 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -35,6 +35,10 @@ int bpf_skb_set_fou_encap(struct __sk_buff *skb_ctx,
 			  struct bpf_fou_encap *encap, int type) __ksym;
 int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
 			  struct bpf_fou_encap *encap) __ksym;
+struct xfrm_state *
+bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts,
+		       u32 opts__sz) __ksym;
+void bpf_xdp_xfrm_state_release(struct xfrm_state *x) __ksym;
 
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
@@ -948,4 +952,49 @@ int xfrm_get_state(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
+SEC("xdp")
+int xfrm_get_state_xdp(struct xdp_md *xdp)
+{
+	struct bpf_xfrm_state_opts opts = {};
+	struct xfrm_state *x = NULL;
+	struct ip_esp_hdr *esph;
+	struct bpf_dynptr ptr;
+	u8 esph_buf[8] = {};
+	u8 iph_buf[20] = {};
+	struct iphdr *iph;
+	u32 off;
+
+	if (bpf_dynptr_from_xdp(xdp, 0, &ptr))
+		goto out;
+
+	off = sizeof(struct ethhdr);
+	iph = bpf_dynptr_slice(&ptr, off, iph_buf, sizeof(iph_buf));
+	if (!iph || iph->protocol != IPPROTO_ESP)
+		goto out;
+
+	off += sizeof(struct iphdr);
+	esph = bpf_dynptr_slice(&ptr, off, esph_buf, sizeof(esph_buf));
+	if (!esph)
+		goto out;
+
+	opts.netns_id = BPF_F_CURRENT_NETNS;
+	opts.daddr.a4 = iph->daddr;
+	opts.spi = esph->spi;
+	opts.proto = IPPROTO_ESP;
+	opts.family = AF_INET;
+
+	x = bpf_xdp_get_xfrm_state(xdp, &opts, sizeof(opts));
+	if (!x || opts.error)
+		goto out;
+
+	if (!x->replay_esn)
+		goto out;
+
+	bpf_printk("replay-window %d\n", x->replay_esn->replay_window);
+out:
+	if (x)
+		bpf_xdp_xfrm_state_release(x);
+	return XDP_PASS;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tunnel.sh b/tools/testing/selftests/bpf/test_tunnel.sh
index dd3c79129e87..17d263681c71 100755
--- a/tools/testing/selftests/bpf/test_tunnel.sh
+++ b/tools/testing/selftests/bpf/test_tunnel.sh
@@ -528,7 +528,7 @@ setup_xfrm_tunnel()
 	# at_ns0 -> root
 	ip netns exec at_ns0 \
 		ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
-			spi $spi_in_to_out reqid 1 mode tunnel \
+			spi $spi_in_to_out reqid 1 mode tunnel replay-window 42 \
 			auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
 	ip netns exec at_ns0 \
 		ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir out \
@@ -537,7 +537,7 @@ setup_xfrm_tunnel()
 	# root -> at_ns0
 	ip netns exec at_ns0 \
 		ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
-			spi $spi_out_to_in reqid 2 mode tunnel \
+			spi $spi_out_to_in reqid 2 mode tunnel replay-window 42 \
 			auth-trunc 'hmac(sha1)' $auth 96 enc 'cbc(aes)' $enc
 	ip netns exec at_ns0 \
 		ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir in \
@@ -553,14 +553,14 @@ setup_xfrm_tunnel()
 	# root namespace
 	# at_ns0 -> root
 	ip xfrm state add src 172.16.1.100 dst 172.16.1.200 proto esp \
-		spi $spi_in_to_out reqid 1 mode tunnel \
+		spi $spi_in_to_out reqid 1 mode tunnel replay-window 42 \
 		auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
 	ip xfrm policy add src 10.1.1.100/32 dst 10.1.1.200/32 dir in \
 		tmpl src 172.16.1.100 dst 172.16.1.200 proto esp reqid 1 \
 		mode tunnel
 	# root -> at_ns0
 	ip xfrm state add src 172.16.1.200 dst 172.16.1.100 proto esp \
-		spi $spi_out_to_in reqid 2 mode tunnel \
+		spi $spi_out_to_in reqid 2 mode tunnel replay-window 42 \
 		auth-trunc 'hmac(sha1)' $auth 96  enc 'cbc(aes)' $enc
 	ip xfrm policy add src 10.1.1.200/32 dst 10.1.1.100/32 dir out \
 		tmpl src 172.16.1.200 dst 172.16.1.100 proto esp reqid 2 \
@@ -585,6 +585,8 @@ test_xfrm_tunnel()
 	tc qdisc add dev veth1 clsact
 	tc filter add dev veth1 proto ip ingress bpf da object-pinned \
 		${BPF_PIN_TUNNEL_DIR}/xfrm_get_state
+	ip link set dev veth1 xdpdrv pinned \
+		${BPF_PIN_TUNNEL_DIR}/xfrm_get_state_xdp
 	ip netns exec at_ns0 ping $PING_ARG 10.1.1.200
 	sleep 1
 	grep "reqid 1" ${TRACE}
@@ -593,6 +595,8 @@ test_xfrm_tunnel()
 	check_err $?
 	grep "remote ip 0xac100164" ${TRACE}
 	check_err $?
+	grep "replay-window 42" ${TRACE}
+	check_err $?
 	cleanup
 
 	if [ $ret -ne 0 ]; then
-- 
2.42.1
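
The XDP test program above locates the ESP header at a fixed offset of sizeof(struct ethhdr) + sizeof(struct iphdr), which holds only when the IPv4 header carries no options. The following userspace sketch shows the same key extraction (destination address and SPI) while honoring the IHL field and checking the EtherType. The struct esp_key type, parse_esp_key() and the two macros are hypothetical and are not part of the selftest.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN_SKETCH	14
#define PROTO_ESP	50	/* IPPROTO_ESP */

struct esp_key {
	uint32_t daddr_be;	/* IPv4 destination, network byte order */
	uint32_t spi_be;	/* ESP SPI, network byte order */
};

/* Extracts the lookup key from an Ethernet + IPv4 + ESP frame.
 * Returns 0 on success, -1 if the frame is too short or not IPv4/ESP.
 */
static int parse_esp_key(const uint8_t *frame, size_t len, struct esp_key *key)
{
	const uint8_t *ip, *esp;
	size_t ihl;

	if (len < ETH_HLEN_SKETCH + 20)
		return -1;
	if (frame[12] != 0x08 || frame[13] != 0x00)	/* EtherType IPv4 */
		return -1;

	ip = frame + ETH_HLEN_SKETCH;
	if ((ip[0] >> 4) != 4 || ip[9] != PROTO_ESP)
		return -1;

	ihl = (size_t)(ip[0] & 0x0f) * 4;	/* header length in bytes */
	if (ihl < 20 || len < ETH_HLEN_SKETCH + ihl + 8)
		return -1;

	esp = ip + ihl;				/* SPI is the first ESP field */
	memcpy(&key->daddr_be, ip + 16, 4);
	memcpy(&key->spi_be, esp, 4);
	return 0;
}
```

Both key fields stay in network byte order, matching how the test assigns iph->daddr and esph->spi to the lookup options without conversion.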



* Re: [PATCH ipsec-next v1 1/7] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-11-22 18:20 ` [PATCH ipsec-next v1 1/7] bpf: xfrm: " Daniel Xu
@ 2023-11-22 23:26   ` Alexei Starovoitov
  2023-11-25 20:36   ` Yonghong Song
  1 sibling, 0 replies; 33+ messages in thread
From: Alexei Starovoitov @ 2023-11-22 23:26 UTC (permalink / raw)
  To: Daniel Xu
  Cc: John Fastabend, Herbert Xu, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Paolo Abeni, Jesper Dangaard Brouer,
	Jakub Kicinski, Eric Dumazet, Steffen Klassert, antony.antony,
	LKML, Network Development, bpf, devel

On Wed, Nov 22, 2023 at 10:21 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +                 "Global functions as their definitions will be in xfrm_state BTF");

Pls use __bpf_kfunc_start_defs() instead.


* Re: [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
  2023-11-22 18:20 ` [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
@ 2023-11-22 23:28   ` Alexei Starovoitov
  2023-11-24 20:59     ` Daniel Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2023-11-22 23:28 UTC (permalink / raw)
  To: Daniel Xu
  Cc: John Fastabend, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, Jakub Kicinski,
	Andrii Nakryiko, Shuah Khan, Steffen Klassert, antony.antony,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Network Development,
	devel

On Wed, Nov 22, 2023 at 10:21 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> +
> +       bpf_printk("replay-window %d\n", x->replay_esn->replay_window);

Pls no printk in tests. Find a different way to validate.


* Re: [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
  2023-11-22 23:28   ` Alexei Starovoitov
@ 2023-11-24 20:59     ` Daniel Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-24 20:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, Jakub Kicinski,
	Andrii Nakryiko, Shuah Khan, Steffen Klassert, antony.antony,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, Network Development,
	devel

Hi Alexei,

On Wed, Nov 22, 2023 at 03:28:16PM -0800, Alexei Starovoitov wrote:
> On Wed, Nov 22, 2023 at 10:21 AM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > +
> > +       bpf_printk("replay-window %d\n", x->replay_esn->replay_window);
> 
> Pls no printk in tests. Find a different way to validate.

Ack. I'll migrate the ipsec tunnel tests to test_progs next rev so it
can use mmaped globals.

Thanks,
Daniel


* Re: [PATCH ipsec-next v1 1/7] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-11-22 18:20 ` [PATCH ipsec-next v1 1/7] bpf: xfrm: " Daniel Xu
  2023-11-22 23:26   ` Alexei Starovoitov
@ 2023-11-25 20:36   ` Yonghong Song
  2023-11-26  4:38     ` Daniel Xu
  1 sibling, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2023-11-25 20:36 UTC (permalink / raw)
  To: Daniel Xu, john.fastabend, Herbert Xu, davem, ast, daniel,
	pabeni, hawk, kuba, edumazet, steffen.klassert, antony.antony,
	alexei.starovoitov
  Cc: linux-kernel, netdev, bpf, devel


On 11/22/23 1:20 PM, Daniel Xu wrote:
> This commit adds an unstable kfunc helper to access internal xfrm_state
> associated with an SA. This is intended to be used for the upcoming
> IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
> words: for custom software RSS.
>
> That being said, the function that this kfunc wraps is fairly generic
> and used for a lot of xfrm tasks. I'm sure people will find uses
> elsewhere over time.
>
> Co-developed-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>   include/net/xfrm.h        |   9 ++++
>   net/xfrm/Makefile         |   1 +
>   net/xfrm/xfrm_policy.c    |   2 +
>   net/xfrm/xfrm_state_bpf.c | 111 ++++++++++++++++++++++++++++++++++++++
>   4 files changed, 123 insertions(+)
>   create mode 100644 net/xfrm/xfrm_state_bpf.c
>
> diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> index c9bb0f892f55..1d107241b901 100644
> --- a/include/net/xfrm.h
> +++ b/include/net/xfrm.h
> @@ -2190,4 +2190,13 @@ static inline int register_xfrm_interface_bpf(void)
>   
>   #endif
>   
> +#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
> +int register_xfrm_state_bpf(void);
> +#else
> +static inline int register_xfrm_state_bpf(void)
> +{
> +	return 0;
> +}
> +#endif
> +
>   #endif	/* _NET_XFRM_H */
> diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> index cd47f88921f5..547cec77ba03 100644
> --- a/net/xfrm/Makefile
> +++ b/net/xfrm/Makefile
> @@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
>   obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
>   obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
>   obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
> +obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
> diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> index c13dc3ef7910..1b7e75159727 100644
> --- a/net/xfrm/xfrm_policy.c
> +++ b/net/xfrm/xfrm_policy.c
> @@ -4218,6 +4218,8 @@ void __init xfrm_init(void)
>   #ifdef CONFIG_XFRM_ESPINTCP
>   	espintcp_init();
>   #endif
> +
> +	register_xfrm_state_bpf();
>   }
>   
>   #ifdef CONFIG_AUDITSYSCALL
> diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
> new file mode 100644
> index 000000000000..0c1f2f91125c
> --- /dev/null
> +++ b/net/xfrm/xfrm_state_bpf.c
> @@ -0,0 +1,111 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Unstable XFRM state BPF helpers.
> + *
> + * Note that it is allowed to break compatibility for these functions since the
> + * interface they are exposed through to BPF programs is explicitly unstable.
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf_ids.h>
> +#include <net/xdp.h>
> +#include <net/xfrm.h>
> +
> +/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
> + *
> + * Members:
> + * @error      - Out parameter, set for any errors encountered
> + *		 Values:
> + *		   -EINVAL - netns_id is less than -1
> + *		   -EINVAL - Passed NULL for opts
> + *		   -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
> + *		   -ENONET - No network namespace found for netns_id
> + * @netns_id	- Specify the network namespace for lookup
> + *		 Values:
> + *		   BPF_F_CURRENT_NETNS (-1)
> + *		     Use namespace associated with ctx
> + *		   [0, S32_MAX]
> + *		     Network Namespace ID
> + * @mark	- XFRM mark to match on
> + * @daddr	- Destination address to match on
> + * @spi		- Security parameter index to match on
> + * @proto	- L3 protocol to match on
> + * @family	- L3 protocol family to match on
> + */
> +struct bpf_xfrm_state_opts {
> +	s32 error;
> +	s32 netns_id;
> +	u32 mark;
> +	xfrm_address_t daddr;
> +	__be32 spi;
> +	u8 proto;
> +	u16 family;
> +};
> +
> +enum {
> +	BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
> +};
> +
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in xfrm_state BTF");
> +
> +/* bpf_xdp_get_xfrm_state - Get XFRM state
> + *
> + * Parameters:
> + * @ctx 	- Pointer to ctx (xdp_md) in XDP program
> + *		    Cannot be NULL
> + * @opts	- Options for lookup (documented above)
> + *		    Cannot be NULL
> + * @opts__sz	- Length of the bpf_xfrm_state_opts structure
> + *		    Must be BPF_XFRM_STATE_OPTS_SZ
> + */
> +__bpf_kfunc struct xfrm_state *
> +bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
> +{
> +	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> +	struct net *net = dev_net(xdp->rxq->dev);
> +	struct xfrm_state *x;
> +
> +	if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
> +		opts->error = -EINVAL;

If opts is NULL, the opts->error store is obviously a NULL pointer
dereference. If opts is not NULL and opts__sz < 4, the opts->error
store is also a problem, since it may overwrite other data on the
stack.

In such cases we do not need 'opts->error = -EINVAL' and can simply
'return NULL'. The bpf program cannot check opts->error anyway,
since opts is either NULL or opts__sz < 4.
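
A minimal userspace sketch of the early-return shape suggested above.
The struct is a cut-down stand-in for bpf_xfrm_state_opts (first two
fields only); demo_opts, DEMO_OPTS_SZ and demo_validate_opts are
hypothetical names used for illustration, not part of the patch:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for struct bpf_xfrm_state_opts. */
struct demo_opts {
	int32_t error;
	int32_t netns_id;
};

enum {
	DEMO_OPTS_SZ = sizeof(struct demo_opts),
};

/*
 * Returns 0 when the lookup may proceed and -1 when the caller should
 * return NULL. opts->error is written only after opts and opts__sz
 * have been validated.
 */
int demo_validate_opts(struct demo_opts *opts, uint32_t opts__sz)
{
	/* Do not touch opts here: it may be NULL or too small. */
	if (!opts || opts__sz != DEMO_OPTS_SZ)
		return -1;

	/* From here on opts is known to be a full-sized struct. */
	if (opts->netns_id < -1) {
		opts->error = -EINVAL;
		return -1;
	}

	return 0;
}
```

With this shape, a caller that passes a bad pointer or size only sees
the NULL return, while later failures are still reported via
opts->error.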

> +		return NULL;
> +	}
> +
> +	if (unlikely(opts->netns_id < BPF_F_CURRENT_NETNS)) {
> +		opts->error = -EINVAL;
> +		return NULL;
> +	}
> +
> +	if (opts->netns_id >= 0) {
> +		net = get_net_ns_by_id(net, opts->netns_id);
> +		if (unlikely(!net)) {
> +			opts->error = -ENONET;
> +			return NULL;
> +		}
> +	}
> +
> +	x = xfrm_state_lookup(net, opts->mark, &opts->daddr, opts->spi,
> +			      opts->proto, opts->family);
> +
> +	if (opts->netns_id >= 0)
> +		put_net(net);
> +
> +	return x;
> +}
> +
> +__diag_pop()

[...]



* Re: [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations
  2023-11-22 18:20 ` [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
@ 2023-11-26  0:34   ` Yonghong Song
  2023-11-26  4:34     ` Daniel Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2023-11-26  0:34 UTC (permalink / raw)
  To: Daniel Xu, shuah, daniel, andrii, ast, steffen.klassert,
	antony.antony, alexei.starovoitov
  Cc: mykolal, martin.lau, song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, bpf, linux-kselftest, linux-kernel, devel, netdev


On 11/22/23 1:20 PM, Daniel Xu wrote:
> vmlinux.h declarations are more ergonomic, especially when working with
> kfuncs. The uapi headers are often incomplete for kfunc definitions.
>
> Co-developed-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>   .../selftests/bpf/progs/bpf_tracing_net.h     |  1 +
>   .../selftests/bpf/progs/test_tunnel_kern.c    | 48 ++++---------------
>   2 files changed, 9 insertions(+), 40 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> index 0b793a102791..1bdc680b0e0e 100644
> --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> @@ -26,6 +26,7 @@
>   #define IPV6_AUTOFLOWLABEL	70
>   
>   #define TC_ACT_UNSPEC		(-1)
> +#define TC_ACT_OK		0
>   #define TC_ACT_SHOT		2
>   
>   #define SOL_TCP			6
> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> index f66af753bbbb..3065a716544d 100644
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -6,62 +6,30 @@
>    * modify it under the terms of version 2 of the GNU General Public
>    * License as published by the Free Software Foundation.
>    */
> -#include <stddef.h>
> -#include <string.h>
> -#include <arpa/inet.h>
> -#include <linux/bpf.h>
> -#include <linux/if_ether.h>
> -#include <linux/if_packet.h>
> -#include <linux/if_tunnel.h>
> -#include <linux/ip.h>
> -#include <linux/ipv6.h>
> -#include <linux/icmp.h>
> -#include <linux/types.h>
> -#include <linux/socket.h>
> -#include <linux/pkt_cls.h>
> -#include <linux/erspan.h>
> -#include <linux/udp.h>
> +#include "vmlinux.h"
>   #include <bpf/bpf_helpers.h>
>   #include <bpf/bpf_endian.h>
> +#include "bpf_kfuncs.h"
> +#include "bpf_tracing_net.h"
>   
>   #define log_err(__ret) bpf_printk("ERROR line:%d ret:%d\n", __LINE__, __ret)
>   
> -#define VXLAN_UDP_PORT 4789
> +#define VXLAN_UDP_PORT		4789
> +#define ETH_P_IP		0x0800
> +#define PACKET_HOST		0
> +#define TUNNEL_CSUM		bpf_htons(0x01)
> +#define TUNNEL_KEY		bpf_htons(0x04)
>   
>   /* Only IPv4 address assigned to veth1.
>    * 172.16.1.200
>    */
>   #define ASSIGNED_ADDR_VETH1 0xac1001c8
>   
> -struct geneve_opt {
> -	__be16	opt_class;
> -	__u8	type;
> -	__u8	length:5;
> -	__u8	r3:1;
> -	__u8	r2:1;
> -	__u8	r1:1;
> -	__u8	opt_data[8]; /* hard-coded to 8 byte */
> -};
> -
>   struct vxlanhdr {
>   	__be32 vx_flags;
>   	__be32 vx_vni;
>   } __attribute__((packed));

In my particular setup, I have struct vxlanhdr defined in vmlinux.h so
I hit a compilation failure.

>   
> -struct vxlan_metadata {
> -	__u32     gbp;
> -};
> -
> -struct bpf_fou_encap {
> -	__be16 sport;
> -	__be16 dport;
> -};
> -
> -enum bpf_fou_encap_type {
> -	FOU_BPF_ENCAP_FOU,
> -	FOU_BPF_ENCAP_GUE,
> -};
> -
>   int bpf_skb_set_fou_encap(struct __sk_buff *skb_ctx,
>   			  struct bpf_fou_encap *encap, int type) __ksym;
>   int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-22 18:20 ` [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
@ 2023-11-26  0:51   ` Yonghong Song
  2023-11-26  0:54     ` Alexei Starovoitov
  0 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2023-11-26  0:51 UTC (permalink / raw)
  To: Daniel Xu, shuah, daniel, andrii, ast, steffen.klassert,
	antony.antony, alexei.starovoitov, Eddy Z
  Cc: mykolal, martin.lau, song, john.fastabend, kpsingh, sdf, haoluo,
	jolsa, bpf, linux-kselftest, linux-kernel, devel, netdev


On 11/22/23 1:20 PM, Daniel Xu wrote:
> Switching to vmlinux.h definitions seems to make the verifier very
> unhappy with bitfield accesses. The error is:
>
>      ; md.u.md2.dir = direction;
>      33: (69) r1 = *(u16 *)(r2 +11)
>      misaligned stack access off (0x0; 0x0)+-64+11 size 2
>
> It looks like disabling CO-RE relocations seems to make the error go
> away.

Thanks for reporting. I did some preliminary investigation and the
failure occurs because we do not yet support CO-RE-based bitfield
stores. Besides disabling CO-RE relocations as in this patch, there
are a few other ways to handle this:
   - Change the code to avoid bitfield store and use 1/2/4/8 byte(s)
     store. A little bit ugly but it should work.
   - Use to-be-supported 'preserve_static_offset'
     (https://reviews.llvm.org/D133361)
     to preserve the offset. This might work (I didn't
     try it yet).
   - Eduard did some early study trying to remove CORE attribute
     (preserve_access_index) from UAPI structures. In this particular
     case, erspan_metadata is in /usr/include/linux/erspan.h.

We will also investigate whether bitfield stores can be supported
directly with CO-RE.
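
As a rough sketch of the first option (byte-level stores instead of
bitfield stores): the helpers below pack the individual fields into
whole bytes, so the program only needs plain one-byte stores. The
helper names are hypothetical, and the bit positions are an assumption
(little-endian bitfield order) that must be checked against the real
struct erspan_md2 layout before use:

```c
#include <stdint.h>

/*
 * Hypothetical helpers: each one builds a complete byte from the
 * individual field values. Bit positions are assumptions and are
 * noted next to each term.
 */
uint8_t demo_pack_byte0(uint8_t hwid_upper, uint8_t ft, uint8_t p)
{
	return (uint8_t)((hwid_upper & 0x3) |		/* bits 0-1 */
			 ((ft & 0x1f) << 2) |		/* bits 2-6 */
			 ((p & 0x1) << 7));		/* bit 7 */
}

uint8_t demo_pack_byte1(uint8_t o, uint8_t gra, uint8_t dir, uint8_t hwid)
{
	return (uint8_t)((o & 0x1) |			/* bit 0 */
			 ((gra & 0x3) << 1) |		/* bits 1-2 */
			 ((dir & 0x1) << 3) |		/* bit 3 */
			 ((hwid & 0xf) << 4));		/* bits 4-7 */
}
```

The test would then write each packed byte with a single byte-sized
store instead of assigning dir, hwid and hwid_upper one by one.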

>
> Co-developed-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Antony Antony <antony.antony@secunet.com>
> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> ---
>   tools/testing/selftests/bpf/progs/test_tunnel_kern.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> index 3065a716544d..ec7e04e012ae 100644
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -6,6 +6,7 @@
>    * modify it under the terms of version 2 of the GNU General Public
>    * License as published by the Free Software Foundation.
>    */
> +#define BPF_NO_PRESERVE_ACCESS_INDEX

This is a temporary workaround and hopefully we can lift it in the
near future. Please add a comment here with the prefix 'Workaround' to
explain why this is needed, so that later on we can easily search for
the keyword and remember to tackle this.

>   #include "vmlinux.h"
>   #include <bpf/bpf_helpers.h>
>   #include <bpf/bpf_endian.h>


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-26  0:51   ` Yonghong Song
@ 2023-11-26  0:54     ` Alexei Starovoitov
  2023-11-26  4:22       ` Yonghong Song
  0 siblings, 1 reply; 33+ messages in thread
From: Alexei Starovoitov @ 2023-11-26  0:54 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Daniel Xu, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony, Eddy Z,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Sat, Nov 25, 2023 at 4:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> >
> > diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > index 3065a716544d..ec7e04e012ae 100644
> > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > @@ -6,6 +6,7 @@
> >    * modify it under the terms of version 2 of the GNU General Public
> >    * License as published by the Free Software Foundation.
> >    */
> > +#define BPF_NO_PRESERVE_ACCESS_INDEX
>
> This is a temporary workaround and hopefully we can lift it in the
> near future. Please add a comment here with prefix 'Workaround' to
> explain why this is needed and later on we can easily search the
> keyword and remember to tackle this.

I suspect we will forget to remove this "workaround" and people
will start copy pasting it.
Let's change the test instead to avoid bitfield access.


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-26  0:54     ` Alexei Starovoitov
@ 2023-11-26  4:22       ` Yonghong Song
  2023-11-26 20:14         ` Eduard Zingerman
  0 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2023-11-26  4:22 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Xu, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony, Eddy Z,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development


On 11/25/23 7:54 PM, Alexei Starovoitov wrote:
> On Sat, Nov 25, 2023 at 4:52 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>>> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>> index 3065a716544d..ec7e04e012ae 100644
>>> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>> @@ -6,6 +6,7 @@
>>>     * modify it under the terms of version 2 of the GNU General Public
>>>     * License as published by the Free Software Foundation.
>>>     */
>>> +#define BPF_NO_PRESERVE_ACCESS_INDEX
>> This is a temporary workaround and hopefully we can lift it in the
>> near future. Please add a comment here with prefix 'Workaround' to
>> explain why this is needed and later on we can easily search the
>> keyword and remember to tackle this.
> I suspect we will forget to remove this "workaround" and people
> will start copy pasting it.
> Let's change the test instead to avoid bitfield access.

Agreed. Avoiding bitfield access is definitely a solution.
I also checked LLVM's preserve_static_offset (not merged yet), and
it seems to fix the issue as well.

With patch https://reviews.llvm.org/D133361 applied to the latest
llvm-project and the following patch on top of patch 6, the issue
goes away:

=====

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index ec7e04e012ae..11cbb12b4029 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -6,7 +6,10 @@
   * modify it under the terms of version 2 of the GNU General Public
   * License as published by the Free Software Foundation.
   */
-#define BPF_NO_PRESERVE_ACCESS_INDEX
+#if __has_attribute(preserve_static_offset)
+struct __attribute__((preserve_static_offset)) erspan_md2;
+struct __attribute__((preserve_static_offset)) erspan_metadata;
+#endif
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>
@@ -25,12 +28,12 @@
   * 172.16.1.200
   */
  #define ASSIGNED_ADDR_VETH1 0xac1001c8

  struct vxlanhdr {
         __be32 vx_flags;
         __be32 vx_vni;
  } __attribute__((packed));

  int bpf_skb_set_fou_encap(struct __sk_buff *skb_ctx,
                           struct bpf_fou_encap *encap, int type) __ksym;
  int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
@@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
         __u8 hwid = 7;
  
         md.version = 2;
+#if __has_attribute(preserve_static_offset)
         md.u.md2.dir = direction;
         md.u.md2.hwid = hwid & 0xf;
         md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
+#else
+       /* Change bit-field store to byte(s)-level stores. */
+#endif
  #endif
  
         ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));

====

Eduard, could you double check whether using the
preserve_static_offset attribute is a valid way to solve this kind
of issue?



* Re: [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations
  2023-11-26  0:34   ` Yonghong Song
@ 2023-11-26  4:34     ` Daniel Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-26  4:34 UTC (permalink / raw)
  To: Yonghong Song
  Cc: shuah, daniel, andrii, ast, steffen.klassert, antony.antony,
	alexei.starovoitov, mykolal, martin.lau, song, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, bpf, linux-kselftest, linux-kernel,
	devel, netdev

Hi Yonghong,

On Sat, Nov 25, 2023 at 04:34:36PM -0800, Yonghong Song wrote:
> 
> On 11/22/23 1:20 PM, Daniel Xu wrote:
> > vmlinux.h declarations are more ergonomic, especially when working with
> > kfuncs. The uapi headers are often incomplete for kfunc definitions.
> > 
> > Co-developed-by: Antony Antony <antony.antony@secunet.com>
> > Signed-off-by: Antony Antony <antony.antony@secunet.com>
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >   .../selftests/bpf/progs/bpf_tracing_net.h     |  1 +
> >   .../selftests/bpf/progs/test_tunnel_kern.c    | 48 ++++---------------
> >   2 files changed, 9 insertions(+), 40 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> > index 0b793a102791..1bdc680b0e0e 100644
> > --- a/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> > +++ b/tools/testing/selftests/bpf/progs/bpf_tracing_net.h
> > @@ -26,6 +26,7 @@
> >   #define IPV6_AUTOFLOWLABEL	70
> >   #define TC_ACT_UNSPEC		(-1)
> > +#define TC_ACT_OK		0
> >   #define TC_ACT_SHOT		2
> >   #define SOL_TCP			6
> > diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > index f66af753bbbb..3065a716544d 100644
> > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > @@ -6,62 +6,30 @@
> >    * modify it under the terms of version 2 of the GNU General Public
> >    * License as published by the Free Software Foundation.
> >    */
> > -#include <stddef.h>
> > -#include <string.h>
> > -#include <arpa/inet.h>
> > -#include <linux/bpf.h>
> > -#include <linux/if_ether.h>
> > -#include <linux/if_packet.h>
> > -#include <linux/if_tunnel.h>
> > -#include <linux/ip.h>
> > -#include <linux/ipv6.h>
> > -#include <linux/icmp.h>
> > -#include <linux/types.h>
> > -#include <linux/socket.h>
> > -#include <linux/pkt_cls.h>
> > -#include <linux/erspan.h>
> > -#include <linux/udp.h>
> > +#include "vmlinux.h"
> >   #include <bpf/bpf_helpers.h>
> >   #include <bpf/bpf_endian.h>
> > +#include "bpf_kfuncs.h"
> > +#include "bpf_tracing_net.h"
> >   #define log_err(__ret) bpf_printk("ERROR line:%d ret:%d\n", __LINE__, __ret)
> > -#define VXLAN_UDP_PORT 4789
> > +#define VXLAN_UDP_PORT		4789
> > +#define ETH_P_IP		0x0800
> > +#define PACKET_HOST		0
> > +#define TUNNEL_CSUM		bpf_htons(0x01)
> > +#define TUNNEL_KEY		bpf_htons(0x04)
> >   /* Only IPv4 address assigned to veth1.
> >    * 172.16.1.200
> >    */
> >   #define ASSIGNED_ADDR_VETH1 0xac1001c8
> > -struct geneve_opt {
> > -	__be16	opt_class;
> > -	__u8	type;
> > -	__u8	length:5;
> > -	__u8	r3:1;
> > -	__u8	r2:1;
> > -	__u8	r1:1;
> > -	__u8	opt_data[8]; /* hard-coded to 8 byte */
> > -};
> > -
> >   struct vxlanhdr {
> >   	__be32 vx_flags;
> >   	__be32 vx_vni;
> >   } __attribute__((packed));
> 
> In my particular setup, I have struct vxlanhdr defined in vmlinux.h so
> I hit a compilation failure.

Yeah, saw the same error in CI (the emails are nice btw). Looks like
vxlanhdr isn't even being used in this selftest. I've deleted it for v2.

Thanks,
Daniel


* Re: [PATCH ipsec-next v1 1/7] bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
  2023-11-25 20:36   ` Yonghong Song
@ 2023-11-26  4:38     ` Daniel Xu
  0 siblings, 0 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-26  4:38 UTC (permalink / raw)
  To: Yonghong Song
  Cc: john.fastabend, Herbert Xu, davem, ast, daniel, pabeni, hawk,
	kuba, edumazet, steffen.klassert, antony.antony,
	alexei.starovoitov, linux-kernel, netdev, bpf, devel

On Sat, Nov 25, 2023 at 12:36:29PM -0800, Yonghong Song wrote:
> 
> On 11/22/23 1:20 PM, Daniel Xu wrote:
> > This commit adds an unstable kfunc helper to access internal xfrm_state
> > associated with an SA. This is intended to be used for the upcoming
> > IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
> > words: for custom software RSS.
> > 
> > That being said, the function that this kfunc wraps is fairly generic
> > and used for a lot of xfrm tasks. I'm sure people will find uses
> > elsewhere over time.
> > 
> > Co-developed-by: Antony Antony <antony.antony@secunet.com>
> > Signed-off-by: Antony Antony <antony.antony@secunet.com>
> > Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
> > ---
> >   include/net/xfrm.h        |   9 ++++
> >   net/xfrm/Makefile         |   1 +
> >   net/xfrm/xfrm_policy.c    |   2 +
> >   net/xfrm/xfrm_state_bpf.c | 111 ++++++++++++++++++++++++++++++++++++++
> >   4 files changed, 123 insertions(+)
> >   create mode 100644 net/xfrm/xfrm_state_bpf.c
> > 
> > diff --git a/include/net/xfrm.h b/include/net/xfrm.h
> > index c9bb0f892f55..1d107241b901 100644
> > --- a/include/net/xfrm.h
> > +++ b/include/net/xfrm.h
> > @@ -2190,4 +2190,13 @@ static inline int register_xfrm_interface_bpf(void)
> >   #endif
> > +#if IS_ENABLED(CONFIG_DEBUG_INFO_BTF)
> > +int register_xfrm_state_bpf(void);
> > +#else
> > +static inline int register_xfrm_state_bpf(void)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >   #endif	/* _NET_XFRM_H */
> > diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
> > index cd47f88921f5..547cec77ba03 100644
> > --- a/net/xfrm/Makefile
> > +++ b/net/xfrm/Makefile
> > @@ -21,3 +21,4 @@ obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
> >   obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
> >   obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
> >   obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
> > +obj-$(CONFIG_DEBUG_INFO_BTF) += xfrm_state_bpf.o
> > diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
> > index c13dc3ef7910..1b7e75159727 100644
> > --- a/net/xfrm/xfrm_policy.c
> > +++ b/net/xfrm/xfrm_policy.c
> > @@ -4218,6 +4218,8 @@ void __init xfrm_init(void)
> >   #ifdef CONFIG_XFRM_ESPINTCP
> >   	espintcp_init();
> >   #endif
> > +
> > +	register_xfrm_state_bpf();
> >   }
> >   #ifdef CONFIG_AUDITSYSCALL
> > diff --git a/net/xfrm/xfrm_state_bpf.c b/net/xfrm/xfrm_state_bpf.c
> > new file mode 100644
> > index 000000000000..0c1f2f91125c
> > --- /dev/null
> > +++ b/net/xfrm/xfrm_state_bpf.c
> > @@ -0,0 +1,111 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Unstable XFRM state BPF helpers.
> > + *
> > + * Note that it is allowed to break compatibility for these functions since the
> > + * interface they are exposed through to BPF programs is explicitly unstable.
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf_ids.h>
> > +#include <net/xdp.h>
> > +#include <net/xfrm.h>
> > +
> > +/* bpf_xfrm_state_opts - Options for XFRM state lookup helpers
> > + *
> > + * Members:
> > + * @error      - Out parameter, set for any errors encountered
> > + *		 Values:
> > + *		   -EINVAL - netns_id is less than -1
> > + *		   -EINVAL - Passed NULL for opts
> > + *		   -EINVAL - opts__sz isn't BPF_XFRM_STATE_OPTS_SZ
> > + *		   -ENONET - No network namespace found for netns_id
> > + * @netns_id	- Specify the network namespace for lookup
> > + *		 Values:
> > + *		   BPF_F_CURRENT_NETNS (-1)
> > + *		     Use namespace associated with ctx
> > + *		   [0, S32_MAX]
> > + *		     Network Namespace ID
> > + * @mark	- XFRM mark to match on
> > + * @daddr	- Destination address to match on
> > + * @spi		- Security parameter index to match on
> > + * @proto	- L3 protocol to match on
> > + * @family	- L3 protocol family to match on
> > + */
> > +struct bpf_xfrm_state_opts {
> > +	s32 error;
> > +	s32 netns_id;
> > +	u32 mark;
> > +	xfrm_address_t daddr;
> > +	__be32 spi;
> > +	u8 proto;
> > +	u16 family;
> > +};
> > +
> > +enum {
> > +	BPF_XFRM_STATE_OPTS_SZ = sizeof(struct bpf_xfrm_state_opts),
> > +};
> > +
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > +		  "Global functions as their definitions will be in xfrm_state BTF");
> > +
> > +/* bpf_xdp_get_xfrm_state - Get XFRM state
> > + *
> > + * Parameters:
> > + * @ctx 	- Pointer to ctx (xdp_md) in XDP program
> > + *		    Cannot be NULL
> > + * @opts	- Options for lookup (documented above)
> > + *		    Cannot be NULL
> > + * @opts__sz	- Length of the bpf_xfrm_state_opts structure
> > + *		    Must be BPF_XFRM_STATE_OPTS_SZ
> > + */
> > +__bpf_kfunc struct xfrm_state *
> > +bpf_xdp_get_xfrm_state(struct xdp_md *ctx, struct bpf_xfrm_state_opts *opts, u32 opts__sz)
> > +{
> > +	struct xdp_buff *xdp = (struct xdp_buff *)ctx;
> > +	struct net *net = dev_net(xdp->rxq->dev);
> > +	struct xfrm_state *x;
> > +
> > +	if (!opts || opts__sz != BPF_XFRM_STATE_OPTS_SZ) {
> > +		opts->error = -EINVAL;
> 
> If opts is NULL, the opts->error store is obviously a NULL pointer
> dereference. If opts is not NULL and opts__sz < 4, the opts->error
> store is also a problem, since it may overwrite other data on the
> stack.
> 
> In such cases we do not need 'opts->error = -EINVAL' and can simply
> 'return NULL'. The bpf program cannot check opts->error anyway,
> since opts is either NULL or opts__sz < 4.

Ack, will fix.

[...]


Thanks,
Daniel


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-26  4:22       ` Yonghong Song
@ 2023-11-26 20:14         ` Eduard Zingerman
  2023-11-27  0:04           ` Daniel Xu
  2023-11-27  5:20           ` Yonghong Song
  0 siblings, 2 replies; 33+ messages in thread
From: Eduard Zingerman @ 2023-11-26 20:14 UTC (permalink / raw)
  To: Yonghong Song, Alexei Starovoitov
  Cc: Daniel Xu, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Sat, 2023-11-25 at 20:22 -0800, Yonghong Song wrote:
[...]
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -6,7 +6,10 @@
>    * modify it under the terms of version 2 of the GNU General Public
>    * License as published by the Free Software Foundation.
>    */
> -#define BPF_NO_PRESERVE_ACCESS_INDEX
> +#if __has_attribute(preserve_static_offset)
> +struct __attribute__((preserve_static_offset)) erspan_md2;
> +struct __attribute__((preserve_static_offset)) erspan_metadata;
> +#endif
>   #include "vmlinux.h"
[...]
>   int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
> @@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
>          __u8 hwid = 7;
>   
>          md.version = 2;
> +#if __has_attribute(preserve_static_offset)
>          md.u.md2.dir = direction;
>          md.u.md2.hwid = hwid & 0xf;
>          md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
> +#else
> +       /* Change bit-field store to byte(s)-level stores. */
> +#endif
>   #endif
>   
>          ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
> 
> ====
> 
> Eduard, could you double check whether this is a valid use case
> to solve this kind of issue with preserve_static_offset attribute?

Tbh I'm not sure. This test passes with preserve_static_offset
because it suppresses preserve_access_index. In general clang
translates bitfield access to a set of IR statements like:

  C:
    struct foo {
      unsigned _;
      unsigned a:1;
      ...
    };
    ... foo->a ...

  IR:
    %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
    %bf.load = load i8, ptr %a, align 4
    %bf.clear = and i8 %bf.load, 1
    %bf.cast = zext i8 %bf.clear to i32

With preserve_static_offset the getelementptr+load are replaced by a
single statement which is preserved as-is until code generation,
so the load with align 4 is preserved.

On the other hand, I'm not sure that clang guarantees that the loads or
stores used for bitfield access are always aligned according to
verifier expectations.

I think we should check if there are some clang knobs that prevent
generation of unaligned memory access. I'll take a look.
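
For reference, a hand-written C sketch of what the IR sequence above
does when reading 'a'. DEMO_A_OFFSET, DEMO_A_MASK and demo_read_a are
made-up names, and the offset and mask assume a layout where 'a' is
bit 0 of the byte right after the 4-byte member '_':

```c
#include <stdint.h>
#include <string.h>

/* Assumed position of the bitfield 'a' inside the object. */
#define DEMO_A_OFFSET	4
#define DEMO_A_MASK	0x1

/*
 * Manual equivalent of the IR: address the byte that holds the
 * bitfield, load it, mask off the neighbouring bits and zero-extend
 * the result to 32 bits.
 */
uint32_t demo_read_a(const void *obj)
{
	uint8_t byte;

	/* getelementptr + load i8 */
	memcpy(&byte, (const uint8_t *)obj + DEMO_A_OFFSET, sizeof(byte));

	/* and i8 ..., 1 followed by zext i8 to i32 */
	return (uint32_t)(byte & DEMO_A_MASK);
}
```

The point is only that the access size and address are chosen by the
compiler, not by the declared type of the bitfield.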


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-26 20:14         ` Eduard Zingerman
@ 2023-11-27  0:04           ` Daniel Xu
  2023-11-27  1:52             ` Eduard Zingerman
  2023-11-27  5:20           ` Yonghong Song
  1 sibling, 1 reply; 33+ messages in thread
From: Daniel Xu @ 2023-11-27  0:04 UTC (permalink / raw)
  To: Eduard Zingerman
  Cc: Yonghong Song, Alexei Starovoitov, Shuah Khan, Daniel Borkmann,
	Andrii Nakryiko, Alexei Starovoitov, Steffen Klassert,
	antony.antony, Mykola Lysenko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf, open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

Hi,

On Sun, Nov 26, 2023 at 10:14:21PM +0200, Eduard Zingerman wrote:
> On Sat, 2023-11-25 at 20:22 -0800, Yonghong Song wrote:
> [...]
> > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > @@ -6,7 +6,10 @@
> >    * modify it under the terms of version 2 of the GNU General Public
> >    * License as published by the Free Software Foundation.
> >    */
> > -#define BPF_NO_PRESERVE_ACCESS_INDEX
> > +#if __has_attribute(preserve_static_offset)
> > +struct __attribute__((preserve_static_offset)) erspan_md2;
> > +struct __attribute__((preserve_static_offset)) erspan_metadata;
> > +#endif
> >   #include "vmlinux.h"
> [...]
> >   int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
> > @@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
> >          __u8 hwid = 7;
> >   
> >          md.version = 2;
> > +#if __has_attribute(preserve_static_offset)
> >          md.u.md2.dir = direction;
> >          md.u.md2.hwid = hwid & 0xf;
> >          md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
> > +#else
> > +       /* Change bit-field store to byte(s)-level stores. */
> > +#endif
> >   #endif
> >   
> >          ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
> > 
> > ====
> > 
> > Eduard, could you double check whether this is a valid use case
> > to solve this kind of issue with preserve_static_offset attribute?
> 
> Tbh I'm not sure. This test passes with preserve_static_offset
> because it suppresses preserve_access_index. In general clang
> translates bitfield access to a set of IR statements like:
> 
>   C:
>     struct foo {
>       unsigned _;
>       unsigned a:1;
>       ...
>     };
>     ... foo->a ...
> 
>   IR:
>     %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
>     %bf.load = load i8, ptr %a, align 4
>     %bf.clear = and i8 %bf.load, 1
>     %bf.cast = zext i8 %bf.clear to i32
> 
> With preserve_static_offset the getelementptr+load are replaced by a
> single statement which is preserved as-is till code generation,
> thus load with align 4 is preserved.
> 
> On the other hand, I'm not sure that clang guarantees that load or
> stores used for bitfield access would be always aligned according to
> verifier expectations.
> 
> I think we should check if there are some clang knobs that prevent
> generation of unaligned memory access. I'll take a look.

Is there a reason to prefer fixing this in the compiler? I'm not opposed
to it, but the downside of a compiler fix is that it takes years to
propagate and sprinkles ifdefs into the code.

Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-27  0:04           ` Daniel Xu
@ 2023-11-27  1:52             ` Eduard Zingerman
  2023-11-27  5:44               ` Yonghong Song
  0 siblings, 1 reply; 33+ messages in thread
From: Eduard Zingerman @ 2023-11-27  1:52 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Yonghong Song, Alexei Starovoitov, Shuah Khan, Daniel Borkmann,
	Andrii Nakryiko, Alexei Starovoitov, Steffen Klassert,
	antony.antony, Mykola Lysenko, Martin KaFai Lau, Song Liu,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf, open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
[...]
> > Tbh I'm not sure. This test passes with preserve_static_offset
> > because it suppresses preserve_access_index. In general clang
> > translates bitfield access to a set of IR statements like:
> > 
> >   C:
> >     struct foo {
> >       unsigned _;
> >       unsigned a:1;
> >       ...
> >     };
> >     ... foo->a ...
> > 
> >   IR:
> >     %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> >     %bf.load = load i8, ptr %a, align 4
> >     %bf.clear = and i8 %bf.load, 1
> >     %bf.cast = zext i8 %bf.clear to i32
> > 
> > With preserve_static_offset the getelementptr+load are replaced by a
> > single statement which is preserved as-is till code generation,
> > thus load with align 4 is preserved.
> > 
> > On the other hand, I'm not sure that clang guarantees that load or
> > stores used for bitfield access would be always aligned according to
> > verifier expectations.
> > 
> > I think we should check if there are some clang knobs that prevent
> > generation of unaligned memory access. I'll take a look.
> 
> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> but the downside to compiler fix is it takes years to propagate and
> sprinkles ifdefs into the code.
>
> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?

Well, the contraption below passes verification, and the tunnel selftest
appears to work. I might have messed up some shifts in the macro, though.

Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a
particular field, the access might be unaligned.

---

diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
index 3065a716544d..41cd913ac7ff 100644
--- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
@@ -9,6 +9,7 @@
 #include "vmlinux.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_core_read.h>
 #include "bpf_kfuncs.h"
 #include "bpf_tracing_net.h"
 
@@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
 	return TC_ACT_OK;
 }
 
+#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({			\
+	void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);	\
+	unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);		\
+	unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64);		\
+	unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64);		\
+	unsigned bit_size = (rshift - lshift);				\
+	unsigned long long nval, val, hi, lo;				\
+									\
+	asm volatile("" : "=r"(p) : "0"(p));				\
+									\
+	switch (byte_size) {						\
+	case 1: val = *(unsigned char *)p; break;			\
+	case 2: val = *(unsigned short *)p; break;			\
+	case 4: val = *(unsigned int *)p; break;			\
+	case 8: val = *(unsigned long long *)p; break;			\
+	}								\
+	hi = val >> (bit_size + rshift);				\
+	hi <<= bit_size + rshift;					\
+	lo = val << (bit_size + lshift);				\
+	lo >>= bit_size + lshift;					\
+	nval = new_val;							\
+	nval <<= lshift;						\
+	nval >>= rshift;						\
+	val = hi | nval | lo;						\
+	switch (byte_size) {						\
+	case 1: *(unsigned char *)p      = val; break;			\
+	case 2: *(unsigned short *)p     = val; break;			\
+	case 4: *(unsigned int *)p       = val; break;			\
+	case 8: *(unsigned long long *)p = val; break;			\
+	}								\
+})
+
 SEC("tc")
 int erspan_set_tunnel(struct __sk_buff *skb)
 {
@@ -173,9 +206,9 @@ int erspan_set_tunnel(struct __sk_buff *skb)
 	__u8 hwid = 7;
 
 	md.version = 2;
-	md.u.md2.dir = direction;
-	md.u.md2.hwid = hwid & 0xf;
-	md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
+	BPF_CORE_WRITE_BITFIELD(&md.u.md2, dir, direction);
+	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid, (hwid & 0xf));
+	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid_upper, (hwid >> 4) & 0x3);
 #endif
 
 	ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
@@ -214,8 +247,9 @@ int erspan_get_tunnel(struct __sk_buff *skb)
 	bpf_printk("\tindex %x\n", index);
 #else
 	bpf_printk("\tdirection %d hwid %x timestamp %u\n",
-		   md.u.md2.dir,
-		   (md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
+		   BPF_CORE_READ_BITFIELD(&md.u.md2, dir),
+		   (BPF_CORE_READ_BITFIELD(&md.u.md2, hwid_upper) << 4) +
+		   BPF_CORE_READ_BITFIELD(&md.u.md2, hwid),
 		   bpf_ntohl(md.u.md2.timestamp));
 #endif
 
@@ -252,9 +286,9 @@ int ip4ip6erspan_set_tunnel(struct __sk_buff *skb)
 	__u8 hwid = 17;
 
 	md.version = 2;
-	md.u.md2.dir = direction;
-	md.u.md2.hwid = hwid & 0xf;
-	md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
+	BPF_CORE_WRITE_BITFIELD(&md.u.md2, dir, direction);
+	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid, (hwid & 0xf));
+	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid_upper, (hwid >> 4) & 0x3);
 #endif
 
 	ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
@@ -294,8 +328,9 @@ int ip4ip6erspan_get_tunnel(struct __sk_buff *skb)
 	bpf_printk("\tindex %x\n", index);
 #else
 	bpf_printk("\tdirection %d hwid %x timestamp %u\n",
-		   md.u.md2.dir,
-		   (md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
+		   BPF_CORE_READ_BITFIELD(&md.u.md2, dir),
+		   (BPF_CORE_READ_BITFIELD(&md.u.md2, hwid_upper) << 4) +
+		   BPF_CORE_READ_BITFIELD(&md.u.md2, hwid),
 		   bpf_ntohl(md.u.md2.timestamp));
 #endif
 

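For illustration, the read-modify-write that a bitfield-write helper has
to perform can be modelled on the host with plain integers. This is a
sketch, not the macro from the patch above. It assumes the little-endian
convention for the two shift relocations (LSHIFT_U64 = 64 - (bit offset
within the loaded unit + width), RSHIFT_U64 = 64 - width), under which
rshift - lshift is the field's bit offset rather than its width. The
function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model only; the function names are hypothetical.
 * Assumed convention (little-endian), for a field of width bit_sz that
 * starts bit_off bits into the loaded unit:
 *   lshift = 64 - (bit_off + bit_sz)
 *   rshift = 64 - bit_sz
 */

/* Extract an unsigned field: push it to the top of the u64, then pull
 * it back down so that it starts at bit 0. */
static uint64_t model_read_bitfield(uint64_t unit, unsigned lshift,
				    unsigned rshift)
{
	assert(lshift <= rshift && rshift < 64);
	return (unit << lshift) >> rshift;
}

/* Return `unit` with the field replaced by new_val; all bits outside
 * the field are preserved. */
static uint64_t model_write_bitfield(uint64_t unit, unsigned lshift,
				     unsigned rshift, uint64_t new_val)
{
	uint64_t mask;
	unsigned rpad;

	assert(lshift <= rshift && rshift < 64);
	rpad = rshift - lshift;			/* bit offset of the field */
	mask = (~0ULL << rshift) >> lshift;	/* ones over the field only */
	return (unit & ~mask) | ((new_val << rpad) & mask);
}
```

With a 1-bit field at bit 3 (lshift 60, rshift 63) and a 4-bit field at
bit 4 (lshift 56, rshift 60), writing 1 and then 7 into a zeroed byte
yields 0x78, and both fields read back unchanged.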
^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-26 20:14         ` Eduard Zingerman
  2023-11-27  0:04           ` Daniel Xu
@ 2023-11-27  5:20           ` Yonghong Song
  1 sibling, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2023-11-27  5:20 UTC (permalink / raw)
  To: Eduard Zingerman, Alexei Starovoitov
  Cc: Daniel Xu, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development


On 11/26/23 3:14 PM, Eduard Zingerman wrote:
> On Sat, 2023-11-25 at 20:22 -0800, Yonghong Song wrote:
> [...]
>> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>> @@ -6,7 +6,10 @@
>>     * modify it under the terms of version 2 of the GNU General Public
>>     * License as published by the Free Software Foundation.
>>     */
>> -#define BPF_NO_PRESERVE_ACCESS_INDEX
>> +#if __has_attribute(preserve_static_offset)
>> +struct __attribute__((preserve_static_offset)) erspan_md2;
>> +struct __attribute__((preserve_static_offset)) erspan_metadata;
>> +#endif
>>    #include "vmlinux.h"
> [...]
>>    int bpf_skb_get_fou_encap(struct __sk_buff *skb_ctx,
>> @@ -174,9 +177,13 @@ int erspan_set_tunnel(struct __sk_buff *skb)
>>           __u8 hwid = 7;
>>    
>>           md.version = 2;
>> +#if __has_attribute(preserve_static_offset)
>>           md.u.md2.dir = direction;
>>           md.u.md2.hwid = hwid & 0xf;
>>           md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
>> +#else
>> +       /* Change bit-field store to byte(s)-level stores. */
>> +#endif
>>    #endif
>>    
>>           ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
>>
>> ====
>>
>> Eduard, could you double check whether this is a valid use case
>> to solve this kind of issue with preserve_static_offset attribute?
> Tbh I'm not sure. This test passes with preserve_static_offset
> because it suppresses preserve_access_index. In general clang
> translates bitfield access to a set of IR statements like:
>
>    C:
>      struct foo {
>        unsigned _;
>        unsigned a:1;
>        ...
>      };
>      ... foo->a ...
>
>    IR:
>      %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
>      %bf.load = load i8, ptr %a, align 4
>      %bf.clear = and i8 %bf.load, 1
>      %bf.cast = zext i8 %bf.clear to i32
>
> With preserve_static_offset the getelementptr+load are replaced by a
> single statement which is preserved as-is till code generation,
> thus load with align 4 is preserved.
>
> On the other hand, I'm not sure that clang guarantees that load or
> stores used for bitfield access would be always aligned according to
> verifier expectations.

I think it should be true. The frontend does alignment analysis based on
types and packing (packed vs. unpacked) and assigns each load/store the
proper alignment (like 'align 4' in the above). 'align 4' truly means
the load itself is 4-byte aligned. Otherwise, it would be very confusing
for architectures that do not support unaligned memory access (e.g. BPF).

>
> I think we should check if there are some clang knobs that prevent
> generation of unaligned memory access. I'll take a look.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-27  1:52             ` Eduard Zingerman
@ 2023-11-27  5:44               ` Yonghong Song
  2023-11-27  5:53                 ` Yonghong Song
  0 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2023-11-27  5:44 UTC (permalink / raw)
  To: Eduard Zingerman, Daniel Xu
  Cc: Alexei Starovoitov, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development


On 11/26/23 8:52 PM, Eduard Zingerman wrote:
> On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
> [...]
>>> Tbh I'm not sure. This test passes with preserve_static_offset
>>> because it suppresses preserve_access_index. In general clang
>>> translates bitfield access to a set of IR statements like:
>>>
>>>    C:
>>>      struct foo {
>>>        unsigned _;
>>>        unsigned a:1;
>>>        ...
>>>      };
>>>      ... foo->a ...
>>>
>>>    IR:
>>>      %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
>>>      %bf.load = load i8, ptr %a, align 4
>>>      %bf.clear = and i8 %bf.load, 1
>>>      %bf.cast = zext i8 %bf.clear to i32
>>>
>>> With preserve_static_offset the getelementptr+load are replaced by a
>>> single statement which is preserved as-is till code generation,
>>> thus load with align 4 is preserved.
>>>
>>> On the other hand, I'm not sure that clang guarantees that load or
>>> stores used for bitfield access would be always aligned according to
>>> verifier expectations.
>>>
>>> I think we should check if there are some clang knobs that prevent
>>> generation of unaligned memory access. I'll take a look.
>> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
>> but the downside to compiler fix is it takes years to propagate and
>> sprinkles ifdefs into the code.
>>
>> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
> Well, the contraption below passes verification, tunnel selftest
> appears to work. I might have messed up some shifts in the macro, though.

I didn't test it, but at a high level it should work.

>
> Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a
> particular field, the access might be unaligned.

clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet the
alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
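As a rough sketch of what "sensible" could mean here: start from the
size of the field's declared type and keep doubling the window until an
aligned window covers the whole bit range. The struct and function names
below are hypothetical, and this is only an illustration of the idea,
not a statement of what clang or libbpf actually do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result type: an aligned byte window inside the struct. */
struct window {
	unsigned byte_off;
	unsigned byte_sz;
};

/* Find the smallest naturally aligned window (1, 2, 4 or 8 bytes, at
 * least type_sz wide) that covers bits [bit_off, bit_off + bit_sz).
 * Returns false if even an 8-byte window cannot cover the field. */
static bool pick_aligned_window(unsigned bit_off, unsigned bit_sz,
				unsigned type_sz, struct window *w)
{
	unsigned byte_sz = type_sz;
	unsigned byte_off;

	assert(bit_sz >= 1 && bit_sz <= 64);
	assert(type_sz == 1 || type_sz == 2 || type_sz == 4 || type_sz == 8);

	/* Round the field's byte position down to a multiple of byte_sz. */
	byte_off = bit_off / 8 / byte_sz * byte_sz;
	while (bit_off + bit_sz - byte_off * 8 > byte_sz * 8) {
		if (byte_sz >= 8)
			return false;
		byte_sz *= 2;
		byte_off = bit_off / 8 / byte_sz * byte_sz;
	}
	w->byte_off = byte_off;
	w->byte_sz = byte_sz;
	return true;
}
```

Because byte_off is always a multiple of byte_sz, the resulting access is
naturally aligned relative to the start of the struct.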

>
> ---
>
> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> index 3065a716544d..41cd913ac7ff 100644
> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> @@ -9,6 +9,7 @@
>   #include "vmlinux.h"
>   #include <bpf/bpf_helpers.h>
>   #include <bpf/bpf_endian.h>
> +#include <bpf/bpf_core_read.h>
>   #include "bpf_kfuncs.h"
>   #include "bpf_tracing_net.h"
>   
> @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
>   	return TC_ACT_OK;
>   }
>   
> +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({			\
> +	void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);	\
> +	unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);		\
> +	unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64);		\
> +	unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64);		\
> +	unsigned bit_size = (rshift - lshift);				\
> +	unsigned long long nval, val, hi, lo;				\
> +									\
> +	asm volatile("" : "=r"(p) : "0"(p));				\

Use asm volatile("" : "+r"(p)) ?
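To spell the suggestion out: with GCC-style extended asm, a single
read-write operand ("+r") is shorthand for an output operand tied to a
matching input ("=r" plus "0"). A minimal host-side sketch, with made-up
helper names:

```c
#include <assert.h>

/* Both helpers emit no instructions. The empty asm only makes the
 * compiler treat the pointer as if it had been rewritten, which hides
 * its provenance from later optimizations. Helper names are
 * hypothetical. */

/* Form used in the patch: output operand tied to input operand 0. */
static void *launder_tied(void *p)
{
	assert(p);
	__asm__ volatile("" : "=r"(p) : "0"(p));
	return p;
}

/* Suggested form: one read-write operand. */
static void *launder_rw(void *p)
{
	assert(p);
	__asm__ volatile("" : "+r"(p));
	return p;
}
```

Both return the pointer value they were given.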

> +									\
> +	switch (byte_size) {						\
> +	case 1: val = *(unsigned char *)p; break;			\
> +	case 2: val = *(unsigned short *)p; break;			\
> +	case 4: val = *(unsigned int *)p; break;			\
> +	case 8: val = *(unsigned long long *)p; break;			\
> +	}								\
> +	hi = val >> (bit_size + rshift);				\
> +	hi <<= bit_size + rshift;					\
> +	lo = val << (bit_size + lshift);				\
> +	lo >>= bit_size + lshift;					\
> +	nval = new_val;							\
> +	nval <<= lshift;						\
> +	nval >>= rshift;						\
> +	val = hi | nval | lo;						\
> +	switch (byte_size) {						\
> +	case 1: *(unsigned char *)p      = val; break;			\
> +	case 2: *(unsigned short *)p     = val; break;			\
> +	case 4: *(unsigned int *)p       = val; break;			\
> +	case 8: *(unsigned long long *)p = val; break;			\
> +	}								\
> +})

I think this should be put in the libbpf public header files, but I'm
not sure where. bpf_core_read.h, although it is a CO-RE write?

On the other hand, this is a uapi struct bitfield write, so strictly
speaking a CO-RE write is unnecessary here. It would be great if we
could relieve users from dealing with such unnecessary CO-RE writes.
In that sense, for this particular case, I would prefer rewriting the
code to use byte-level stores...
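A byte-level version could look roughly like the sketch below. It treats
the 8-byte erspan_md2 blob as raw bytes and assumes the little-endian
bitfield layout (hwid_upper in bits 0-1 of byte 6; dir in bit 3 and hwid
in bits 4-7 of byte 7). Those positions are an assumption to be checked
against the uapi header, and the helper name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed little-endian bit positions; verify against the real header. */
#define MD2_HWID_UPPER_MASK	0x03	/* byte 6, bits 0-1 */
#define MD2_DIR_SHIFT		3	/* byte 7, bit 3 */
#define MD2_DIR_MASK		0x08
#define MD2_HWID_SHIFT		4	/* byte 7, bits 4-7 */
#define MD2_HWID_MASK		0xf0

/* Hypothetical helper: set dir and the 6-bit hwid with plain byte
 * stores, leaving the neighbouring bits of both bytes untouched. */
static void md2_set_dir_hwid(uint8_t *md2, uint8_t dir, uint8_t hwid)
{
	assert(dir <= 1 && hwid <= 0x3f);

	md2[6] = (uint8_t)((md2[6] & ~MD2_HWID_UPPER_MASK) |
			   ((hwid >> 4) & MD2_HWID_UPPER_MASK));
	md2[7] = (uint8_t)((md2[7] & ~(MD2_DIR_MASK | MD2_HWID_MASK)) |
			   ((dir << MD2_DIR_SHIFT) & MD2_DIR_MASK) |
			   ((hwid << MD2_HWID_SHIFT) & MD2_HWID_MASK));
}
```

Each store is a single byte, so no alignment question arises.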

> +
>   SEC("tc")
>   int erspan_set_tunnel(struct __sk_buff *skb)
>   {
> @@ -173,9 +206,9 @@ int erspan_set_tunnel(struct __sk_buff *skb)
>   	__u8 hwid = 7;
>   
>   	md.version = 2;
> -	md.u.md2.dir = direction;
> -	md.u.md2.hwid = hwid & 0xf;
> -	md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
> +	BPF_CORE_WRITE_BITFIELD(&md.u.md2, dir, direction);
> +	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid, (hwid & 0xf));
> +	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid_upper, (hwid >> 4) & 0x3);
>   #endif
>   
>   	ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
> @@ -214,8 +247,9 @@ int erspan_get_tunnel(struct __sk_buff *skb)
>   	bpf_printk("\tindex %x\n", index);
>   #else
>   	bpf_printk("\tdirection %d hwid %x timestamp %u\n",
> -		   md.u.md2.dir,
> -		   (md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
> +		   BPF_CORE_READ_BITFIELD(&md.u.md2, dir),
> +		   (BPF_CORE_READ_BITFIELD(&md.u.md2, hwid_upper) << 4) +
> +		   BPF_CORE_READ_BITFIELD(&md.u.md2, hwid),
>   		   bpf_ntohl(md.u.md2.timestamp));
>   #endif
>   
> @@ -252,9 +286,9 @@ int ip4ip6erspan_set_tunnel(struct __sk_buff *skb)
>   	__u8 hwid = 17;
>   
>   	md.version = 2;
> -	md.u.md2.dir = direction;
> -	md.u.md2.hwid = hwid & 0xf;
> -	md.u.md2.hwid_upper = (hwid >> 4) & 0x3;
> +	BPF_CORE_WRITE_BITFIELD(&md.u.md2, dir, direction);
> +	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid, (hwid & 0xf));
> +	BPF_CORE_WRITE_BITFIELD(&md.u.md2, hwid_upper, (hwid >> 4) & 0x3);
>   #endif
>   
>   	ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
> @@ -294,8 +328,9 @@ int ip4ip6erspan_get_tunnel(struct __sk_buff *skb)
>   	bpf_printk("\tindex %x\n", index);
>   #else
>   	bpf_printk("\tdirection %d hwid %x timestamp %u\n",
> -		   md.u.md2.dir,
> -		   (md.u.md2.hwid_upper << 4) + md.u.md2.hwid,
> +		   BPF_CORE_READ_BITFIELD(&md.u.md2, dir),
> +		   (BPF_CORE_READ_BITFIELD(&md.u.md2, hwid_upper) << 4) +
> +		   BPF_CORE_READ_BITFIELD(&md.u.md2, hwid),
>   		   bpf_ntohl(md.u.md2.timestamp));
>   #endif
>   

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-27  5:44               ` Yonghong Song
@ 2023-11-27  5:53                 ` Yonghong Song
  2023-11-27 20:45                   ` Daniel Xu
  0 siblings, 1 reply; 33+ messages in thread
From: Yonghong Song @ 2023-11-27  5:53 UTC (permalink / raw)
  To: Eduard Zingerman, Daniel Xu
  Cc: Alexei Starovoitov, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development


On 11/27/23 12:44 AM, Yonghong Song wrote:
>
> On 11/26/23 8:52 PM, Eduard Zingerman wrote:
>> On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
>> [...]
>>>> Tbh I'm not sure. This test passes with preserve_static_offset
>>>> because it suppresses preserve_access_index. In general clang
>>>> translates bitfield access to a set of IR statements like:
>>>>
>>>>    C:
>>>>      struct foo {
>>>>        unsigned _;
>>>>        unsigned a:1;
>>>>        ...
>>>>      };
>>>>      ... foo->a ...
>>>>
>>>>    IR:
>>>>      %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
>>>>      %bf.load = load i8, ptr %a, align 4
>>>>      %bf.clear = and i8 %bf.load, 1
>>>>      %bf.cast = zext i8 %bf.clear to i32
>>>>
>>>> With preserve_static_offset the getelementptr+load are replaced by a
>>>> single statement which is preserved as-is till code generation,
>>>> thus load with align 4 is preserved.
>>>>
>>>> On the other hand, I'm not sure that clang guarantees that load or
>>>> stores used for bitfield access would be always aligned according to
>>>> verifier expectations.
>>>>
>>>> I think we should check if there are some clang knobs that prevent
>>>> generation of unaligned memory access. I'll take a look.
>>> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
>>> but the downside to compiler fix is it takes years to propagate and
>>> sprinkles ifdefs into the code.
>>>
>>> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
>> Well, the contraption below passes verification, tunnel selftest
>> appears to work. I might have messed up some shifts in the macro, 
>> though.
>
> I didn't test it. But from high level it should work.
>
>>
>> Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a
>> particular field, the access might be unaligned.
>
> clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
> alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
>
>>
>> ---
>>
>> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>> index 3065a716544d..41cd913ac7ff 100644
>> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>> @@ -9,6 +9,7 @@
>>   #include "vmlinux.h"
>>   #include <bpf/bpf_helpers.h>
>>   #include <bpf/bpf_endian.h>
>> +#include <bpf/bpf_core_read.h>
>>   #include "bpf_kfuncs.h"
>>   #include "bpf_tracing_net.h"
>>   
>> @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
>>       return TC_ACT_OK;
>>   }
>>   
>> +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
>> +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
>> +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
>> +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
>> +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
>> +    unsigned bit_size = (rshift - lshift);                \
>> +    unsigned long long nval, val, hi, lo;                \
>> +                                    \
>> +    asm volatile("" : "=r"(p) : "0"(p));                \
>
> Use asm volatile("" : "+r"(p)) ?
>
>> +                                    \
>> +    switch (byte_size) {                        \
>> +    case 1: val = *(unsigned char *)p; break;            \
>> +    case 2: val = *(unsigned short *)p; break;            \
>> +    case 4: val = *(unsigned int *)p; break;            \
>> +    case 8: val = *(unsigned long long *)p; break;            \
>> +    }                                \
>> +    hi = val >> (bit_size + rshift);                \
>> +    hi <<= bit_size + rshift;                    \
>> +    lo = val << (bit_size + lshift);                \
>> +    lo >>= bit_size + lshift;                    \
>> +    nval = new_val;                            \
>> +    nval <<= lshift;                        \
>> +    nval >>= rshift;                        \
>> +    val = hi | nval | lo;                        \
>> +    switch (byte_size) {                        \
>> +    case 1: *(unsigned char *)p      = val; break;            \
>> +    case 2: *(unsigned short *)p     = val; break;            \
>> +    case 4: *(unsigned int *)p       = val; break;            \
>> +    case 8: *(unsigned long long *)p = val; break;            \
>> +    }                                \
>> +})
>
> I think this should be put in libbpf public header files but not sure
> where to put it. bpf_core_read.h although it is core write?
>
> But on the other hand, this is a uapi struct bitfield write,
> strictly speaking, CORE write is really unnecessary here. It
> would be great if we can relieve users from dealing with
> such unnecessary CORE writes. In that sense, for this particular
> case, I would prefer rewriting the code by using byte-level
> stores...
or using preserve_static_offset to clearly signal that the bitfield CO-RE
should be undone ...

[...]


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-27  5:53                 ` Yonghong Song
@ 2023-11-27 20:45                   ` Daniel Xu
  2023-11-27 21:32                     ` Eduard Zingerman
  2023-11-28  0:01                     ` Daniel Xu
  0 siblings, 2 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-27 20:45 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
> 
> On 11/27/23 12:44 AM, Yonghong Song wrote:
> > 
> > On 11/26/23 8:52 PM, Eduard Zingerman wrote:
> > > On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
> > > [...]
> > > > > Tbh I'm not sure. This test passes with preserve_static_offset
> > > > > because it suppresses preserve_access_index. In general clang
> > > > > translates bitfield access to a set of IR statements like:
> > > > > 
> > > > >    C:
> > > > >      struct foo {
> > > > >        unsigned _;
> > > > >        unsigned a:1;
> > > > >        ...
> > > > >      };
> > > > >      ... foo->a ...
> > > > > 
> > > > >    IR:
> > > > >      %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> > > > >      %bf.load = load i8, ptr %a, align 4
> > > > >      %bf.clear = and i8 %bf.load, 1
> > > > >      %bf.cast = zext i8 %bf.clear to i32
> > > > > 
> > > > > With preserve_static_offset the getelementptr+load are replaced by a
> > > > > single statement which is preserved as-is till code generation,
> > > > > thus load with align 4 is preserved.
> > > > > 
> > > > > On the other hand, I'm not sure that clang guarantees that load or
> > > > > stores used for bitfield access would be always aligned according to
> > > > > verifier expectations.
> > > > > 
> > > > > I think we should check if there are some clang knobs that prevent
> > > > > generation of unaligned memory access. I'll take a look.
> > > > Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> > > > but the downside to compiler fix is it takes years to propagate and
> > > > sprinkles ifdefs into the code.
> > > > 
> > > > Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
> > > Well, the contraption below passes verification, tunnel selftest
> > > appears to work. I might have messed up some shifts in the macro,
> > > though.
> > 
> > I didn't test it. But from high level it should work.
> > 
> > > 
> > > Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a
> > > particular field, the access might be unaligned.
> > 
> > clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
> > alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
> > 
> > > 
> > > ---
> > > 
> > > diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > index 3065a716544d..41cd913ac7ff 100644
> > > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > @@ -9,6 +9,7 @@
> > >   #include "vmlinux.h"
> > >   #include <bpf/bpf_helpers.h>
> > >   #include <bpf/bpf_endian.h>
> > > +#include <bpf/bpf_core_read.h>
> > >   #include "bpf_kfuncs.h"
> > >   #include "bpf_tracing_net.h"
> > >   
> > > @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
> > >       return TC_ACT_OK;
> > >   }
> > >   
> > > +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
> > > +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
> > > +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
> > > +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
> > > +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
> > > +    unsigned bit_size = (rshift - lshift);                \
> > > +    unsigned long long nval, val, hi, lo;                \
> > > +                                    \
> > > +    asm volatile("" : "=r"(p) : "0"(p));                \
> > 
> > Use asm volatile("" : "+r"(p)) ?
> > 
> > > +                                    \
> > > +    switch (byte_size) {                        \
> > > +    case 1: val = *(unsigned char *)p; break;            \
> > > +    case 2: val = *(unsigned short *)p; break;            \
> > > +    case 4: val = *(unsigned int *)p; break;            \
> > > +    case 8: val = *(unsigned long long *)p; break;            \
> > > +    }                                \
> > > +    hi = val >> (bit_size + rshift);                \
> > > +    hi <<= bit_size + rshift;                    \
> > > +    lo = val << (bit_size + lshift);                \
> > > +    lo >>= bit_size + lshift;                    \
> > > +    nval = new_val;                            \
> > > +    nval <<= lshift;                        \
> > > +    nval >>= rshift;                        \
> > > +    val = hi | nval | lo;                        \
> > > +    switch (byte_size) {                        \
> > > +    case 1: *(unsigned char *)p      = val; break;            \
> > > +    case 2: *(unsigned short *)p     = val; break;            \
> > > +    case 4: *(unsigned int *)p       = val; break;            \
> > > +    case 8: *(unsigned long long *)p = val; break;            \
> > > +    }                                \
> > > +})
> > 
> > I think this should be put in libbpf public header files but not sure
> > where to put it. bpf_core_read.h although it is core write?
> > 
> > But on the other hand, this is a uapi struct bitfield write,
> > strictly speaking, CORE write is really unnecessary here. It
> > would be great if we can relieve users from dealing with
> > such unnecessary CORE writes. In that sense, for this particular
> > case, I would prefer rewriting the code by using byte-level
> > stores...
> or preserve_static_offset to clearly mean to undo bitfield CORE ...

Ok, I will do a byte-level rewrite for the next revision.

Just wondering, though: will bpftool be able to generate the appropriate
annotations for uapi structs? IIUC uapi structs look the same in BTF as
any other struct.

> 
> [...]
> 

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-27 20:45                   ` Daniel Xu
@ 2023-11-27 21:32                     ` Eduard Zingerman
  2023-11-28  0:01                     ` Daniel Xu
  1 sibling, 0 replies; 33+ messages in thread
From: Eduard Zingerman @ 2023-11-27 21:32 UTC (permalink / raw)
  To: Daniel Xu, Yonghong Song
  Cc: Alexei Starovoitov, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Mon, 2023-11-27 at 14:45 -0600, Daniel Xu wrote:
[...]
> IIUC uapi structs look the same in BTF as any other struct.

Yes, and they all share the preserve_access_index attribute because of
the way the attribute push/pop directives are generated in vmlinux.h.
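For reference, the push/pop pattern looks roughly like the sketch below:
every record declared between the push and the pop gets the attribute,
whether or not it comes from a uapi header. The two structs are
stand-ins invented for this example, and the extra compiler/target
guard is added here only so that the sketch also builds with a host
compiler.

```c
#include <assert.h>
#include <stddef.h>

/* In the generated header the only guard is BPF_NO_PRESERVE_ACCESS_INDEX;
 * the __clang__/__bpf__ checks are an addition for this sketch. */
#if defined(__clang__) && defined(__bpf__) && \
	!defined(BPF_NO_PRESERVE_ACCESS_INDEX)
#pragma clang attribute push (__attribute__((preserve_access_index)), apply_to = record)
#define DEMO_ATTR_PUSHED 1
#endif

/* Stand-in for a kernel-internal struct: relocatable access is wanted. */
struct demo_internal {
	int a;
	int b;
};

/* Stand-in for a uapi struct: the layout is stable, yet it is covered
 * by the same push/pop region and so gets the attribute as well. */
struct demo_uapi {
	unsigned char x;
	unsigned char y;
};

#ifdef DEMO_ATTR_PUSHED
#pragma clang attribute pop
#endif

/* The attribute changes how field accesses are emitted, not the layout. */
static_assert(sizeof(struct demo_uapi) == 2, "unexpected padding");
```

Defining BPF_NO_PRESERVE_ACCESS_INDEX before including the header turns
the attribute off for all records at once, which is the blunt option the
original patch in this thread used.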

> Just wondering, though: will bpftool be able to generate the appropriate
> annotations for uapi structs? 

The problem is that there is no easy way to identify whether a structure
is uapi in DWARF (from which BTF is generated).
One way to do this:
- modify pahole to check DW_AT_decl_file for each struct DWARF entry
  and generate some special decl tag in BTF;
- modify bpftool to interpret this tag as a marker to not generate
  preserve_access_index for a structure.

The drawback is that such behavior hardcodes some kernel-specific
assumptions in both pahole and bpftool. It also remains to be seen
whether DW_AT_decl_file tags are consistent.

It might be the case that allowing excessive CO-RE relocations is a
better option. (And maybe tweak something about bitfield access
generation to avoid such issues as in this thread).

Thanks,
Eduard

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-27 20:45                   ` Daniel Xu
  2023-11-27 21:32                     ` Eduard Zingerman
@ 2023-11-28  0:01                     ` Daniel Xu
  2023-11-28  4:06                       ` Yonghong Song
  1 sibling, 1 reply; 33+ messages in thread
From: Daniel Xu @ 2023-11-28  0:01 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Mon, Nov 27, 2023 at 02:45:11PM -0600, Daniel Xu wrote:
> On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
> > 
> > On 11/27/23 12:44 AM, Yonghong Song wrote:
> > > 
> > > On 11/26/23 8:52 PM, Eduard Zingerman wrote:
> > > > On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
> > > > [...]
> > > > > > Tbh I'm not sure. This test passes with preserve_static_offset
> > > > > > because it suppresses preserve_access_index. In general clang
> > > > > > translates bitfield access to a set of IR statements like:
> > > > > > 
> > > > > >    C:
> > > > > >      struct foo {
> > > > > >        unsigned _;
> > > > > >        unsigned a:1;
> > > > > >        ...
> > > > > >      };
> > > > > >      ... foo->a ...
> > > > > > 
> > > > > >    IR:
> > > > > >      %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> > > > > >      %bf.load = load i8, ptr %a, align 4
> > > > > >      %bf.clear = and i8 %bf.load, 1
> > > > > >      %bf.cast = zext i8 %bf.clear to i32
> > > > > > 
> > > > > > With preserve_static_offset the getelementptr+load are replaced by a
> > > > > > single statement which is preserved as-is till code generation,
> > > > > > thus load with align 4 is preserved.
> > > > > > 
> > > > > > On the other hand, I'm not sure that clang guarantees that load or
> > > > > > stores used for bitfield access would be always aligned according to
> > > > > > verifier expectations.
> > > > > > 
> > > > > > I think we should check if there are some clang knobs that prevent
> > > > > > generation of unaligned memory access. I'll take a look.
> > > > > Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> > > > > but the downside to compiler fix is it takes years to propagate and
> > > > > sprinkles ifdefs into the code.
> > > > > 
> > > > > Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
> > > > Well, the contraption below passes verification, tunnel selftest
> > > > appears to work. I might have messed up some shifts in the macro,
> > > > though.
> > > 
> > > I didn't test it. But from high level it should work.
> > > 
> > > > 
> > > > Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a particular
> > > > field, the access might be unaligned.
> > > 
> > > clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
> > > alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
> > > 
> > > > 
> > > > ---
> > > > 
> > > > diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > index 3065a716544d..41cd913ac7ff 100644
> > > > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > @@ -9,6 +9,7 @@
> > > >   #include "vmlinux.h"
> > > >   #include <bpf/bpf_helpers.h>
> > > >   #include <bpf/bpf_endian.h>
> > > > +#include <bpf/bpf_core_read.h>
> > > >   #include "bpf_kfuncs.h"
> > > >   #include "bpf_tracing_net.h"
> > > >   @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
> > > >       return TC_ACT_OK;
> > > >   }
> > > >   +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
> > > > +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
> > > > +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
> > > > +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
> > > > +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
> > > > +    unsigned bit_size = (rshift - lshift);                \
> > > > +    unsigned long long nval, val, hi, lo;                \
> > > > +                                    \
> > > > +    asm volatile("" : "=r"(p) : "0"(p));                \
> > > 
> > > Use asm volatile("" : "+r"(p)) ?
> > > 
> > > > +                                    \
> > > > +    switch (byte_size) {                        \
> > > > +    case 1: val = *(unsigned char *)p; break;            \
> > > > +    case 2: val = *(unsigned short *)p; break;            \
> > > > +    case 4: val = *(unsigned int *)p; break;            \
> > > > +    case 8: val = *(unsigned long long *)p; break;            \
> > > > +    }                                \
> > > > +    hi = val >> (bit_size + rshift);                \
> > > > +    hi <<= bit_size + rshift;                    \
> > > > +    lo = val << (bit_size + lshift);                \
> > > > +    lo >>= bit_size + lshift;                    \
> > > > +    nval = new_val;                            \
> > > > +    nval <<= lshift;                        \
> > > > +    nval >>= rshift;                        \
> > > > +    val = hi | nval | lo;                        \
> > > > +    switch (byte_size) {                        \
> > > > +    case 1: *(unsigned char *)p      = val; break;            \
> > > > +    case 2: *(unsigned short *)p     = val; break;            \
> > > > +    case 4: *(unsigned int *)p       = val; break;            \
> > > > +    case 8: *(unsigned long long *)p = val; break;            \
> > > > +    }                                \
> > > > +})
> > > 
> > > I think this should be put in libbpf public header files but not sure
> > > where to put it. bpf_core_read.h although it is core write?
> > > 
> > > But on the other hand, this is a uapi struct bitfield write,
> > > strictly speaking, CORE write is really unnecessary here. It
> > > would be great if we can relieve users from dealing with
> > > such unnecessary CORE writes. In that sense, for this particular
> > > case, I would prefer rewriting the code by using byte-level
> > > stores...
> > or preserve_static_offset to clearly mean to undo bitfield CORE ...
> 
> Ok, I will do byte-level rewrite for next revision.

[...]

This patch seems to work: https://pastes.dxuuu.xyz/0glrf9 .

But I don't think it's very pretty. Also I'm seeing on the internet that
people are saying the exact layout of bitfields is compiler dependent.
So I am wondering if these byte sized writes are correct. For that
matter, I am wondering how the GCC generated bitfield accesses line up
with clang generated BPF bytecode. Or why uapi contains a bitfield.

WDYT, should I send up v2 with this or should I do one of the other
approaches in this thread?

I am ok with any of the approaches.

Thanks,
Daniel

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-28  0:01                     ` Daniel Xu
@ 2023-11-28  4:06                       ` Yonghong Song
  2023-11-28 16:02                         ` Andrii Nakryiko
  2023-11-28 16:13                         ` Daniel Xu
  0 siblings, 2 replies; 33+ messages in thread
From: Yonghong Song @ 2023-11-28  4:06 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development


On 11/27/23 7:01 PM, Daniel Xu wrote:
> On Mon, Nov 27, 2023 at 02:45:11PM -0600, Daniel Xu wrote:
>> On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
>>> On 11/27/23 12:44 AM, Yonghong Song wrote:
>>>> On 11/26/23 8:52 PM, Eduard Zingerman wrote:
>>>>> On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
>>>>> [...]
>>>>>>> Tbh I'm not sure. This test passes with preserve_static_offset
>>>>>>> because it suppresses preserve_access_index. In general clang
>>>>>>> translates bitfield access to a set of IR statements like:
>>>>>>>
>>>>>>>     C:
>>>>>>>       struct foo {
>>>>>>>         unsigned _;
>>>>>>>         unsigned a:1;
>>>>>>>         ...
>>>>>>>       };
>>>>>>>       ... foo->a ...
>>>>>>>
>>>>>>>     IR:
>>>>>>>       %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
>>>>>>>       %bf.load = load i8, ptr %a, align 4
>>>>>>>       %bf.clear = and i8 %bf.load, 1
>>>>>>>       %bf.cast = zext i8 %bf.clear to i32
>>>>>>>
>>>>>>> With preserve_static_offset the getelementptr+load are replaced by a
>>>>>>> single statement which is preserved as-is till code generation,
>>>>>>> thus load with align 4 is preserved.
>>>>>>>
>>>>>>> On the other hand, I'm not sure that clang guarantees that load or
>>>>>>> stores used for bitfield access would be always aligned according to
>>>>>>> verifier expectations.
>>>>>>>
>>>>>>> I think we should check if there are some clang knobs that prevent
>>>>>>> generation of unaligned memory access. I'll take a look.
>>>>>> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
>>>>>> but the downside to compiler fix is it takes years to propagate and
>>>>>> sprinkles ifdefs into the code.
>>>>>>
>>>>>> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
>>>>> Well, the contraption below passes verification, tunnel selftest
>>>>> appears to work. I might have messed up some shifts in the macro,
>>>>> though.
>>>> I didn't test it. But from high level it should work.
>>>>
>>>>> Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a particular
>>>>> field, the access might be unaligned.
>>>> clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
>>>> alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>> b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>> index 3065a716544d..41cd913ac7ff 100644
>>>>> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>> @@ -9,6 +9,7 @@
>>>>>    #include "vmlinux.h"
>>>>>    #include <bpf/bpf_helpers.h>
>>>>>    #include <bpf/bpf_endian.h>
>>>>> +#include <bpf/bpf_core_read.h>
>>>>>    #include "bpf_kfuncs.h"
>>>>>    #include "bpf_tracing_net.h"
>>>>>    @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
>>>>>        return TC_ACT_OK;
>>>>>    }
>>>>>    +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
>>>>> +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
>>>>> +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
>>>>> +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
>>>>> +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
>>>>> +    unsigned bit_size = (rshift - lshift);                \
>>>>> +    unsigned long long nval, val, hi, lo;                \
>>>>> +                                    \
>>>>> +    asm volatile("" : "=r"(p) : "0"(p));                \
>>>> Use asm volatile("" : "+r"(p)) ?
>>>>
>>>>> +                                    \
>>>>> +    switch (byte_size) {                        \
>>>>> +    case 1: val = *(unsigned char *)p; break;            \
>>>>> +    case 2: val = *(unsigned short *)p; break;            \
>>>>> +    case 4: val = *(unsigned int *)p; break;            \
>>>>> +    case 8: val = *(unsigned long long *)p; break;            \
>>>>> +    }                                \
>>>>> +    hi = val >> (bit_size + rshift);                \
>>>>> +    hi <<= bit_size + rshift;                    \
>>>>> +    lo = val << (bit_size + lshift);                \
>>>>> +    lo >>= bit_size + lshift;                    \
>>>>> +    nval = new_val;                            \
>>>>> +    nval <<= lshift;                        \
>>>>> +    nval >>= rshift;                        \
>>>>> +    val = hi | nval | lo;                        \
>>>>> +    switch (byte_size) {                        \
>>>>> +    case 1: *(unsigned char *)p      = val; break;            \
>>>>> +    case 2: *(unsigned short *)p     = val; break;            \
>>>>> +    case 4: *(unsigned int *)p       = val; break;            \
>>>>> +    case 8: *(unsigned long long *)p = val; break;            \
>>>>> +    }                                \
>>>>> +})
>>>> I think this should be put in libbpf public header files but not sure
>>>> where to put it. bpf_core_read.h although it is core write?
>>>>
>>>> But on the other hand, this is a uapi struct bitfield write,
>>>> strictly speaking, CORE write is really unnecessary here. It
>>>> would be great if we can relieve users from dealing with
>>>> such unnecessary CORE writes. In that sense, for this particular
>>>> case, I would prefer rewriting the code by using byte-level
>>>> stores...
>>> or preserve_static_offset to clearly mean to undo bitfield CORE ...
>> Ok, I will do byte-level rewrite for next revision.
> [...]
>
> This patch seems to work: https://pastes.dxuuu.xyz/0glrf9 .
>
> But I don't think it's very pretty. Also I'm seeing on the internet that
> people are saying the exact layout of bitfields is compiler dependent.

Any reference for this (exact layout of bitfields is compiler dependent)?

> So I am wondering if these byte sized writes are correct. For that
> matter, I am wondering how the GCC generated bitfield accesses line up
> with clang generated BPF bytecode. Or why uapi contains a bitfield.

One thing for sure is memory layout of bitfields should be the same
for both clang and gcc as it is determined by C standard. Register
representation and how to manipulate could be different for different
compilers.

>
> WDYT, should I send up v2 with this or should I do one of the other
> approaches in this thread?

Daniel, looking at your patch: since we need to do CORE_READ for
those bitfields anyway, I think Eduard's patch with
BPF_CORE_WRITE_BITFIELD does make sense, and it also makes the code
easy to understand. Could you take Eduard's patch for now?
Whether and where to put BPF_CORE_WRITE_BITFIELD macros
can be decided later.
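
For reference, the read-modify-write that a BPF_CORE_WRITE_BITFIELD-style
macro has to perform can be sketched in plain C without any CO-RE
relocations. This is only an illustration under stated assumptions: the
helper names are hypothetical and not part of libbpf, the bit offset and
width are passed in by hand instead of coming from __CORE_RELO(), and the
shift values assume the little-endian convention
lshift = 64 - (bit_off + bit_size), rshift = 64 - bit_size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of libbpf.
 * Preconditions: 1 <= bit_size and bit_off + bit_size <= 64. */

/* Mask with the low bit_size bits set. */
uint64_t field_mask(unsigned bit_size)
{
	return bit_size >= 64 ? ~0ULL : (1ULL << bit_size) - 1;
}

/* Replace the field at [bit_off, bit_off + bit_size) inside a storage
 * unit, leaving all neighbouring bits untouched. */
uint64_t bitfield_store(uint64_t unit, unsigned bit_off,
			unsigned bit_size, uint64_t new_val)
{
	uint64_t mask = field_mask(bit_size) << bit_off;

	return (unit & ~mask) | ((new_val << bit_off) & mask);
}

/* Read the field back with a mask. */
uint64_t bitfield_load_mask(uint64_t unit, unsigned bit_off, unsigned bit_size)
{
	return (unit >> bit_off) & field_mask(bit_size);
}

/* Read the field back with a left/right shift pair, using the assumed
 * little-endian convention stated above (unsigned field). */
uint64_t bitfield_load_shift(uint64_t unit, unsigned bit_off, unsigned bit_size)
{
	unsigned lshift = 64 - (bit_off + bit_size);
	unsigned rshift = 64 - bit_size;

	return (unit << lshift) >> rshift;
}

/* Small self-check: a 3-bit field at bit offset 4. */
void bitfield_selfcheck(void)
{
	uint64_t unit = bitfield_store(0xff, 4, 3, 0x2);

	assert(bitfield_load_mask(unit, 4, 3) == 0x2);
	assert(bitfield_load_shift(unit, 4, 3) == 0x2);
	/* Bits outside the field keep their old value. */
	assert(bitfield_load_mask(unit, 0, 4) == 0xf);
	assert(bitfield_load_mask(unit, 7, 1) == 0x1);
}
```

Under the assumed convention, rshift - lshift works out to the field's
bit offset within the storage unit, which is the amount a new value has
to be shifted left before it is merged in.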

>
> I am ok with any of the approaches.
>
> Thanks,
> Daniel
>

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-28  4:06                       ` Yonghong Song
@ 2023-11-28 16:02                         ` Andrii Nakryiko
  2023-11-28 16:13                         ` Daniel Xu
  1 sibling, 0 replies; 33+ messages in thread
From: Andrii Nakryiko @ 2023-11-28 16:02 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Daniel Xu, Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Mon, Nov 27, 2023 at 8:06 PM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 11/27/23 7:01 PM, Daniel Xu wrote:
> > On Mon, Nov 27, 2023 at 02:45:11PM -0600, Daniel Xu wrote:
> >> On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
> >>> On 11/27/23 12:44 AM, Yonghong Song wrote:
> >>>> On 11/26/23 8:52 PM, Eduard Zingerman wrote:
> >>>>> On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
> >>>>> [...]
> >>>>>>> Tbh I'm not sure. This test passes with preserve_static_offset
> >>>>>>> because it suppresses preserve_access_index. In general clang
> >>>>>>> translates bitfield access to a set of IR statements like:
> >>>>>>>
> >>>>>>>     C:
> >>>>>>>       struct foo {
> >>>>>>>         unsigned _;
> >>>>>>>         unsigned a:1;
> >>>>>>>         ...
> >>>>>>>       };
> >>>>>>>       ... foo->a ...
> >>>>>>>
> >>>>>>>     IR:
> >>>>>>>       %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> >>>>>>>       %bf.load = load i8, ptr %a, align 4
> >>>>>>>       %bf.clear = and i8 %bf.load, 1
> >>>>>>>       %bf.cast = zext i8 %bf.clear to i32
> >>>>>>>
> >>>>>>> With preserve_static_offset the getelementptr+load are replaced by a
> >>>>>>> single statement which is preserved as-is till code generation,
> >>>>>>> thus load with align 4 is preserved.
> >>>>>>>
> >>>>>>> On the other hand, I'm not sure that clang guarantees that load or
> >>>>>>> stores used for bitfield access would be always aligned according to
> >>>>>>> verifier expectations.
> >>>>>>>
> >>>>>>> I think we should check if there are some clang knobs that prevent
> >>>>>>> generation of unaligned memory access. I'll take a look.
> >>>>>> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> >>>>>> but the downside to compiler fix is it takes years to propagate and
> >>>>>> sprinkles ifdefs into the code.
> >>>>>>
> >>>>>> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
> >>>>> Well, the contraption below passes verification, tunnel selftest
> >>>>> appears to work. I might have messed up some shifts in the macro,
> >>>>> though.
> >>>> I didn't test it. But from high level it should work.
> >>>>
> >>>>> Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a particular
> >>>>> field, the access might be unaligned.
> >>>> clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
> >>>> alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
> >>>>
> >>>>> ---
> >>>>>
> >>>>> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> >>>>> b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> >>>>> index 3065a716544d..41cd913ac7ff 100644
> >>>>> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> >>>>> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> >>>>> @@ -9,6 +9,7 @@
> >>>>>    #include "vmlinux.h"
> >>>>>    #include <bpf/bpf_helpers.h>
> >>>>>    #include <bpf/bpf_endian.h>
> >>>>> +#include <bpf/bpf_core_read.h>
> >>>>>    #include "bpf_kfuncs.h"
> >>>>>    #include "bpf_tracing_net.h"
> >>>>>    @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
> >>>>>        return TC_ACT_OK;
> >>>>>    }
> >>>>>    +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
> >>>>> +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
> >>>>> +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
> >>>>> +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
> >>>>> +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
> >>>>> +    unsigned bit_size = (rshift - lshift);                \
> >>>>> +    unsigned long long nval, val, hi, lo;                \
> >>>>> +                                    \
> >>>>> +    asm volatile("" : "=r"(p) : "0"(p));                \
> >>>> Use asm volatile("" : "+r"(p)) ?
> >>>>
> >>>>> +                                    \
> >>>>> +    switch (byte_size) {                        \
> >>>>> +    case 1: val = *(unsigned char *)p; break;            \
> >>>>> +    case 2: val = *(unsigned short *)p; break;            \
> >>>>> +    case 4: val = *(unsigned int *)p; break;            \
> >>>>> +    case 8: val = *(unsigned long long *)p; break;            \
> >>>>> +    }                                \
> >>>>> +    hi = val >> (bit_size + rshift);                \
> >>>>> +    hi <<= bit_size + rshift;                    \
> >>>>> +    lo = val << (bit_size + lshift);                \
> >>>>> +    lo >>= bit_size + lshift;                    \
> >>>>> +    nval = new_val;                            \
> >>>>> +    nval <<= lshift;                        \
> >>>>> +    nval >>= rshift;                        \
> >>>>> +    val = hi | nval | lo;                        \
> >>>>> +    switch (byte_size) {                        \
> >>>>> +    case 1: *(unsigned char *)p      = val; break;            \
> >>>>> +    case 2: *(unsigned short *)p     = val; break;            \
> >>>>> +    case 4: *(unsigned int *)p       = val; break;            \
> >>>>> +    case 8: *(unsigned long long *)p = val; break;            \
> >>>>> +    }                                \
> >>>>> +})
> >>>> I think this should be put in libbpf public header files but not sure
> >>>> where to put it. bpf_core_read.h although it is core write?
> >>>>
> >>>> But on the other hand, this is a uapi struct bitfield write,
> >>>> strictly speaking, CORE write is really unnecessary here. It
> >>>> would be great if we can relieve users from dealing with
> >>>> such unnecessary CORE writes. In that sense, for this particular
> >>>> case, I would prefer rewriting the code by using byte-level
> >>>> stores...
> >>> or preserve_static_offset to clearly mean to undo bitfield CORE ...
> >> Ok, I will do byte-level rewrite for next revision.
> > [...]
> >
> > This patch seems to work: https://pastes.dxuuu.xyz/0glrf9 .
> >
> > But I don't think it's very pretty. Also I'm seeing on the internet that
> > people are saying the exact layout of bitfields is compiler dependent.
>
> Any reference for this (exact layout of bitfields is compiler dependent)?
>
> > So I am wondering if these byte sized writes are correct. For that
> > matter, I am wondering how the GCC generated bitfield accesses line up
> > with clang generated BPF bytecode. Or why uapi contains a bitfield.
>
> One thing for sure is memory layout of bitfields should be the same
> for both clang and gcc as it is determined by C standard. Register
> representation and how to manipulate could be different for different
> compilers.
>
> >
> > WDYT, should I send up v2 with this or should I do one of the other
> > approaches in this thread?
>
> Daniel, look at your patch, since we need to do CORE_READ for
> those bitfields any way, I think Eduard's patch with
> BPF_CORE_WRITE_BITFIELD does make sense and it also makes code
> easy to understand. Could you take Eduard's patch for now?
> Whether and where to put BPF_CORE_WRITE_BITFIELD macros
> can be decided later.

bpf_core_read.h name is... let's say "historical" and was never meant
to limit stuff there to read-only or anything like that. Think about
it as just bpf_core.h where all the CO-RE-related stuff goes. So
please put BPF_CORE_WRITE_BITFIELD there.

>
> >
> > I am ok with any of the approaches.
> >
> > Thanks,
> > Daniel
> >

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-28  4:06                       ` Yonghong Song
  2023-11-28 16:02                         ` Andrii Nakryiko
@ 2023-11-28 16:13                         ` Daniel Xu
  2023-11-28 16:17                           ` Daniel Xu
  2023-11-28 16:19                           ` Eduard Zingerman
  1 sibling, 2 replies; 33+ messages in thread
From: Daniel Xu @ 2023-11-28 16:13 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Mon, Nov 27, 2023 at 08:06:01PM -0800, Yonghong Song wrote:
> 
> On 11/27/23 7:01 PM, Daniel Xu wrote:
> > On Mon, Nov 27, 2023 at 02:45:11PM -0600, Daniel Xu wrote:
> > > On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
> > > > On 11/27/23 12:44 AM, Yonghong Song wrote:
> > > > > On 11/26/23 8:52 PM, Eduard Zingerman wrote:
> > > > > > On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
> > > > > > [...]
> > > > > > > > Tbh I'm not sure. This test passes with preserve_static_offset
> > > > > > > > because it suppresses preserve_access_index. In general clang
> > > > > > > > translates bitfield access to a set of IR statements like:
> > > > > > > > 
> > > > > > > >     C:
> > > > > > > >       struct foo {
> > > > > > > >         unsigned _;
> > > > > > > >         unsigned a:1;
> > > > > > > >         ...
> > > > > > > >       };
> > > > > > > >       ... foo->a ...
> > > > > > > > 
> > > > > > > >     IR:
> > > > > > > >       %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> > > > > > > >       %bf.load = load i8, ptr %a, align 4
> > > > > > > >       %bf.clear = and i8 %bf.load, 1
> > > > > > > >       %bf.cast = zext i8 %bf.clear to i32
> > > > > > > > 
> > > > > > > > With preserve_static_offset the getelementptr+load are replaced by a
> > > > > > > > single statement which is preserved as-is till code generation,
> > > > > > > > thus load with align 4 is preserved.
> > > > > > > > 
> > > > > > > > On the other hand, I'm not sure that clang guarantees that load or
> > > > > > > > stores used for bitfield access would be always aligned according to
> > > > > > > > verifier expectations.
> > > > > > > > 
> > > > > > > > I think we should check if there are some clang knobs that prevent
> > > > > > > > generation of unaligned memory access. I'll take a look.
> > > > > > > Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> > > > > > > but the downside to compiler fix is it takes years to propagate and
> > > > > > > sprinkles ifdefs into the code.
> > > > > > > 
> > > > > > > Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
> > > > > > Well, the contraption below passes verification, tunnel selftest
> > > > > > appears to work. I might have messed up some shifts in the macro,
> > > > > > though.
> > > > > I didn't test it. But from high level it should work.
> > > > > 
> > > > > > > Still, if clang were to pick an unlucky BYTE_{OFFSET,SIZE} for a particular
> > > > > > > field, the access might be unaligned.
> > > > > clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
> > > > > alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
> > > > > 
> > > > > > ---
> > > > > > 
> > > > > > diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > index 3065a716544d..41cd913ac7ff 100644
> > > > > > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > @@ -9,6 +9,7 @@
> > > > > >    #include "vmlinux.h"
> > > > > >    #include <bpf/bpf_helpers.h>
> > > > > >    #include <bpf/bpf_endian.h>
> > > > > > +#include <bpf/bpf_core_read.h>
> > > > > >    #include "bpf_kfuncs.h"
> > > > > >    #include "bpf_tracing_net.h"
> > > > > >    @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
> > > > > >        return TC_ACT_OK;
> > > > > >    }
> > > > > >    +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
> > > > > > +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
> > > > > > +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
> > > > > > +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
> > > > > > +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
> > > > > > +    unsigned bit_size = (rshift - lshift);                \
> > > > > > +    unsigned long long nval, val, hi, lo;                \
> > > > > > +                                    \
> > > > > > +    asm volatile("" : "=r"(p) : "0"(p));                \
> > > > > Use asm volatile("" : "+r"(p)) ?
> > > > > 
> > > > > > +                                    \
> > > > > > +    switch (byte_size) {                        \
> > > > > > +    case 1: val = *(unsigned char *)p; break;            \
> > > > > > +    case 2: val = *(unsigned short *)p; break;            \
> > > > > > +    case 4: val = *(unsigned int *)p; break;            \
> > > > > > +    case 8: val = *(unsigned long long *)p; break;            \
> > > > > > +    }                                \
> > > > > > +    hi = val >> (bit_size + rshift);                \
> > > > > > +    hi <<= bit_size + rshift;                    \
> > > > > > +    lo = val << (bit_size + lshift);                \
> > > > > > +    lo >>= bit_size + lshift;                    \
> > > > > > +    nval = new_val;                            \
> > > > > > +    nval <<= lshift;                        \
> > > > > > +    nval >>= rshift;                        \
> > > > > > +    val = hi | nval | lo;                        \
> > > > > > +    switch (byte_size) {                        \
> > > > > > +    case 1: *(unsigned char *)p      = val; break;            \
> > > > > > +    case 2: *(unsigned short *)p     = val; break;            \
> > > > > > +    case 4: *(unsigned int *)p       = val; break;            \
> > > > > > +    case 8: *(unsigned long long *)p = val; break;            \
> > > > > > +    }                                \
> > > > > > +})
> > > > > I think this should be put in libbpf public header files but not sure
> > > > > where to put it. bpf_core_read.h although it is core write?
> > > > > 
> > > > > But on the other hand, this is a uapi struct bitfield write,
> > > > > strictly speaking, CORE write is really unnecessary here. It
> > > > > would be great if we can relieve users from dealing with
> > > > > such unnecessary CORE writes. In that sense, for this particular
> > > > > case, I would prefer rewriting the code by using byte-level
> > > > > stores...
> > > > or preserve_static_offset to clearly mean to undo bitfield CORE ...
> > > Ok, I will do byte-level rewrite for next revision.
> > [...]
> > 
> > This patch seems to work: https://pastes.dxuuu.xyz/0glrf9 .
> > 
> > But I don't think it's very pretty. Also I'm seeing on the internet that
> > people are saying the exact layout of bitfields is compiler dependent.
> 
> Any reference for this (exact layout of bitfields is compiler dependent)?
> 
> > So I am wondering if these byte sized writes are correct. For that
> > matter, I am wondering how the GCC generated bitfield accesses line up
> > with clang generated BPF bytecode. Or why uapi contains a bitfield.
> 
> One thing for sure is memory layout of bitfields should be the same
> for both clang and gcc as it is determined by C standard. Register
> representation and how to manipulate could be different for different
> compilers.

I was reading this thread:
https://github.com/Lora-net/LoRaMac-node/issues/697. It's obviously not
authoritative, but they sure sound confident!

I think I've also heard it before a long time ago when I was working on
adding bitfield support to bpftrace.
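
On the layout question, for what it's worth: C11 6.7.2.1 leaves the
order in which bit-fields are allocated within a storage unit
implementation-defined. In practice the target's ABI pins it down, which
is why two compilers targeting the same ABI are expected to agree. A
small probe like the one below can show what a given compiler and target
actually do. It is a sketch: struct flags and the function names are
made up for illustration, and the two byte values it recognises are what
the two common allocation orders would produce, not an exhaustive list.

```c
#include <assert.h>
#include <string.h>

/* Example struct for illustration only; not taken from any kernel header. */
struct flags {
	unsigned a:1;
	unsigned b:3;
	unsigned c:4;
};

enum bit_order { LOW_BITS_FIRST, HIGH_BITS_FIRST, OTHER_ORDER };

/* Set only field b, then return the first byte of the object
 * representation so the caller can see where b landed. */
unsigned char first_byte_with_b_set(void)
{
	struct flags f, g;
	unsigned char raw[sizeof(f)];

	memset(&f, 0, sizeof(f));
	f.b = 7;
	memcpy(raw, &f, sizeof(f));

	/* Copying the bytes back must reproduce the same field values. */
	memcpy(&g, raw, sizeof(g));
	assert(g.a == 0 && g.b == 7 && g.c == 0);

	return raw[0];
}

/* Classify the allocation order from the probed byte. */
enum bit_order classify_bit_order(void)
{
	switch (first_byte_with_b_set()) {
	case 0x0e: return LOW_BITS_FIRST;  /* a in bit 0, b in bits 1..3 */
	case 0x70: return HIGH_BITS_FIRST; /* a in bit 7, b in bits 4..6 */
	default:   return OTHER_ORDER;     /* some other ABI-specific layout */
	}
}
```

Building the same probe with gcc and clang for the same target and
comparing the results is a quick way to check whether byte-sized stores
into a bitfield would agree between the two.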


[...]

* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-28 16:13                         ` Daniel Xu
@ 2023-11-28 16:17                           ` Daniel Xu
  2023-11-28 16:56                             ` Yonghong Song
  2023-11-28 16:19                           ` Eduard Zingerman
  1 sibling, 1 reply; 33+ messages in thread
From: Daniel Xu @ 2023-11-28 16:17 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Tue, Nov 28, 2023 at 10:13:50AM -0600, Daniel Xu wrote:
> On Mon, Nov 27, 2023 at 08:06:01PM -0800, Yonghong Song wrote:
> > 
> > On 11/27/23 7:01 PM, Daniel Xu wrote:
> > > On Mon, Nov 27, 2023 at 02:45:11PM -0600, Daniel Xu wrote:
> > > > On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
> > > > > On 11/27/23 12:44 AM, Yonghong Song wrote:
> > > > > > On 11/26/23 8:52 PM, Eduard Zingerman wrote:
> > > > > > > On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
> > > > > > > [...]
> > > > > > > > > Tbh I'm not sure. This test passes with preserve_static_offset
> > > > > > > > > because it suppresses preserve_access_index. In general clang
> > > > > > > > > translates bitfield access to a set of IR statements like:
> > > > > > > > > 
> > > > > > > > >     C:
> > > > > > > > >       struct foo {
> > > > > > > > >         unsigned _;
> > > > > > > > >         unsigned a:1;
> > > > > > > > >         ...
> > > > > > > > >       };
> > > > > > > > >       ... foo->a ...
> > > > > > > > > 
> > > > > > > > >     IR:
> > > > > > > > >       %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
> > > > > > > > >       %bf.load = load i8, ptr %a, align 4
> > > > > > > > >       %bf.clear = and i8 %bf.load, 1
> > > > > > > > >       %bf.cast = zext i8 %bf.clear to i32
> > > > > > > > > 
> > > > > > > > > With preserve_static_offset the getelementptr+load are replaced by a
> > > > > > > > > single statement which is preserved as-is till code generation,
> > > > > > > > > thus load with align 4 is preserved.
> > > > > > > > > 
> > > > > > > > > On the other hand, I'm not sure that clang guarantees that load or
> > > > > > > > > stores used for bitfield access would be always aligned according to
> > > > > > > > > verifier expectations.
> > > > > > > > > 
> > > > > > > > > I think we should check if there are some clang knobs that prevent
> > > > > > > > > generation of unaligned memory access. I'll take a look.
> > > > > > > > Is there a reason to prefer fixing in compiler? I'm not opposed to it,
> > > > > > > > but the downside to compiler fix is it takes years to propagate and
> > > > > > > > sprinkles ifdefs into the code.
> > > > > > > > 
> > > > > > > > Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
> > > > > > > Well, the contraption below passes verification, tunnel selftest
> > > > > > > appears to work. I might have messed up some shifts in the macro,
> > > > > > > though.
> > > > > > I didn't test it. But from high level it should work.
> > > > > > 
> > > > > > > > Still, if clang picks an unlucky BYTE_{OFFSET,SIZE} for a particular
> > > > > > > > field, the access might be unaligned.
> > > > > > clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
> > > > > > alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
> > > > > > 
> > > > > > > ---
> > > > > > > 
> > > > > > > diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > > b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > > index 3065a716544d..41cd913ac7ff 100644
> > > > > > > --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > > +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
> > > > > > > @@ -9,6 +9,7 @@
> > > > > > >    #include "vmlinux.h"
> > > > > > >    #include <bpf/bpf_helpers.h>
> > > > > > >    #include <bpf/bpf_endian.h>
> > > > > > > +#include <bpf/bpf_core_read.h>
> > > > > > >    #include "bpf_kfuncs.h"
> > > > > > >    #include "bpf_tracing_net.h"
> > > > > > >    @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
> > > > > > >        return TC_ACT_OK;
> > > > > > >    }
> > > > > > >    +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
> > > > > > > +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
> > > > > > > +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
> > > > > > > +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
> > > > > > > +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
> > > > > > > +    unsigned bit_size = (rshift - lshift);                \
> > > > > > > +    unsigned long long nval, val, hi, lo;                \
> > > > > > > +                                    \
> > > > > > > +    asm volatile("" : "=r"(p) : "0"(p));                \
> > > > > > Use asm volatile("" : "+r"(p)) ?
> > > > > > 
> > > > > > > +                                    \
> > > > > > > +    switch (byte_size) {                        \
> > > > > > > +    case 1: val = *(unsigned char *)p; break;            \
> > > > > > > +    case 2: val = *(unsigned short *)p; break;            \
> > > > > > > +    case 4: val = *(unsigned int *)p; break;            \
> > > > > > > +    case 8: val = *(unsigned long long *)p; break;            \
> > > > > > > +    }                                \
> > > > > > > +    hi = val >> (bit_size + rshift);                \
> > > > > > > +    hi <<= bit_size + rshift;                    \
> > > > > > > +    lo = val << (bit_size + lshift);                \
> > > > > > > +    lo >>= bit_size + lshift;                    \
> > > > > > > +    nval = new_val;                            \
> > > > > > > +    nval <<= lshift;                        \
> > > > > > > +    nval >>= rshift;                        \
> > > > > > > +    val = hi | nval | lo;                        \
> > > > > > > +    switch (byte_size) {                        \
> > > > > > > +    case 1: *(unsigned char *)p      = val; break;            \
> > > > > > > +    case 2: *(unsigned short *)p     = val; break;            \
> > > > > > > +    case 4: *(unsigned int *)p       = val; break;            \
> > > > > > > +    case 8: *(unsigned long long *)p = val; break;            \
> > > > > > > +    }                                \
> > > > > > > +})
> > > > > > I think this should be put in libbpf public header files but not sure
> > > > > > where to put it. bpf_core_read.h although it is core write?
> > > > > > 
> > > > > > But on the other hand, this is a uapi struct bitfield write,
> > > > > > strictly speaking, CORE write is really unnecessary here. It
> > > > > > would be great if we can relieve users from dealing with
> > > > > > such unnecessary CORE writes. In that sense, for this particular
> > > > > > case, I would prefer rewriting the code by using byte-level
> > > > > > stores...
> > > > > or preserve_static_offset to clearly mean to undo bitfield CORE ...
> > > > Ok, I will do byte-level rewrite for next revision.
> > > [...]
> > > 
> > > This patch seems to work: https://pastes.dxuuu.xyz/0glrf9 .
> > > 
> > > But I don't think it's very pretty. Also I'm seeing on the internet that
> > > people are saying the exact layout of bitfields is compiler dependent.
> > 
> > Any reference for this (exact layout of bitfields is compiler dependent)?
> > 
> > > So I am wondering if these byte sized writes are correct. For that
> > > matter, I am wondering how the GCC generated bitfield accesses line up
> > > with clang generated BPF bytecode. Or why uapi contains a bitfield.
> > 
> > One thing for sure is memory layout of bitfields should be the same
> > for both clang and gcc as it is determined by C standard. Register
> > representation and how to manipulate could be different for different
> > compilers.
> 
> I was reading this thread:
> https://github.com/Lora-net/LoRaMac-node/issues/697. It's obviously not
> authoritative, but they sure sound confident!
> 
> I think I've also heard it before a long time ago when I was working on
> adding bitfield support to bpftrace.

Wikipedia [0] also claims this:

        The layout of bit fields in a C struct is
        implementation-defined. For behavior that remains predictable
        across compilers, it may be preferable to emulate bit fields
        with a primitive and bit operators:  

[0]: https://en.wikipedia.org/wiki/Bit_field#C_programming_language
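
As a rough illustration of the approach the quoted passage describes
(this is not code from the patchset), a byte holding two 4-bit fields
can be emulated with a plain uint8_t and explicit masks. The layout and
every hdr_* name below are hypothetical and chosen only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: high nibble = version, low nibble = length. */
#define HDR_VERSION_SHIFT 4
#define HDR_VERSION_MASK  0xF0u
#define HDR_LEN_MASK      0x0Fu

static inline uint8_t hdr_get_version(uint8_t b)
{
	return (uint8_t)((b & HDR_VERSION_MASK) >> HDR_VERSION_SHIFT);
}

static inline uint8_t hdr_get_len(uint8_t b)
{
	return (uint8_t)(b & HDR_LEN_MASK);
}

/* Return b with the version nibble replaced; the length nibble is kept. */
static inline uint8_t hdr_set_version(uint8_t b, uint8_t version)
{
	assert(version <= 0xF);
	return (uint8_t)((b & ~HDR_VERSION_MASK) |
			 ((unsigned int)version << HDR_VERSION_SHIFT));
}

/* Return b with the length nibble replaced; the version nibble is kept. */
static inline uint8_t hdr_set_len(uint8_t b, uint8_t len)
{
	assert(len <= 0xF);
	return (uint8_t)((b & ~HDR_LEN_MASK) | len);
}
```

Because the shifts and masks are spelled out in the source, the
position of each field within the byte no longer depends on how a
compiler allocates bit-fields.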


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-28 16:13                         ` Daniel Xu
  2023-11-28 16:17                           ` Daniel Xu
@ 2023-11-28 16:19                           ` Eduard Zingerman
  1 sibling, 0 replies; 33+ messages in thread
From: Eduard Zingerman @ 2023-11-28 16:19 UTC (permalink / raw)
  To: Daniel Xu, Yonghong Song
  Cc: Alexei Starovoitov, Shuah Khan, Daniel Borkmann, Andrii Nakryiko,
	Alexei Starovoitov, Steffen Klassert, antony.antony,
	Mykola Lysenko, Martin KaFai Lau, Song Liu, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development

On Tue, 2023-11-28 at 10:13 -0600, Daniel Xu wrote:
[...]
> > One thing for sure is memory layout of bitfields should be the same
> > for both clang and gcc as it is determined by C standard. Register
> > representation and how to manipulate could be different for different
> > compilers.
> 
> I was reading this thread:
> https://github.com/Lora-net/LoRaMac-node/issues/697. It's obviously not
> authoritative, but they sure sound confident!
> 
> I think I've also heard it before a long time ago when I was working on
> adding bitfield support to bpftrace.
> 
> 
> [...]

Here is a citation from ISO/IEC 9899:201x (C11 standard) §6.7.2.1
(Structure and union specifiers), paragraph 11 (page 114 in my pdf):

> An implementation may allocate any addressable storage unit large
> enough to hold a bit-field. If enough space remains, a bit-field
> that immediately follows another bit-field in a structure shall be
> packed into adjacent bits of the same unit. If insufficient space
> remains, whether a bit-field that does not fit is put into the next
> unit or overlaps adjacent units is implementation-defined. The order
> of allocation of bit-fields within a unit (high-order to low-order
> or low-order to high-order) is implementation-defined. The alignment
> of the addressable storage unit is unspecified.
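
Since the allocation order is implementation-defined, one way to see
what a given compiler actually does is to inspect the raw bytes of a
struct. The following is a minimal sketch, assuming 8-bit bytes and
that padding bits stay zero after the memset; struct probe and
probe_a_bit_index() are hypothetical names, not anything from the
kernel tree.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

static_assert(CHAR_BIT == 8, "sketch assumes 8-bit bytes");

/* Hypothetical struct used only to observe the compiler's layout. */
struct probe {
	unsigned int a : 1;
	unsigned int b : 3;
	unsigned int c : 4;
};

/*
 * Return the position (byte * 8 + bit, where bit 0 is the least
 * significant bit of that byte) of the single bit that stores field
 * `a`, or -1 if the object image does not contain exactly one set bit.
 */
static int probe_a_bit_index(void)
{
	struct probe p;
	unsigned char raw[sizeof(p)];
	int found = -1;

	memset(&p, 0, sizeof(p));
	p.a = 1;
	memcpy(raw, &p, sizeof(p));

	for (size_t i = 0; i < sizeof(raw); i++) {
		for (int bit = 0; bit < 8; bit++) {
			if (!(raw[i] & (1u << bit)))
				continue;
			if (found != -1)
				return -1; /* more than one bit set */
			found = (int)(i * 8) + bit;
		}
	}
	return found;
}
```

The returned index is deliberately not stated here: it depends on the
target ABI and byte order, which is the point of the paragraph quoted
above.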


* Re: [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations
  2023-11-28 16:17                           ` Daniel Xu
@ 2023-11-28 16:56                             ` Yonghong Song
  0 siblings, 0 replies; 33+ messages in thread
From: Yonghong Song @ 2023-11-28 16:56 UTC (permalink / raw)
  To: Daniel Xu
  Cc: Eduard Zingerman, Alexei Starovoitov, Shuah Khan,
	Daniel Borkmann, Andrii Nakryiko, Alexei Starovoitov,
	Steffen Klassert, antony.antony, Mykola Lysenko,
	Martin KaFai Lau, Song Liu, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	open list:KERNEL SELFTEST FRAMEWORK, LKML, devel,
	Network Development


On 11/28/23 11:17 AM, Daniel Xu wrote:
> On Tue, Nov 28, 2023 at 10:13:50AM -0600, Daniel Xu wrote:
>> On Mon, Nov 27, 2023 at 08:06:01PM -0800, Yonghong Song wrote:
>>> On 11/27/23 7:01 PM, Daniel Xu wrote:
>>>> On Mon, Nov 27, 2023 at 02:45:11PM -0600, Daniel Xu wrote:
>>>>> On Sun, Nov 26, 2023 at 09:53:04PM -0800, Yonghong Song wrote:
>>>>>> On 11/27/23 12:44 AM, Yonghong Song wrote:
>>>>>>> On 11/26/23 8:52 PM, Eduard Zingerman wrote:
>>>>>>>> On Sun, 2023-11-26 at 18:04 -0600, Daniel Xu wrote:
>>>>>>>> [...]
>>>>>>>>>> Tbh I'm not sure. This test passes with preserve_static_offset
>>>>>>>>>> because it suppresses preserve_access_index. In general clang
>>>>>>>>>> translates bitfield access to a set of IR statements like:
>>>>>>>>>>
>>>>>>>>>>      C:
>>>>>>>>>>        struct foo {
>>>>>>>>>>          unsigned _;
>>>>>>>>>>          unsigned a:1;
>>>>>>>>>>          ...
>>>>>>>>>>        };
>>>>>>>>>>        ... foo->a ...
>>>>>>>>>>
>>>>>>>>>>      IR:
>>>>>>>>>>        %a = getelementptr inbounds %struct.foo, ptr %0, i32 0, i32 1
>>>>>>>>>>        %bf.load = load i8, ptr %a, align 4
>>>>>>>>>>        %bf.clear = and i8 %bf.load, 1
>>>>>>>>>>        %bf.cast = zext i8 %bf.clear to i32
>>>>>>>>>>
>>>>>>>>>> With preserve_static_offset the getelementptr+load are replaced by a
>>>>>>>>>> single statement which is preserved as-is till code generation,
>>>>>>>>>> thus load with align 4 is preserved.
>>>>>>>>>>
>>>>>>>>>> On the other hand, I'm not sure that clang guarantees that load or
>>>>>>>>>> stores used for bitfield access would be always aligned according to
>>>>>>>>>> verifier expectations.
>>>>>>>>>>
>>>>>>>>>> I think we should check if there are some clang knobs that prevent
>>>>>>>>>> generation of unaligned memory access. I'll take a look.
>>>>>>>>> Is there a reason to prefer fixing in compiler? I'm not opposed to it,
>>>>>>>>> but the downside to compiler fix is it takes years to propagate and
>>>>>>>>> sprinkles ifdefs into the code.
>>>>>>>>>
>>>>>>>>> Would it be possible to have an analogue of BPF_CORE_READ_BITFIELD()?
>>>>>>>> Well, the contraption below passes verification, tunnel selftest
>>>>>>>> appears to work. I might have messed up some shifts in the macro,
>>>>>>>> though.
>>>>>>> I didn't test it. But from high level it should work.
>>>>>>>
>>>>>>>> Still, if clang picks an unlucky BYTE_{OFFSET,SIZE} for a particular
>>>>>>>> field, the access might be unaligned.
>>>>>>> clang should pick a sensible BYTE_SIZE/BYTE_OFFSET to meet
>>>>>>> alignment requirement. This is also required for BPF_CORE_READ_BITFIELD.
>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> diff --git a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>>>>> b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>>>>> index 3065a716544d..41cd913ac7ff 100644
>>>>>>>> --- a/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>>>>> +++ b/tools/testing/selftests/bpf/progs/test_tunnel_kern.c
>>>>>>>> @@ -9,6 +9,7 @@
>>>>>>>>     #include "vmlinux.h"
>>>>>>>>     #include <bpf/bpf_helpers.h>
>>>>>>>>     #include <bpf/bpf_endian.h>
>>>>>>>> +#include <bpf/bpf_core_read.h>
>>>>>>>>     #include "bpf_kfuncs.h"
>>>>>>>>     #include "bpf_tracing_net.h"
>>>>>>>>     @@ -144,6 +145,38 @@ int ip6gretap_get_tunnel(struct __sk_buff *skb)
>>>>>>>>         return TC_ACT_OK;
>>>>>>>>     }
>>>>>>>>     +#define BPF_CORE_WRITE_BITFIELD(s, field, new_val) ({            \
>>>>>>>> +    void *p = (void *)s + __CORE_RELO(s, field, BYTE_OFFSET);    \
>>>>>>>> +    unsigned byte_size = __CORE_RELO(s, field, BYTE_SIZE);        \
>>>>>>>> +    unsigned lshift = __CORE_RELO(s, field, LSHIFT_U64); \
>>>>>>>> +    unsigned rshift = __CORE_RELO(s, field, RSHIFT_U64); \
>>>>>>>> +    unsigned bit_size = (rshift - lshift);                \
>>>>>>>> +    unsigned long long nval, val, hi, lo;                \
>>>>>>>> +                                    \
>>>>>>>> +    asm volatile("" : "=r"(p) : "0"(p));                \
>>>>>>> Use asm volatile("" : "+r"(p)) ?
>>>>>>>
>>>>>>>> +                                    \
>>>>>>>> +    switch (byte_size) {                        \
>>>>>>>> +    case 1: val = *(unsigned char *)p; break;            \
>>>>>>>> +    case 2: val = *(unsigned short *)p; break;            \
>>>>>>>> +    case 4: val = *(unsigned int *)p; break;            \
>>>>>>>> +    case 8: val = *(unsigned long long *)p; break;            \
>>>>>>>> +    }                                \
>>>>>>>> +    hi = val >> (bit_size + rshift);                \
>>>>>>>> +    hi <<= bit_size + rshift;                    \
>>>>>>>> +    lo = val << (bit_size + lshift);                \
>>>>>>>> +    lo >>= bit_size + lshift;                    \
>>>>>>>> +    nval = new_val;                            \
>>>>>>>> +    nval <<= lshift;                        \
>>>>>>>> +    nval >>= rshift;                        \
>>>>>>>> +    val = hi | nval | lo;                        \
>>>>>>>> +    switch (byte_size) {                        \
>>>>>>>> +    case 1: *(unsigned char *)p      = val; break;            \
>>>>>>>> +    case 2: *(unsigned short *)p     = val; break;            \
>>>>>>>> +    case 4: *(unsigned int *)p       = val; break;            \
>>>>>>>> +    case 8: *(unsigned long long *)p = val; break;            \
>>>>>>>> +    }                                \
>>>>>>>> +})
>>>>>>> I think this should be put in libbpf public header files but not sure
>>>>>>> where to put it. bpf_core_read.h although it is core write?
>>>>>>>
>>>>>>> But on the other hand, this is a uapi struct bitfield write,
>>>>>>> strictly speaking, CORE write is really unnecessary here. It
>>>>>>> would be great if we can relieve users from dealing with
>>>>>>> such unnecessary CORE writes. In that sense, for this particular
>>>>>>> case, I would prefer rewriting the code by using byte-level
>>>>>>> stores...
>>>>>> or preserve_static_offset to clearly mean to undo bitfield CORE ...
>>>>> Ok, I will do byte-level rewrite for next revision.
>>>> [...]
>>>>
>>>> This patch seems to work: https://pastes.dxuuu.xyz/0glrf9 .
>>>>
>>>> But I don't think it's very pretty. Also I'm seeing on the internet that
>>>> people are saying the exact layout of bitfields is compiler dependent.
>>> Any reference for this (exact layout of bitfields is compiler dependent)?
>>>
>>>> So I am wondering if these byte sized writes are correct. For that
>>>> matter, I am wondering how the GCC generated bitfield accesses line up
>>>> with clang generated BPF bytecode. Or why uapi contains a bitfield.
>>> One thing for sure is memory layout of bitfields should be the same
>>> for both clang and gcc as it is determined by C standard. Register
>>> representation and how to manipulate could be different for different
>>> compilers.
>> I was reading this thread:
>> https://github.com/Lora-net/LoRaMac-node/issues/697. It's obviously not
>> authoritative, but they sure sound confident!
>>
>> I think I've also heard it before a long time ago when I was working on
>> adding bitfield support to bpftrace.
> Wikipedia [0] also claims this:
>
>          The layout of bit fields in a C struct is
>          implementation-defined. For behavior that remains predictable
>          across compilers, it may be preferable to emulate bit fields
>          with a primitive and bit operators:
>
> [0]: https://en.wikipedia.org/wiki/Bit_field#C_programming_language

Thanks for the information. I was truly not aware that bit field
layout could differ between compilers. Does this mean source-level
bitfield manipulation may not work?

It is okay for uapi to have bitfields. The compiler should do the
right thing when loading and storing bitfields. Also, the networking
bitfields correspond to the memory layout transferred on the wire,
so their memory layout is fixed (although the little/big endian
interpretation differs).

BPF_CORE_WRITE_BITFIELD 'should' also be okay since the offset, size,
etc. are obtained from compiler internals (from DWARF, to be more
precise).

So it looks like BPF_CORE_WRITE_BITFIELD is the way to go.
Please use it then.
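
For readers following along, here is a minimal sketch of the
read-modify-write that a bitfield store reduces to once the field's
position is known. It is plain C rather than the macro quoted above;
bit_off and width are stand-ins for the values the compiler would
supply, and the bitfield_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with `width` ones starting at bit `bit_off` (bit 0 = LSB). */
static uint64_t bitfield_mask(unsigned int bit_off, unsigned int width)
{
	assert(width >= 1 && width <= 64 && bit_off <= 64 - width);
	return (~0ULL >> (64 - width)) << bit_off;
}

/* Extract the field from a storage unit already loaded into a u64. */
static uint64_t bitfield_read(uint64_t unit, unsigned int bit_off,
			      unsigned int width)
{
	return (unit & bitfield_mask(bit_off, width)) >> bit_off;
}

/*
 * Return `unit` with the field replaced by new_val. Bits outside the
 * field are preserved; bits of new_val that do not fit are dropped.
 */
static uint64_t bitfield_write(uint64_t unit, unsigned int bit_off,
			       unsigned int width, uint64_t new_val)
{
	uint64_t mask = bitfield_mask(bit_off, width);

	return (unit & ~mask) | ((new_val << bit_off) & mask);
}
```

Clearing the field with the inverted mask before OR-ing in the new
value is what keeps neighbouring fields in the same storage unit
intact.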





end of thread, other threads:[~2023-11-28 16:57 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
2023-11-22 18:20 [PATCH ipsec-next v1 0/7] Add bpf_xdp_get_xfrm_state() kfunc Daniel Xu
2023-11-22 18:20 ` [PATCH ipsec-next v1 1/7] bpf: xfrm: " Daniel Xu
2023-11-22 23:26   ` Alexei Starovoitov
2023-11-25 20:36   ` Yonghong Song
2023-11-26  4:38     ` Daniel Xu
2023-11-22 18:20 ` [PATCH ipsec-next v1 2/7] bpf: xfrm: Add bpf_xdp_xfrm_state_release() kfunc Daniel Xu
2023-11-22 18:20 ` [PATCH ipsec-next v1 3/7] bpf: selftests: test_tunnel: Use ping -6 over ping6 Daniel Xu
2023-11-22 18:20 ` [PATCH ipsec-next v1 4/7] bpf: selftests: test_tunnel: Mount bpffs if necessary Daniel Xu
2023-11-22 18:20 ` [PATCH ipsec-next v1 5/7] bpf: selftests: test_tunnel: Use vmlinux.h declarations Daniel Xu
2023-11-26  0:34   ` Yonghong Song
2023-11-26  4:34     ` Daniel Xu
2023-11-22 18:20 ` [PATCH ipsec-next v1 6/7] bpf: selftests: test_tunnel: Disable CO-RE relocations Daniel Xu
2023-11-26  0:51   ` Yonghong Song
2023-11-26  0:54     ` Alexei Starovoitov
2023-11-26  4:22       ` Yonghong Song
2023-11-26 20:14         ` Eduard Zingerman
2023-11-27  0:04           ` Daniel Xu
2023-11-27  1:52             ` Eduard Zingerman
2023-11-27  5:44               ` Yonghong Song
2023-11-27  5:53                 ` Yonghong Song
2023-11-27 20:45                   ` Daniel Xu
2023-11-27 21:32                     ` Eduard Zingerman
2023-11-28  0:01                     ` Daniel Xu
2023-11-28  4:06                       ` Yonghong Song
2023-11-28 16:02                         ` Andrii Nakryiko
2023-11-28 16:13                         ` Daniel Xu
2023-11-28 16:17                           ` Daniel Xu
2023-11-28 16:56                             ` Yonghong Song
2023-11-28 16:19                           ` Eduard Zingerman
2023-11-27  5:20           ` Yonghong Song
2023-11-22 18:20 ` [PATCH ipsec-next v1 7/7] bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state() Daniel Xu
2023-11-22 23:28   ` Alexei Starovoitov
2023-11-24 20:59     ` Daniel Xu
