* [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs
From: Joanne Koong @ 2022-08-22 23:56 UTC
  To: bpf; +Cc: andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

This patchset is the 2nd in the dynptr series. The 1st can be found here [0].

This patchset adds skb and xdp type dynptrs, which have two main benefits for
packet parsing:
    * allowing operations on sizes that are not statically known at
      compile-time (eg variable-sized accesses).
    * more ergonomic and less brittle iteration through data (eg no
      manual if-checks against data_end are needed)

For the simpler cases, there is no noticeable difference in runtime
between packet parsing with and without dynptrs. For the more complex
cases, where lengths are not statically known at compile time, dynptrs
can yield a significant speed-up (eg a 2x speed-up for cls redirection).
Patch 3 contains more details as well as examples of how to use skb and
xdp dynptrs.
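
As a rough sketch of the pattern (illustrative only, includes elided;
the helper signatures are the ones added in patches 1 and 2):

SEC("tc")
int parse_eth(struct __sk_buff *skb)
{
	struct bpf_dynptr ptr;
	struct ethhdr eth;

	if (bpf_dynptr_from_skb(skb, 0, &ptr))
		return TC_ACT_SHOT;

	/* bounds are checked by the helper, no manual comparison
	 * against skb->data_end is needed
	 */
	if (bpf_dynptr_read(&eth, sizeof(eth), &ptr, 0, 0))
		return TC_ACT_SHOT;

	return eth.h_proto == bpf_htons(ETH_P_IP) ? TC_ACT_OK : TC_ACT_SHOT;
}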

[0] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/

--
Changelog:

v3 = https://lore.kernel.org/bpf/20220822193442.657638-1-joannelkoong@gmail.com/
v3 -> v4
    * Forgot to commit --amend the kernel test robot error fixups

v2 = https://lore.kernel.org/bpf/20220811230501.2632393-1-joannelkoong@gmail.com/
v2 -> v3
    * Fix kernel test robot build test errors

v1 = https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/
v1 -> v2
  * Return data slices to rd-only skb dynptrs (Martin)
  * bpf_dynptr_write allows writes to frags for skb dynptrs, but always
    invalidates associated data slices (Martin)
  * Use switch casing instead of ifs (Andrii)
  * Use 0xFD for experimental kind number in the selftest (Zvi)
  * Put selftest conversions w/ dynptrs into new files (Alexei)
  * Add new selftest "test_cls_redirect_dynptr.c" 

Joanne Koong (3):
  bpf: Add skb dynptrs
  bpf: Add xdp dynptrs
  selftests/bpf: tests for using dynptrs to parse skb and xdp buffers

 include/linux/bpf.h                           |  87 +-
 include/linux/filter.h                        |   7 +
 include/uapi/linux/bpf.h                      |  59 +-
 kernel/bpf/helpers.c                          |  93 +-
 kernel/bpf/verifier.c                         | 105 +-
 net/core/filter.c                             |  99 +-
 tools/include/uapi/linux/bpf.h                |  59 +-
 .../selftests/bpf/prog_tests/cls_redirect.c   |  25 +
 .../testing/selftests/bpf/prog_tests/dynptr.c | 132 ++-
 .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +
 .../bpf/prog_tests/parse_tcp_hdr_opt.c        |  85 ++
 .../selftests/bpf/prog_tests/xdp_attach.c     |   9 +-
 .../testing/selftests/bpf/progs/dynptr_fail.c | 111 ++
 .../selftests/bpf/progs/dynptr_success.c      |  29 +
 .../bpf/progs/test_cls_redirect_dynptr.c      | 968 ++++++++++++++++++
 .../bpf/progs/test_l4lb_noinline_dynptr.c     | 468 +++++++++
 .../bpf/progs/test_parse_tcp_hdr_opt.c        | 119 +++
 .../bpf/progs/test_parse_tcp_hdr_opt_dynptr.c | 110 ++
 .../selftests/bpf/progs/test_xdp_dynptr.c     | 240 +++++
 .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
 20 files changed, 2693 insertions(+), 115 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_dynptr.c

-- 
2.30.2


* [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
From: Joanne Koong @ 2022-08-22 23:56 UTC
  To: bpf; +Cc: andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

Add skb dynptrs, which are dynptrs whose underlying pointer points
to an skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg no manual if-checks against data_end
are needed).

For bpf prog types that don't support writes on skb data, the dynptr is
read-only. Reads and writes through the bpf_dynptr_read() and
bpf_dynptr_write() interfaces support accessing data in the non-linear
paged buffers. For data slices (through the bpf_dynptr_data()
interface), if the data is in a paged buffer, the user must first call
bpf_skb_pull_data() to pull the data into the linear portion. The data
slice returned from a call to bpf_dynptr_data() is of reg type
PTR_TO_PACKET | PTR_MAYBE_NULL.
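
A minimal sketch of that pull-and-retry pattern, assuming a tc prog
where "ptr" is an already-initialized skb dynptr and "off" is a
hypothetical packet offset:

	__u8 *p;

	/* the len passed to bpf_dynptr_data() must be statically known */
	p = bpf_dynptr_data(&ptr, off, sizeof(struct tcphdr));
	if (!p) {
		/* the data likely sits in the paged area; pull it into
		 * the linear portion and try again (the pull invalidates
		 * existing data slices but not the dynptr itself)
		 */
		if (bpf_skb_pull_data(skb, off + sizeof(struct tcphdr)))
			return TC_ACT_SHOT;
		p = bpf_dynptr_data(&ptr, off, sizeof(struct tcphdr));
		if (!p)
			return TC_ACT_SHOT;
	}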

Any bpf_dynptr_write() automatically invalidates any prior data slices
to the skb dynptr. This is because a bpf_dynptr_write() may be writing
to data in a paged buffer, so it will need to pull the buffer first into
the head. The reason it needs to be pulled instead of writing directly to
the paged buffers is because they may be cloned (only the head of the skb
is by default uncloned). As such, any bpf_dynptr_write() will
automatically have its prior data slices invalidated, even if the write
is to data in the skb head (the verifier has no way of differentiating
whether the write is to the head or paged buffers during program load
time). Please note as well that any other helper call that changes the
underlying packet buffer (eg bpf_skb_pull_data()) also invalidates any
data slices of the skb dynptr. Whenever such a helper call is made,
the verifier marks any PTR_TO_PACKET reg type (which includes skb dynptr
slices, since they are PTR_TO_PACKETs) as unknown. The call path for
this is check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown().
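
For instance, the verifier rejects the following sketch ("ptr" is an
skb dynptr, "buf" a local 8-byte buffer):

	__u8 buf[8] = {};
	__u8 *p;

	p = bpf_dynptr_data(&ptr, 0, sizeof(buf));
	if (!p)
		return TC_ACT_SHOT;
	bpf_dynptr_write(&ptr, 0, buf, sizeof(buf), 0);
	/* p was invalidated by the write; this load is rejected with
	 * "invalid mem access 'scalar'"
	 */
	return p[0];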

For examples of how skb dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            | 83 +++++++++++++++++-----------
 include/linux/filter.h         |  4 ++
 include/uapi/linux/bpf.h       | 40 ++++++++++++--
 kernel/bpf/helpers.c           | 81 +++++++++++++++++++++++++---
 kernel/bpf/verifier.c          | 99 ++++++++++++++++++++++++++++------
 net/core/filter.c              | 53 ++++++++++++++++--
 tools/include/uapi/linux/bpf.h | 40 ++++++++++++--
 7 files changed, 335 insertions(+), 65 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 39bd36359c1e..30615d1a0c13 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -407,11 +407,14 @@ enum bpf_type_flag {
 	/* Size is known at compile time. */
 	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to sk_buff */
+	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -903,6 +906,36 @@ static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func(
 	return bpf_func(ctx, insnsi);
 }
 
+/* the implementation of the opaque uapi struct bpf_dynptr */
+struct bpf_dynptr_kern {
+	void *data;
+	/* Size represents the number of usable bytes of dynptr data.
+	 * If for example the offset is at 4 for a local dynptr whose data is
+	 * of type u64, the number of usable bytes is 4.
+	 *
+	 * The upper 8 bits are reserved. It is as follows:
+	 * Bits 0 - 23 = size
+	 * Bits 24 - 30 = dynptr type
+	 * Bit 31 = whether dynptr is read-only
+	 */
+	u32 size;
+	u32 offset;
+} __aligned(8);
+
+enum bpf_dynptr_type {
+	BPF_DYNPTR_TYPE_INVALID,
+	/* Points to memory that is local to the bpf program */
+	BPF_DYNPTR_TYPE_LOCAL,
+	/* Underlying data is a ringbuf record */
+	BPF_DYNPTR_TYPE_RINGBUF,
+	/* Underlying data is a sk_buff */
+	BPF_DYNPTR_TYPE_SKB,
+	/* Underlying data is a xdp_buff */
+	BPF_DYNPTR_TYPE_XDP,
+};
+
+int bpf_dynptr_check_size(u32 size);
+
 #ifdef CONFIG_BPF_JIT
 int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
 int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
@@ -1975,6 +2008,12 @@ static inline bool has_current_bpf_ctx(void)
 {
 	return !!current->bpf_ctx;
 }
+
+void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
+		     enum bpf_dynptr_type type, u32 offset, u32 size);
+void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
+
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {
@@ -2188,6 +2227,19 @@ static inline bool has_current_bpf_ctx(void)
 {
 	return false;
 }
+
+static inline void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
+				   enum bpf_dynptr_type type, u32 offset, u32 size)
+{
+}
+
+static inline void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr)
+{
+}
+
+static inline void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
+{
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
@@ -2548,35 +2600,6 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
 			u32 **bin_buf, u32 num_args);
 void bpf_bprintf_cleanup(void);
 
-/* the implementation of the opaque uapi struct bpf_dynptr */
-struct bpf_dynptr_kern {
-	void *data;
-	/* Size represents the number of usable bytes of dynptr data.
-	 * If for example the offset is at 4 for a local dynptr whose data is
-	 * of type u64, the number of usable bytes is 4.
-	 *
-	 * The upper 8 bits are reserved. It is as follows:
-	 * Bits 0 - 23 = size
-	 * Bits 24 - 30 = dynptr type
-	 * Bit 31 = whether dynptr is read-only
-	 */
-	u32 size;
-	u32 offset;
-} __aligned(8);
-
-enum bpf_dynptr_type {
-	BPF_DYNPTR_TYPE_INVALID,
-	/* Points to memory that is local to the bpf program */
-	BPF_DYNPTR_TYPE_LOCAL,
-	/* Underlying data is a ringbuf record */
-	BPF_DYNPTR_TYPE_RINGBUF,
-};
-
-void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
-		     enum bpf_dynptr_type type, u32 offset, u32 size);
-void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
-int bpf_dynptr_check_size(u32 size);
-
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
 void bpf_cgroup_atype_put(int cgroup_atype);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5f21dc3c432..649063d9cbfd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1532,4 +1532,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 	return XDP_REDIRECT;
 }
 
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 934a2a8beb87..320e6b95d95c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5253,11 +5253,22 @@ union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  All data slices of the dynptr are automatically
+ *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
+ *		       avoid this, please perform the write using direct data slices
+ *		       instead.
+ *
+ *		    *  For *flags*, please see the flags accepted by
+ *		       **bpf_skb_store_bytes**\ ().
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
+ *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
  *
  * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
@@ -5265,10 +5276,20 @@ union bpf_attr {
  *
  *		*len* must be a statically known value. The returned data slice
  *		is invalidated whenever the dynptr is invalidated.
+ *
+ *		For skb-type dynptrs:
+ *		    * If *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
+ *		      and try again.
+ *
+ *		    * The data slice is automatically invalidated anytime
+ *		      **bpf_dynptr_write**\ () or a helper call that changes
+ *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5355,6 +5376,18 @@ union bpf_attr {
  *	Return
  *		Current *ktime*.
  *
+ * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
+ *		context. Depending on program type, the dynptr may be read-only.
+ *
+ *		Calls that change the *skb*'s underlying packet buffer
+ *		(eg **bpf_skb_pull_data**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success or -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5566,6 +5599,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
+	FN(dynptr_from_skb),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 3c1b9bbcf971..471a01a9b6ae 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1437,11 +1437,21 @@ static bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)
 	return ptr->size & DYNPTR_RDONLY_BIT;
 }
 
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
+{
+	ptr->size |= DYNPTR_RDONLY_BIT;
+}
+
 static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
 {
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
+static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
+{
+	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
+}
+
 static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_SIZE_MASK;
@@ -1512,6 +1522,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
 BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src,
 	   u32, offset, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
 	if (!src->data || flags)
@@ -1521,9 +1532,19 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 	if (err)
 		return err;
 
-	memcpy(dst, src->data + src->offset + offset, len);
+	type = bpf_dynptr_get_type(src);
 
-	return 0;
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		memcpy(dst, src->data + src->offset + offset, len);
+		return 0;
+	case BPF_DYNPTR_TYPE_SKB:
+		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	default:
+		WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
+		return -EFAULT;
+	}
 }
 
 static const struct bpf_func_proto bpf_dynptr_read_proto = {
@@ -1540,18 +1561,32 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
 BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
 	   u32, len, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
-	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
+	if (!dst->data || bpf_dynptr_is_rdonly(dst))
 		return -EINVAL;
 
 	err = bpf_dynptr_check_off_len(dst, offset, len);
 	if (err)
 		return err;
 
-	memcpy(dst->data + dst->offset + offset, src, len);
+	type = bpf_dynptr_get_type(dst);
 
-	return 0;
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		if (flags)
+			return -EINVAL;
+		memcpy(dst->data + dst->offset + offset, src, len);
+		return 0;
+	case BPF_DYNPTR_TYPE_SKB:
+		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
+					     flags);
+	default:
+		WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
+		return -EFAULT;
+	}
 }
 
 static const struct bpf_func_proto bpf_dynptr_write_proto = {
@@ -1567,6 +1602,9 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
 
 BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
 {
+	enum bpf_dynptr_type type;
+	bool is_rdonly;
+	void *data;
 	int err;
 
 	if (!ptr->data)
@@ -1576,10 +1614,37 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 	if (err)
 		return 0;
 
-	if (bpf_dynptr_is_rdonly(ptr))
-		return 0;
+	type = bpf_dynptr_get_type(ptr);
+
+	/* Only skb dynptrs can get read-only data slices, because the
+	 * verifier enforces PTR_TO_PACKET accesses
+	 */
+	is_rdonly = bpf_dynptr_is_rdonly(ptr);
+
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		if (is_rdonly)
+			return 0;
+
+		data = ptr->data;
+		break;
+	case BPF_DYNPTR_TYPE_SKB:
+	{
+		struct sk_buff *skb = ptr->data;
 
-	return (unsigned long)(ptr->data + ptr->offset + offset);
+		/* if the data is paged, the caller needs to pull it first */
+		if (ptr->offset + offset + len > skb->len - skb->data_len)
+			return 0;
+
+		data = skb->data;
+		break;
+	}
+	default:
+		WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
+		return 0;
+	}
+	return (unsigned long)(data + ptr->offset + offset);
 }
 
 static const struct bpf_func_proto bpf_dynptr_data_proto = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2c1f8069f7b7..1ea295f47525 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -684,6 +684,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_LOCAL;
 	case DYNPTR_TYPE_RINGBUF:
 		return BPF_DYNPTR_TYPE_RINGBUF;
+	case DYNPTR_TYPE_SKB:
+		return BPF_DYNPTR_TYPE_SKB;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -5826,12 +5828,29 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	return __check_ptr_off_reg(env, reg, regno, fixed_off_ok);
 }
 
-static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+static struct bpf_reg_state *get_dynptr_arg_reg(const struct bpf_func_proto *fn,
+						struct bpf_reg_state *regs)
+{
+	int i;
+
+	for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++)
+		if (arg_type_is_dynptr(fn->arg_type[i]))
+			return &regs[BPF_REG_1 + i];
+
+	return NULL;
+}
+
+static enum bpf_dynptr_type stack_slot_get_dynptr_info(struct bpf_verifier_env *env,
+						       struct bpf_reg_state *reg,
+						       int *ref_obj_id)
 {
 	struct bpf_func_state *state = func(env, reg);
 	int spi = get_spi(reg->off);
 
-	return state->stack[spi].spilled_ptr.id;
+	if (ref_obj_id)
+		*ref_obj_id = state->stack[spi].spilled_ptr.id;
+
+	return state->stack[spi].spilled_ptr.dynptr.type;
 }
 
 static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
@@ -6056,7 +6075,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			case DYNPTR_TYPE_RINGBUF:
 				err_extra = "ringbuf ";
 				break;
-			default:
+			case DYNPTR_TYPE_SKB:
+				err_extra = "skb ";
 				break;
 			}
 
@@ -7149,6 +7169,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 {
 	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
 	const struct bpf_func_proto *fn = NULL;
+	enum bpf_dynptr_type dynptr_type;
 	enum bpf_return_type ret_type;
 	enum bpf_type_flag ret_flag;
 	struct bpf_reg_state *regs;
@@ -7320,24 +7341,43 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			}
 		}
 		break;
-	case BPF_FUNC_dynptr_data:
-		for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
-			if (arg_type_is_dynptr(fn->arg_type[i])) {
-				if (meta.ref_obj_id) {
-					verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
-					return -EFAULT;
-				}
-				/* Find the id of the dynptr we're tracking the reference of */
-				meta.ref_obj_id = stack_slot_get_id(env, &regs[BPF_REG_1 + i]);
-				break;
-			}
+	case BPF_FUNC_dynptr_write:
+	{
+		struct bpf_reg_state *reg;
+
+		reg = get_dynptr_arg_reg(fn, regs);
+		if (!reg) {
+			verbose(env, "verifier internal error: no dynptr in bpf_dynptr_write()\n");
+			return -EFAULT;
 		}
-		if (i == MAX_BPF_FUNC_REG_ARGS) {
+
+		/* bpf_dynptr_write() for skb-type dynptrs may pull the skb, so we must
+		 * invalidate all data slices associated with it
+		 */
+		if (stack_slot_get_dynptr_info(env, reg, NULL) == BPF_DYNPTR_TYPE_SKB)
+			changes_data = true;
+
+		break;
+	}
+	case BPF_FUNC_dynptr_data:
+	{
+		struct bpf_reg_state *reg;
+
+		reg = get_dynptr_arg_reg(fn, regs);
+		if (!reg) {
 			verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
 			return -EFAULT;
 		}
+
+		if (meta.ref_obj_id) {
+			verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
+			return -EFAULT;
+		}
+
+		dynptr_type = stack_slot_get_dynptr_info(env, reg, &meta.ref_obj_id);
 		break;
 	}
+	}
 
 	if (err)
 		return err;
@@ -7397,8 +7437,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		break;
 	case RET_PTR_TO_ALLOC_MEM:
 		mark_reg_known_zero(env, regs, BPF_REG_0);
-		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
-		regs[BPF_REG_0].mem_size = meta.mem_size;
+
+		if (func_id == BPF_FUNC_dynptr_data &&
+		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
+			regs[BPF_REG_0].range = meta.mem_size;
+		} else {
+			regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
+			regs[BPF_REG_0].mem_size = meta.mem_size;
+		}
 		break;
 	case RET_PTR_TO_MEM_OR_BTF_ID:
 	{
@@ -14141,6 +14188,24 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto patch_call_imm;
 		}
 
+		if (insn->imm == BPF_FUNC_dynptr_from_skb) {
+			bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
+
+			insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, is_rdonly);
+			insn_buf[1] = *insn;
+			cnt = 2;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta += cnt - 1;
+			env->prog = new_prog;
+			prog = new_prog;
+			insn = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
+
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
 		 * only.
diff --git a/net/core/filter.c b/net/core/filter.c
index 1acfaffeaf32..5b204b42fb3e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1681,8 +1681,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
 		skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
 }
 
-BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
-	   const void *, from, u32, len, u64, flags)
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags)
 {
 	void *ptr;
 
@@ -1707,6 +1707,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
 	return 0;
 }
 
+BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
+	   const void *, from, u32, len, u64, flags)
+{
+	return __bpf_skb_store_bytes(skb, offset, from, len, flags);
+}
+
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.func		= bpf_skb_store_bytes,
 	.gpl_only	= false,
@@ -1718,8 +1724,7 @@ static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
-	   void *, to, u32, len)
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
 {
 	void *ptr;
 
@@ -1738,6 +1743,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
 	return -EFAULT;
 }
 
+BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
+	   void *, to, u32, len)
+{
+	return __bpf_skb_load_bytes(skb, offset, to, len);
+}
+
 static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
 	.func		= bpf_skb_load_bytes,
 	.gpl_only	= false,
@@ -1849,6 +1860,32 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+/* is_rdonly is set by the verifier */
+BPF_CALL_4(bpf_dynptr_from_skb, struct sk_buff *, skb, u64, flags,
+	   struct bpf_dynptr_kern *, ptr, u32, is_rdonly)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
+
+	if (is_rdonly)
+		bpf_dynptr_set_rdonly(ptr);
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_skb_proto = {
+	.func		= bpf_dynptr_from_skb,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT,
+};
+
 BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
 {
 	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
@@ -7726,6 +7763,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -7909,6 +7948,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -8104,6 +8145,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skc_lookup_tcp:
 		return &bpf_skc_lookup_tcp_proto;
 #endif
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -8142,6 +8185,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_skb_under_cgroup:
 		return &bpf_skb_under_cgroup_proto;
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1d6085e15fc8..3f1800a2b77c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5253,11 +5253,22 @@ union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  All data slices of the dynptr are automatically
+ *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
+ *		       avoid this, please perform the write using direct data slices
+ *		       instead.
+ *
+ *		    *  For *flags*, please see the flags accepted by
+ *		       **bpf_skb_store_bytes**\ ().
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
+ *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
  *
  * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
@@ -5265,10 +5276,20 @@ union bpf_attr {
  *
  *		*len* must be a statically known value. The returned data slice
  *		is invalidated whenever the dynptr is invalidated.
+ *
+ *		For skb-type dynptrs:
+ *		    * If *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
+ *		      and try again.
+ *
+ *		    * The data slice is automatically invalidated anytime
+ *		      **bpf_dynptr_write**\ () or a helper call that changes
+ *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5355,6 +5376,18 @@ union bpf_attr {
  *	Return
  *		Current *ktime*.
  *
+ * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
+ *		context. Depending on program type, the dynptr may be read-only.
+ *
+ *		Calls that change the *skb*'s underlying packet buffer
+ *		(eg **bpf_skb_pull_data**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success or -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5566,6 +5599,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
+	FN(dynptr_from_skb),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2


* [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
From: Joanne Koong @ 2022-08-22 23:56 UTC
  To: bpf; +Cc: andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

Add xdp dynptrs, which are dynptrs whose underlying pointer points
to an xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (eg no manual if-checks against data_end
are needed).

Reads and writes on the dynptr are supported both within and across
fragments. For data slices, direct access to data within a single
fragment is permitted, but access across fragments is not. The
returned data slice is reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
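
A minimal sketch inside an xdp prog ("off" is a hypothetical offset
that may land in or span a fragment):

	struct bpf_dynptr ptr;
	__u8 buf[16];
	void *slice;

	if (bpf_dynptr_from_xdp(xdp, 0, &ptr))
		return XDP_DROP;

	/* the copy succeeds even if [off, off + 16) spans fragments */
	if (bpf_dynptr_read(buf, sizeof(buf), &ptr, off, 0))
		return XDP_DROP;

	/* direct access returns NULL if the bytes cross a fragment
	 * boundary
	 */
	slice = bpf_dynptr_data(&ptr, off, sizeof(buf));
	if (!slice)
		return XDP_DROP;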

Any helper call that changes the underlying packet buffer (eg
bpf_xdp_adjust_head()) invalidates any data slices of the associated
dynptr. Whenever such a helper call is made, the verifier marks any
PTR_TO_PACKET reg type (which includes xdp dynptr slices since they are
PTR_TO_PACKETs) as unknown. The call path for this is
check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown().

For examples of how xdp dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  6 ++++-
 include/linux/filter.h         |  3 +++
 include/uapi/linux/bpf.h       | 25 +++++++++++++++---
 kernel/bpf/helpers.c           | 14 ++++++++++-
 kernel/bpf/verifier.c          |  8 +++++-
 net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
 tools/include/uapi/linux/bpf.h | 25 +++++++++++++++---
 7 files changed, 112 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 30615d1a0c13..455a215b6c57 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -410,11 +410,15 @@ enum bpf_type_flag {
 	/* DYNPTR points to sk_buff */
 	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to xdp_buff */
+	DYNPTR_TYPE_XDP		= BIT(12 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
+				 | DYNPTR_TYPE_XDP)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 649063d9cbfd..80f030239877 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1535,5 +1535,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
 int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
 			  u32 len, u64 flags);
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 320e6b95d95c..9feea29eebcd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5283,13 +5283,18 @@ union bpf_attr {
  *		      and try again.
  *
  *		    * The data slice is automatically invalidated anytime
- *		      **bpf_dynptr_write**\ () or a helper call that changes
- *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      **bpf_dynptr_write**\ () is called.
+ *
+ *		For skb-type and xdp-type dynptrs:
+ *		    * The data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
  *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5388,6 +5393,19 @@ union bpf_attr {
  *		*flags* is currently unused, it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5600,6 +5618,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 471a01a9b6ae..2b9dc4c6de04 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1541,6 +1541,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 		return 0;
 	case BPF_DYNPTR_TYPE_SKB:
 		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	case BPF_DYNPTR_TYPE_XDP:
+		return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
 	default:
 		WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
 		return -EFAULT;
@@ -1583,6 +1585,10 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
 	case BPF_DYNPTR_TYPE_SKB:
 		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
 					     flags);
+	case BPF_DYNPTR_TYPE_XDP:
+		if (flags)
+			return -EINVAL;
+		return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
 	default:
 		WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
 		return -EFAULT;
@@ -1616,7 +1622,7 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 
 	type = bpf_dynptr_get_type(ptr);
 
-	/* Only skb dynptrs can get read-only data slices, because the
+	/* Only skb and xdp dynptrs can get read-only data slices, because the
 	 * verifier enforces PTR_TO_PACKET accesses
 	 */
 	is_rdonly = bpf_dynptr_is_rdonly(ptr);
@@ -1640,6 +1646,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 		data = skb->data;
 		break;
 	}
+	case BPF_DYNPTR_TYPE_XDP:
+		/* if the requested data is across fragments, then it cannot
+		 * be accessed directly - bpf_xdp_pointer will return NULL
+		 */
+		return (unsigned long)bpf_xdp_pointer(ptr->data,
+						      ptr->offset + offset, len);
 	default:
 		WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
 		return 0;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1ea295f47525..d33648eee188 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -686,6 +686,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_RINGBUF;
 	case DYNPTR_TYPE_SKB:
 		return BPF_DYNPTR_TYPE_SKB;
+	case DYNPTR_TYPE_XDP:
+		return BPF_DYNPTR_TYPE_XDP;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -6078,6 +6080,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			case DYNPTR_TYPE_SKB:
 				err_extra = "skb ";
 				break;
+			case DYNPTR_TYPE_XDP:
+				err_extra = "xdp ";
+				break;
 			}
 
 			verbose(env, "Expected an initialized %sdynptr as arg #%d\n",
@@ -7439,7 +7444,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 
 		if (func_id == BPF_FUNC_dynptr_data &&
-		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+		    (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
+		     dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
 			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
 			regs[BPF_REG_0].range = meta.mem_size;
 		} else {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5b204b42fb3e..54fbe8f511db 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3825,7 +3825,29 @@ static const struct bpf_func_proto sk_skb_change_head_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_1(bpf_xdp_get_buff_len, struct  xdp_buff*, xdp)
+BPF_CALL_3(bpf_dynptr_from_xdp, struct xdp_buff*, xdp, u64, flags,
+	   struct bpf_dynptr_kern *, ptr)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, xdp, BPF_DYNPTR_TYPE_XDP, 0, xdp_get_buff_len(xdp));
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_xdp_proto = {
+	.func		= bpf_dynptr_from_xdp,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_XDP | MEM_UNINIT,
+};
+
+BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp)
 {
 	return xdp_get_buff_len(xdp);
 }
@@ -3927,7 +3949,7 @@ static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off,
 	}
 }
 
-static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	u32 size = xdp->data_end - xdp->data;
@@ -3958,8 +3980,7 @@ static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 	return offset + len <= size ? addr + offset : NULL;
 }
 
-BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -3975,6 +3996,12 @@ BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_load_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.func		= bpf_xdp_load_bytes,
 	.gpl_only	= false,
@@ -3985,8 +4012,7 @@ static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
-BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -4002,6 +4028,12 @@ BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_store_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_store_bytes_proto = {
 	.func		= bpf_xdp_store_bytes,
 	.gpl_only	= false,
@@ -8009,6 +8041,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_xdp:
+		return &bpf_dynptr_from_xdp_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3f1800a2b77c..0d5b0117db2a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5283,13 +5283,18 @@ union bpf_attr {
  *		      and try again.
  *
  *		    * The data slice is automatically invalidated anytime
- *		      **bpf_dynptr_write**\ () or a helper call that changes
- *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      **bpf_dynptr_write**\ () is called.
+ *
+ *		For skb-type and xdp-type dynptrs:
+ *		    * The data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
  *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5388,6 +5393,19 @@ union bpf_attr {
  *		*flags* is currently unused, it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5600,6 +5618,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2


* [PATCH bpf-next v4 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
From: Joanne Koong @ 2022-08-22 23:56 UTC
  To: bpf; +Cc: andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

Test skb and xdp dynptr functionality in the following ways:

1) progs/test_cls_redirect_dynptr.c
   * Rewrite "progs/test_cls_redirect.c" test to use dynptrs to parse
     skb data

   * This is a great example of how dynptrs can be used to simplify a
     lot of the parsing logic for values that are not statically known
     at compile time, and to speed up execution.

     When measuring the user + system time between the original version
     vs. using dynptrs, and averaging the time for 10 runs (using
     "time ./test_progs -t cls_redirect"), there was a 2x speed-up:
         original version: 0.053 sec
         with dynptrs: 0.025 sec

2) progs/test_xdp_dynptr.c
   * Rewrite "progs/test_xdp.c" test to use dynptrs to parse xdp data

     There were no noticeable differences in user + system time between
     the original version vs. using dynptrs. Averaging the time for 10
     runs (run using "time ./test_progs -t xdp_bpf2bpf"):
         original version: 0.0449 sec
         with dynptrs: 0.0429 sec

3) progs/test_l4lb_noinline_dynptr.c
   * Rewrite "progs/test_l4lb_noinline.c" test to use dynptrs to parse
     skb data

     There were no noticeable differences in user + system time between
     the original version vs. using dynptrs. Averaging the time for 10
     runs (run using "time ./test_progs -t l4lb_all"):
         original version: 0.0502 sec
         with dynptrs: 0.055 sec

     For number of processed verifier instructions:
         original version: 6284 insns
         with dynptrs: 2538 insns

4) progs/test_parse_tcp_hdr_opt_dynptr.c
   * Add sample code for parsing tcp header options using dynptrs (a
     condensed sketch of the parsing loop is shown below, after this
     list). This logic is lifted from a real-world use case of packet
     parsing in katran [0], a layer 4 load balancer. The original
     version "progs/test_parse_tcp_hdr_opt.c" (not using dynptrs) is
     included here as well, for comparison.

5) progs/dynptr_success.c
   * Add test case "test_skb_readonly" for testing attempts at writes /
     data slices on a prog type with read-only skb ctx.

6) progs/dynptr_fail.c
   * Add test cases "skb_invalid_data_slice{1,2}" and
     "xdp_invalid_data_slice" for testing that helpers that modify the
     underlying packet buffer automatically invalidate the associated
     data slice.
   * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
     that prog types that do not support bpf_dynptr_from_skb/xdp don't
     have access to the API.
   * Add test case "skb_invalid_write" for testing that read-only skb
     dynptrs can't be written to through data slices.

[0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
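
The option-walking loop in 4) boils down to roughly the following shape
(condensed and with identifiers simplified, not verbatim from the
selftest; MAX_OPT_WALK and TARGET_OPT_KIND are hypothetical):

	/* walk the TCP options, looking for a target kind; every read
	 * is bounds-checked by the helper
	 */
	for (i = 0; i < MAX_OPT_WALK; i++) {
		__u8 kind, len;

		if (bpf_dynptr_read(&kind, sizeof(kind), &ptr, off, 0))
			return 0;
		if (kind == TCPOPT_EOL)
			return 0;
		if (kind == TCPOPT_NOP) {
			off++;
			continue;
		}
		if (bpf_dynptr_read(&len, sizeof(len), &ptr, off + 1, 0))
			return 0;
		if (len < 2)
			return 0;
		if (kind == TARGET_OPT_KIND)
			break;
		off += len;
	}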

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 .../selftests/bpf/prog_tests/cls_redirect.c   |  25 +
 .../testing/selftests/bpf/prog_tests/dynptr.c | 132 ++-
 .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +
 .../bpf/prog_tests/parse_tcp_hdr_opt.c        |  85 ++
 .../selftests/bpf/prog_tests/xdp_attach.c     |   9 +-
 .../testing/selftests/bpf/progs/dynptr_fail.c | 111 ++
 .../selftests/bpf/progs/dynptr_success.c      |  29 +
 .../bpf/progs/test_cls_redirect_dynptr.c      | 968 ++++++++++++++++++
 .../bpf/progs/test_l4lb_noinline_dynptr.c     | 468 +++++++++
 .../bpf/progs/test_parse_tcp_hdr_opt.c        | 119 +++
 .../bpf/progs/test_parse_tcp_hdr_opt_dynptr.c | 110 ++
 .../selftests/bpf/progs/test_xdp_dynptr.c     | 240 +++++
 .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
 13 files changed, 2255 insertions(+), 44 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_dynptr.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c
index 224f016b0a53..2a55f717fc07 100644
--- a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c
+++ b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c
@@ -13,6 +13,7 @@
 
 #include "progs/test_cls_redirect.h"
 #include "test_cls_redirect.skel.h"
+#include "test_cls_redirect_dynptr.skel.h"
 #include "test_cls_redirect_subprogs.skel.h"
 
 #define ENCAP_IP INADDR_LOOPBACK
@@ -446,6 +447,28 @@ static void test_cls_redirect_common(struct bpf_program *prog)
 	close_fds((int *)conns, sizeof(conns) / sizeof(conns[0][0]));
 }
 
+static void test_cls_redirect_dynptr(void)
+{
+	struct test_cls_redirect_dynptr *skel;
+	int err;
+
+	skel = test_cls_redirect_dynptr__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	skel->rodata->ENCAPSULATION_IP = htonl(ENCAP_IP);
+	skel->rodata->ENCAPSULATION_PORT = htons(ENCAP_PORT);
+
+	err = test_cls_redirect_dynptr__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	test_cls_redirect_common(skel->progs.cls_redirect);
+
+cleanup:
+	test_cls_redirect_dynptr__destroy(skel);
+}
+
 static void test_cls_redirect_inlined(void)
 {
 	struct test_cls_redirect *skel;
@@ -496,4 +519,6 @@ void test_cls_redirect(void)
 		test_cls_redirect_inlined();
 	if (test__start_subtest("cls_redirect_subprogs"))
 		test_cls_redirect_subprogs();
+	if (test__start_subtest("cls_redirect_dynptr"))
+		test_cls_redirect_dynptr();
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
index bcf80b9f7c27..3ec1a8b6b6fb 100644
--- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
@@ -2,51 +2,69 @@
 /* Copyright (c) 2022 Facebook */
 
 #include <test_progs.h>
+#include <network_helpers.h>
 #include "dynptr_fail.skel.h"
 #include "dynptr_success.skel.h"
 
 static size_t log_buf_sz = 1048576; /* 1 MB */
 static char obj_log_buf[1048576];
 
+enum test_setup_type {
+	/* no setup is required; the prog will just be loaded */
+	SETUP_NONE,
+	SETUP_SYSCALL_SLEEP,
+	SETUP_SKB_PROG,
+};
+
 static struct {
 	const char *prog_name;
 	const char *expected_err_msg;
+	enum test_setup_type type;
 } dynptr_tests[] = {
-	/* failure cases */
-	{"ringbuf_missing_release1", "Unreleased reference id=1"},
-	{"ringbuf_missing_release2", "Unreleased reference id=2"},
-	{"ringbuf_missing_release_callback", "Unreleased reference id"},
-	{"use_after_invalid", "Expected an initialized dynptr as arg #3"},
-	{"ringbuf_invalid_api", "type=mem expected=alloc_mem"},
-	{"add_dynptr_to_map1", "invalid indirect read from stack"},
-	{"add_dynptr_to_map2", "invalid indirect read from stack"},
-	{"data_slice_out_of_bounds_ringbuf", "value is outside of the allowed memory range"},
-	{"data_slice_out_of_bounds_map_value", "value is outside of the allowed memory range"},
-	{"data_slice_use_after_release1", "invalid mem access 'scalar'"},
-	{"data_slice_use_after_release2", "invalid mem access 'scalar'"},
-	{"data_slice_missing_null_check1", "invalid mem access 'mem_or_null'"},
-	{"data_slice_missing_null_check2", "invalid mem access 'mem_or_null'"},
-	{"invalid_helper1", "invalid indirect read from stack"},
-	{"invalid_helper2", "Expected an initialized dynptr as arg #3"},
-	{"invalid_write1", "Expected an initialized dynptr as arg #1"},
-	{"invalid_write2", "Expected an initialized dynptr as arg #3"},
-	{"invalid_write3", "Expected an initialized ringbuf dynptr as arg #1"},
-	{"invalid_write4", "arg 1 is an unacquired reference"},
-	{"invalid_read1", "invalid read from stack"},
-	{"invalid_read2", "cannot pass in dynptr at an offset"},
-	{"invalid_read3", "invalid read from stack"},
-	{"invalid_read4", "invalid read from stack"},
-	{"invalid_offset", "invalid write to stack"},
-	{"global", "type=map_value expected=fp"},
-	{"release_twice", "arg 1 is an unacquired reference"},
-	{"release_twice_callback", "arg 1 is an unacquired reference"},
+	/* these cases should trigger a verifier error */
+	{"ringbuf_missing_release1", "Unreleased reference id=1", SETUP_NONE},
+	{"ringbuf_missing_release2", "Unreleased reference id=2", SETUP_NONE},
+	{"ringbuf_missing_release_callback", "Unreleased reference id", SETUP_NONE},
+	{"use_after_invalid", "Expected an initialized dynptr as arg #3", SETUP_NONE},
+	{"ringbuf_invalid_api", "type=mem expected=alloc_mem", SETUP_NONE},
+	{"add_dynptr_to_map1", "invalid indirect read from stack", SETUP_NONE},
+	{"add_dynptr_to_map2", "invalid indirect read from stack", SETUP_NONE},
+	{"data_slice_out_of_bounds_ringbuf", "value is outside of the allowed memory range",
+		SETUP_NONE},
+	{"data_slice_out_of_bounds_map_value", "value is outside of the allowed memory range",
+		SETUP_NONE},
+	{"data_slice_use_after_release1", "invalid mem access 'scalar'", SETUP_NONE},
+	{"data_slice_use_after_release2", "invalid mem access 'scalar'", SETUP_NONE},
+	{"data_slice_missing_null_check1", "invalid mem access 'mem_or_null'", SETUP_NONE},
+	{"data_slice_missing_null_check2", "invalid mem access 'mem_or_null'", SETUP_NONE},
+	{"invalid_helper1", "invalid indirect read from stack", SETUP_NONE},
+	{"invalid_helper2", "Expected an initialized dynptr as arg #3", SETUP_NONE},
+	{"invalid_write1", "Expected an initialized dynptr as arg #1", SETUP_NONE},
+	{"invalid_write2", "Expected an initialized dynptr as arg #3", SETUP_NONE},
+	{"invalid_write3", "Expected an initialized ringbuf dynptr as arg #1", SETUP_NONE},
+	{"invalid_write4", "arg 1 is an unacquired reference", SETUP_NONE},
+	{"invalid_read1", "invalid read from stack", SETUP_NONE},
+	{"invalid_read2", "cannot pass in dynptr at an offset", SETUP_NONE},
+	{"invalid_read3", "invalid read from stack", SETUP_NONE},
+	{"invalid_read4", "invalid read from stack", SETUP_NONE},
+	{"invalid_offset", "invalid write to stack", SETUP_NONE},
+	{"global", "type=map_value expected=fp", SETUP_NONE},
+	{"release_twice", "arg 1 is an unacquired reference", SETUP_NONE},
+	{"release_twice_callback", "arg 1 is an unacquired reference", SETUP_NONE},
 	{"dynptr_from_mem_invalid_api",
-		"Unsupported reg type fp for bpf_dynptr_from_mem data"},
-
-	/* success cases */
-	{"test_read_write", NULL},
-	{"test_data_slice", NULL},
-	{"test_ringbuf", NULL},
+		"Unsupported reg type fp for bpf_dynptr_from_mem data", SETUP_NONE},
+	{"skb_invalid_data_slice1", "invalid mem access 'scalar'", SETUP_NONE},
+	{"skb_invalid_data_slice2", "invalid mem access 'scalar'", SETUP_NONE},
+	{"xdp_invalid_data_slice", "invalid mem access 'scalar'", SETUP_NONE},
+	{"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb", SETUP_NONE},
+	{"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp", SETUP_NONE},
+	{"skb_invalid_write", "cannot write into packet", SETUP_NONE},
+
+	/* these tests should run and succeed */
+	{"test_read_write", NULL, SETUP_SYSCALL_SLEEP},
+	{"test_data_slice", NULL, SETUP_SYSCALL_SLEEP},
+	{"test_ringbuf", NULL, SETUP_SYSCALL_SLEEP},
+	{"test_skb_readonly", NULL, SETUP_SKB_PROG},
 };
 
 static void verify_fail(const char *prog_name, const char *expected_err_msg)
@@ -85,7 +103,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
 	dynptr_fail__destroy(skel);
 }
 
-static void verify_success(const char *prog_name)
+static void run_test(const char *prog_name, enum test_setup_type setup_type)
 {
 	struct dynptr_success *skel;
 	struct bpf_program *prog;
@@ -107,15 +125,45 @@ static void verify_success(const char *prog_name)
 	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
 		goto cleanup;
 
-	link = bpf_program__attach(prog);
-	if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
-		goto cleanup;
+	switch (setup_type) {
+	case SETUP_SYSCALL_SLEEP:
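+		/* The success programs attach to a syscall tracepoint, so
+		 * attaching and briefly sleeping below is enough to trigger
+		 * them.
+		 */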
+		link = bpf_program__attach(prog);
+		if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+			goto cleanup;
 
-	usleep(1);
+		usleep(1);
 
-	ASSERT_EQ(skel->bss->err, 0, "err");
+		bpf_link__destroy(link);
+		break;
+	case SETUP_SKB_PROG:
+	{
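+		/* Run the program once via BPF_PROG_RUN on a synthetic IPv4
+		 * packet; buf receives the resulting packet data.
+		 */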
+		int prog_fd, err;
+		char buf[64];
+
+		LIBBPF_OPTS(bpf_test_run_opts, topts,
+			    .data_in = &pkt_v4,
+			    .data_size_in = sizeof(pkt_v4),
+			    .data_out = buf,
+			    .data_size_out = sizeof(buf),
+			    .repeat = 1,
+		);
+
+		prog_fd = bpf_program__fd(prog);
+		if (!ASSERT_GE(prog_fd, 0, "prog_fd"))
+			goto cleanup;
 
-	bpf_link__destroy(link);
+		err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+		if (!ASSERT_OK(err, "test_run"))
+			goto cleanup;
+
+		break;
+	}
+	case SETUP_NONE:
+		ASSERT_EQ(0, 1, "internal error: SETUP_NONE unimplemented");
+	}
+
+	ASSERT_EQ(skel->bss->err, 0, "err");
 
 cleanup:
 	dynptr_success__destroy(skel);
@@ -133,6 +181,6 @@ void test_dynptr(void)
 			verify_fail(dynptr_tests[i].prog_name,
 				    dynptr_tests[i].expected_err_msg);
 		else
-			verify_success(dynptr_tests[i].prog_name);
+			run_test(dynptr_tests[i].prog_name, dynptr_tests[i].type);
 	}
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
index 55f733ff4109..94079c89f2e9 100644
--- a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
+++ b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
@@ -93,4 +93,6 @@ void test_l4lb_all(void)
 		test_l4lb("test_l4lb.o");
 	if (test__start_subtest("l4lb_noinline"))
 		test_l4lb("test_l4lb_noinline.o");
+	if (test__start_subtest("l4lb_noinline_dynptr"))
+		test_l4lb("test_l4lb_noinline_dynptr.o");
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c b/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
new file mode 100644
index 000000000000..0fe729ccedca
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "test_parse_tcp_hdr_opt.skel.h"
+#include "test_parse_tcp_hdr_opt_dynptr.skel.h"
+#include "test_tcp_hdr_options.h"
+
+struct test_pkt {
+	struct ipv6_packet pk6_v6;
+	u8 options[16];
+} __packed;
+
+struct test_pkt pkt = {
+	.pk6_v6.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+	.pk6_v6.iph.nexthdr = IPPROTO_TCP,
+	.pk6_v6.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+	.pk6_v6.tcp.urg_ptr = 123,
+	.pk6_v6.tcp.doff = 9, /* 16 bytes of options */
+
+	.options = {
+		TCPOPT_MSS, 4, 0x05, 0xB4, TCPOPT_NOP, TCPOPT_NOP,
+		0, 6, 0, 0, 0, 9, TCPOPT_EOL
+	},
+};
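+
+/* The options above are: an MSS option (kind TCPOPT_MSS, length 4, MSS
+ * 0x05B4 = 1460), two NOPs, then the option under test: options[6] is
+ * patched at runtime with the program's tcp_hdr_opt_kind_tpr, followed by
+ * length 6 and server id bytes {0, 0, 0, 9} (0x9000000 when read as a
+ * little-endian __u32), terminated by TCPOPT_EOL.
+ */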
+
+static void test_parsing(bool use_dynptr)
+{
+	char buf[128];
+	struct bpf_program *prog;
+	void *skel_ptr;
+	int err;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		    .data_in = &pkt,
+		    .data_size_in = sizeof(pkt),
+		    .data_out = buf,
+		    .data_size_out = sizeof(buf),
+		    .repeat = 3,
+	);
+
+	if (use_dynptr) {
+		struct test_parse_tcp_hdr_opt_dynptr *skel;
+
+		skel = test_parse_tcp_hdr_opt_dynptr__open_and_load();
+		if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+			return;
+
+		pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr;
+		prog = skel->progs.xdp_ingress_v6;
+		skel_ptr = skel;
+	} else {
+		struct test_parse_tcp_hdr_opt *skel;
+
+		skel = test_parse_tcp_hdr_opt__open_and_load();
+		if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+			return;
+
+		pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr;
+		prog = skel->progs.xdp_ingress_v6;
+		skel_ptr = skel;
+	}
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &topts);
+	ASSERT_OK(err, "ipv6 test_run");
+	ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval");
+
+	if (use_dynptr) {
+		struct test_parse_tcp_hdr_opt_dynptr *skel = skel_ptr;
+
+		ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
+		test_parse_tcp_hdr_opt_dynptr__destroy(skel);
+	} else {
+		struct test_parse_tcp_hdr_opt *skel = skel_ptr;
+
+		ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
+		test_parse_tcp_hdr_opt__destroy(skel);
+	}
+}
+
+void test_parse_tcp_hdr_opt(void)
+{
+	test_parsing(false);
+	test_parsing(true);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
index 62aa3edda5e6..40d0d61af9e6 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
@@ -4,11 +4,10 @@
 #define IFINDEX_LO 1
 #define XDP_FLAGS_REPLACE		(1U << 4)
 
-void serial_test_xdp_attach(void)
+static void serial_test_xdp_attach(const char *file)
 {
 	__u32 duration = 0, id1, id2, id0 = 0, len;
 	struct bpf_object *obj1, *obj2, *obj3;
-	const char *file = "./test_xdp.o";
 	struct bpf_prog_info info = {};
 	int err, fd1, fd2, fd3;
 	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
@@ -85,3 +84,9 @@ void serial_test_xdp_attach(void)
 out_1:
 	bpf_object__close(obj1);
 }
+
+void test_xdp_attach(void)
+{
+	serial_test_xdp_attach("./test_xdp.o");
+	serial_test_xdp_attach("./test_xdp_dynptr.o");
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index b0f08ff024fb..141765b2fcb5 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -5,6 +5,7 @@
 #include <string.h>
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>
+#include <linux/if_ether.h>
 #include "bpf_misc.h"
 
 char _license[] SEC("license") = "GPL";
@@ -622,3 +623,113 @@ int dynptr_from_mem_invalid_api(void *ctx)
 
 	return 0;
 }
+
+/* The data slice is invalidated whenever a helper changes packet data */
+SEC("?tc")
+int skb_invalid_data_slice1(struct __sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+
+	if (bpf_skb_pull_data(skb, skb->len))
+		return SK_DROP;
+
+	if (!hdr)
+		return SK_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return SK_PASS;
+}
+
+/* The data slice is invalidated whenever bpf_dynptr_write() is called */
+SEC("?tc")
+int skb_invalid_data_slice2(struct __sk_buff *skb)
+{
+	char write_data[64] = "hello there, world!!";
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+
+	bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
+
+	if (!hdr)
+		return SK_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return SK_PASS;
+}
+
+/* The data slice is invalidated whenever a helper changes packet data */
+SEC("?xdp")
+int xdp_invalid_data_slice(struct xdp_md *xdp)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+	if (!hdr)
+		return SK_DROP;
+
+	hdr->h_proto = 9;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr)))
+		return XDP_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return XDP_PASS;
+}
+
+/* Only supported prog type can create skb-type dynptrs */
+SEC("?raw_tp/sys_nanosleep")
+int skb_invalid_ctx(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	/* this should fail */
+	bpf_dynptr_from_skb(ctx, 0, &ptr);
+
+	return 0;
+}
+
+/* Only supported prog type can create xdp-type dynptrs */
+SEC("?raw_tp/sys_nanosleep")
+int xdp_invalid_ctx(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	/* this should fail */
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+
+	return 0;
+}
+
+/* Read-only skb packet buffers can't be written to through data slices */
+SEC("?cgroup_skb/egress")
+int skb_invalid_write(struct __sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	__u64 *data;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+
+	data = bpf_dynptr_data(&ptr, 0, sizeof(*data));
+	if (!data)
+		return 0;
+
+	/* this should fail */
+	*data = 123;
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_success.c b/tools/testing/selftests/bpf/progs/dynptr_success.c
index a3a6103c8569..d0e504d7aae7 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_success.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_success.c
@@ -162,3 +162,32 @@ int test_ringbuf(void *ctx)
 	bpf_ringbuf_discard_dynptr(&ptr, 0);
 	return 0;
 }
+
+SEC("cgroup_skb/egress")
+int test_skb_readonly(struct __sk_buff *skb)
+{
+	__u8 write_data[2] = {1, 2};
+	struct bpf_dynptr ptr;
+	__u64 *data;
+	int ret;
+
+	if (bpf_dynptr_from_skb(skb, 0, &ptr)) {
+		err = 1;
+		return 0;
+	}
+
+	data = bpf_dynptr_data(&ptr, 0, sizeof(*data));
+	if (!data) {
+		err = 2;
+		return 0;
+	}
+
+	/* since cgroup skbs are read-only, writes should fail */
+	ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
+	if (ret != -EINVAL) {
+		err = 3;
+		return 0;
+	}
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c b/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
new file mode 100644
index 000000000000..9549ff7f16b9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2019, 2020 Cloudflare
+
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <linux/bpf.h>
+#include <linux/icmp.h>
+#include <linux/icmpv6.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include "test_cls_redirect.h"
+
+#define offsetofend(TYPE, MEMBER) \
+	(offsetof(TYPE, MEMBER) + sizeof((((TYPE *)0)->MEMBER)))
+
+#define IP_OFFSET_MASK (0x1FFF)
+#define IP_MF (0x2000)
+
+char _license[] SEC("license") = "Dual BSD/GPL";
+
+/**
+ * Destination port and IP used for UDP encapsulation.
+ */
+volatile const __be16 ENCAPSULATION_PORT;
+volatile const __be32 ENCAPSULATION_IP;
+
+typedef struct {
+	uint64_t processed_packets_total;
+	uint64_t l3_protocol_packets_total_ipv4;
+	uint64_t l3_protocol_packets_total_ipv6;
+	uint64_t l4_protocol_packets_total_tcp;
+	uint64_t l4_protocol_packets_total_udp;
+	uint64_t accepted_packets_total_syn;
+	uint64_t accepted_packets_total_syn_cookies;
+	uint64_t accepted_packets_total_last_hop;
+	uint64_t accepted_packets_total_icmp_echo_request;
+	uint64_t accepted_packets_total_established;
+	uint64_t forwarded_packets_total_gue;
+	uint64_t forwarded_packets_total_gre;
+
+	uint64_t errors_total_unknown_l3_proto;
+	uint64_t errors_total_unknown_l4_proto;
+	uint64_t errors_total_malformed_ip;
+	uint64_t errors_total_fragmented_ip;
+	uint64_t errors_total_malformed_icmp;
+	uint64_t errors_total_unwanted_icmp;
+	uint64_t errors_total_malformed_icmp_pkt_too_big;
+	uint64_t errors_total_malformed_tcp;
+	uint64_t errors_total_malformed_udp;
+	uint64_t errors_total_icmp_echo_replies;
+	uint64_t errors_total_malformed_encapsulation;
+	uint64_t errors_total_encap_adjust_failed;
+	uint64_t errors_total_encap_buffer_too_small;
+	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
+} metrics_t;
+
+typedef enum {
+	INVALID = 0,
+	UNKNOWN,
+	ECHO_REQUEST,
+	SYN,
+	SYN_COOKIE,
+	ESTABLISHED,
+} verdict_t;
+
+typedef struct {
+	uint16_t src, dst;
+} flow_ports_t;
+
+_Static_assert(
+	sizeof(flow_ports_t) !=
+		offsetofend(struct bpf_sock_tuple, ipv4.dport) -
+			offsetof(struct bpf_sock_tuple, ipv4.sport) - 1,
+	"flow_ports_t must match sport and dport in struct bpf_sock_tuple");
+_Static_assert(
+	sizeof(flow_ports_t) !=
+		offsetofend(struct bpf_sock_tuple, ipv6.dport) -
+			offsetof(struct bpf_sock_tuple, ipv6.sport) - 1,
+	"flow_ports_t must match sport and dport in struct bpf_sock_tuple");
+
+struct iphdr_info {
+	void *hdr;
+	__u64 len;
+};
+
+typedef int ret_t;
+
+/* This is a bit of a hack. We need a return value which allows us to
+ * indicate that the regular flow of the program should continue,
+ * while allowing functions to use XDP_PASS and XDP_DROP, etc.
+ */
+static const ret_t CONTINUE_PROCESSING = -1;
+
+/* Convenience macro to call functions which return ret_t.
+ */
+#define MAYBE_RETURN(x)                           \
+	do {                                      \
+		ret_t __ret = x;                  \
+		if (__ret != CONTINUE_PROCESSING) \
+			return __ret;             \
+	} while (0)
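+
+/* e.g. the main program below can write
+ *
+ *	MAYBE_RETURN(get_next_hop(&dynptr, &off, encap, &next_hop));
+ *
+ * and only return early once a real TC_ACT_* verdict has been produced.
+ */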
+
+static bool ipv4_is_fragment(const struct iphdr *ip)
+{
+	uint16_t frag_off = ip->frag_off & bpf_htons(IP_OFFSET_MASK);
+	return (ip->frag_off & bpf_htons(IP_MF)) != 0 || frag_off > 0;
+}
+
+static int pkt_parse_ipv4(struct bpf_dynptr *dynptr, __u64 *offset, struct iphdr *iphdr)
+{
+	if (bpf_dynptr_read(iphdr, sizeof(*iphdr), dynptr, *offset, 0))
+		return -1;
+
+	*offset += sizeof(*iphdr);
+
+	if (iphdr->ihl < 5)
+		return -1;
+
+	/* skip ipv4 options */
+	*offset += (iphdr->ihl - 5) * 4;
+
+	return 0;
+}
+
+/* Parse the L4 ports from a packet, assuming a layout like TCP or UDP. */
+static bool pkt_parse_icmp_l4_ports(struct bpf_dynptr *dynptr, __u64 *offset, flow_ports_t *ports)
+{
+	if (bpf_dynptr_read(ports, sizeof(*ports), dynptr, *offset, 0))
+		return false;
+
+	*offset += sizeof(*ports);
+
+	/* Ports in the L4 headers are reversed, since we are parsing an ICMP
+	 * payload which is going towards the eyeball.
+	 */
+	uint16_t dst = ports->src;
+	ports->src = ports->dst;
+	ports->dst = dst;
+	return true;
+}
+
+static uint16_t pkt_checksum_fold(uint32_t csum)
+{
+	/* The highest reasonable value for an IPv4 header
+	 * checksum requires two folds, so we just do that always.
+	 */
+	csum = (csum & 0xffff) + (csum >> 16);
+	csum = (csum & 0xffff) + (csum >> 16);
+	return (uint16_t)~csum;
+}
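+
+/* Worked example: folding the worst-case sum 0x8fff7 yields
+ * 0xfff7 + 0x8 = 0xffff after the first fold (the second fold is then a
+ * no-op), and the final one's complement is ~0xffff == 0.
+ */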
+
+static void pkt_ipv4_checksum(struct iphdr *iph)
+{
+	iph->check = 0;
+
+	/* An IP header without options is 20 bytes. Two of those
+	 * are the checksum, which we always set to zero. Hence,
+	 * the maximum accumulated value is 18 / 2 * 0xffff = 0x8fff7,
+	 * which fits in 32 bits.
+	 */
+	_Static_assert(sizeof(struct iphdr) == 20, "iphdr must be 20 bytes");
+	uint32_t acc = 0;
+	uint16_t *ipw = (uint16_t *)iph;
+
+	for (size_t i = 0; i < sizeof(struct iphdr) / 2; i++)
+		acc += ipw[i];
+
+	iph->check = pkt_checksum_fold(acc);
+}
+
+static bool pkt_skip_ipv6_extension_headers(struct bpf_dynptr *dynptr, __u64 *offset,
+					    const struct ipv6hdr *ipv6, uint8_t *upper_proto,
+					    bool *is_fragment)
+{
+	/* We understand five extension headers.
+	 * https://tools.ietf.org/html/rfc8200#section-4.1 states that all
+	 * headers should occur once, except Destination Options, which may
+	 * occur twice. Hence we give up after 6 headers.
+	 */
+	struct {
+		uint8_t next;
+		uint8_t len;
+	} exthdr = {
+		.next = ipv6->nexthdr,
+	};
+	*is_fragment = false;
+
+	for (int i = 0; i < 6; i++) {
+		switch (exthdr.next) {
+		case IPPROTO_FRAGMENT:
+			*is_fragment = true;
+			/* NB: We don't check that hdrlen == 0 as per spec. */
+			/* fallthrough; */
+
+		case IPPROTO_HOPOPTS:
+		case IPPROTO_ROUTING:
+		case IPPROTO_DSTOPTS:
+		case IPPROTO_MH:
+			if (bpf_dynptr_read(&exthdr, sizeof(exthdr), dynptr, *offset, 0))
+				return false;
+
+			/* hdrlen is in 8-octet units, and excludes the first 8 octets. */
+			*offset += (exthdr.len + 1) * 8;
+
+			/* Decode next header */
+			break;
+
+		default:
+			/* The next header is not one of the known extension
+			 * headers, treat it as the upper layer header.
+			 *
+			 * This handles IPPROTO_NONE.
+			 *
+			 * Encapsulating Security Payload (50) and Authentication
+			 * Header (51) also end up here (and will trigger an
+			 * unknown proto error later). They have a custom header
+			 * format and seem too esoteric to care about.
+			 */
+			*upper_proto = exthdr.next;
+			return true;
+		}
+	}
+
+	/* We never found an upper layer header. */
+	return false;
+}
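+
+/* e.g. a Destination Options header with exthdr.len == 1 occupies
+ * (1 + 1) * 8 = 16 octets, so *offset advances by 16 in the loop above.
+ */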
+
+static int pkt_parse_ipv6(struct bpf_dynptr *dynptr, __u64 *offset, struct ipv6hdr *ipv6,
+			  uint8_t *proto, bool *is_fragment)
+{
+	if (bpf_dynptr_read(ipv6, sizeof(*ipv6), dynptr, *offset, 0))
+		return -1;
+
+	*offset += sizeof(*ipv6);
+
+	if (!pkt_skip_ipv6_extension_headers(dynptr, offset, ipv6, proto, is_fragment))
+		return -1;
+
+	return 0;
+}
+
+/* Global metrics, per CPU
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, unsigned int);
+	__type(value, metrics_t);
+} metrics_map SEC(".maps");
+
+static metrics_t *get_global_metrics(void)
+{
+	uint64_t key = 0;
+	return bpf_map_lookup_elem(&metrics_map, &key);
+}
+
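+/* Strip the encapsulation headers and bounce the decapsulated packet back
+ * in on the same interface, so that the kernel delivers it locally.
+ */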
+static ret_t accept_locally(struct __sk_buff *skb, encap_headers_t *encap)
+{
+	const int payload_off =
+		sizeof(*encap) +
+		sizeof(struct in_addr) * encap->unigue.hop_count;
+	int32_t encap_overhead = payload_off - sizeof(struct ethhdr);
+
+	/* Change the ethertype if the encapsulated packet is IPv6 */
+	if (encap->gue.proto_ctype == IPPROTO_IPV6)
+		encap->eth.h_proto = bpf_htons(ETH_P_IPV6);
+
+	if (bpf_skb_adjust_room(skb, -encap_overhead, BPF_ADJ_ROOM_MAC,
+				BPF_F_ADJ_ROOM_FIXED_GSO |
+				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
+	    bpf_csum_level(skb, BPF_CSUM_LEVEL_DEC))
+		return TC_ACT_SHOT;
+
+	return bpf_redirect(skb->ifindex, BPF_F_INGRESS);
+}
+
+static ret_t forward_with_gre(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+			      encap_headers_t *encap, struct in_addr *next_hop,
+			      metrics_t *metrics)
+{
+	const int payload_off =
+		sizeof(*encap) +
+		sizeof(struct in_addr) * encap->unigue.hop_count;
+	int32_t encap_overhead =
+		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
+	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
+	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;
+	encap_gre_t *encap_gre;
+
+	metrics->forwarded_packets_total_gre++;
+
+	/* Loop protection: the inner packet's TTL is decremented as a safeguard
+	 * against any forwarding loop. As the only field of interest is the TTL
+	 * (hop_limit for IPv6), it is easier to use bpf_skb_load_bytes() and
+	 * bpf_skb_store_bytes(), which handle split packets if needed (no need
+	 * for the data to be in the linear section).
+	 */
+	if (encap->gue.proto_ctype == IPPROTO_IPV6) {
+		proto = ETH_P_IPV6;
+		uint8_t ttl;
+		int rc;
+
+		rc = bpf_skb_load_bytes(
+			skb, payload_off + offsetof(struct ipv6hdr, hop_limit),
+			&ttl, 1);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+
+		if (ttl == 0) {
+			metrics->errors_total_redirect_loop++;
+			return TC_ACT_SHOT;
+		}
+
+		ttl--;
+		rc = bpf_skb_store_bytes(
+			skb, payload_off + offsetof(struct ipv6hdr, hop_limit),
+			&ttl, 1, 0);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+	} else {
+		uint8_t ttl;
+		int rc;
+
+		rc = bpf_skb_load_bytes(
+			skb, payload_off + offsetof(struct iphdr, ttl), &ttl,
+			1);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+
+		if (ttl == 0) {
+			metrics->errors_total_redirect_loop++;
+			return TC_ACT_SHOT;
+		}
+
+		/* IPv4 also has a checksum to patch. While the TTL is only one byte,
+		 * this function only works for 2 and 4 bytes arguments (the result is
+		 * the same).
+		 */
+		rc = bpf_l3_csum_replace(
+			skb, payload_off + offsetof(struct iphdr, check), ttl,
+			ttl - 1, 2);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+
+		ttl--;
+		rc = bpf_skb_store_bytes(
+			skb, payload_off + offsetof(struct iphdr, ttl), &ttl, 1,
+			0);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+	}
+
+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
+	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
+				BPF_F_ADJ_ROOM_FIXED_GSO |
+				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
+	    bpf_csum_level(skb, BPF_CSUM_LEVEL_INC)) {
+		metrics->errors_total_encap_adjust_failed++;
+		return TC_ACT_SHOT;
+	}
+
+	if (bpf_skb_pull_data(skb, sizeof(encap_gre_t))) {
+		metrics->errors_total_encap_buffer_too_small++;
+		return TC_ACT_SHOT;
+	}
+
+	encap_gre = bpf_dynptr_data(dynptr, 0, sizeof(encap_gre_t));
+	if (!encap_gre) {
+		metrics->errors_total_encap_buffer_too_small++;
+		return TC_ACT_SHOT;
+	}
+
+	encap_gre->ip.protocol = IPPROTO_GRE;
+	encap_gre->ip.daddr = next_hop->s_addr;
+	encap_gre->ip.saddr = ENCAPSULATION_IP;
+	encap_gre->ip.tot_len =
+		bpf_htons(bpf_ntohs(encap_gre->ip.tot_len) + delta);
+	encap_gre->gre.flags = 0;
+	encap_gre->gre.protocol = bpf_htons(proto);
+	pkt_ipv4_checksum((void *)&encap_gre->ip);
+
+	return bpf_redirect(skb->ifindex, 0);
+}
+
+static ret_t forward_to_next_hop(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+				 encap_headers_t *encap, struct in_addr *next_hop,
+				 metrics_t *metrics)
+{
+	/* swap L2 addresses */
+	/* This assumes that packets are received from a router.
+	 * So just swapping the MAC addresses here will make the packet go back to
+	 * the router, which will send it to the appropriate machine.
+	 */
+	unsigned char temp[ETH_ALEN];
+	memcpy(temp, encap->eth.h_dest, sizeof(temp));
+	memcpy(encap->eth.h_dest, encap->eth.h_source,
+	       sizeof(encap->eth.h_dest));
+	memcpy(encap->eth.h_source, temp, sizeof(encap->eth.h_source));
+
+	if (encap->unigue.next_hop == encap->unigue.hop_count - 1 &&
+	    encap->unigue.last_hop_gre) {
+		return forward_with_gre(skb, dynptr, encap, next_hop, metrics);
+	}
+
+	metrics->forwarded_packets_total_gue++;
+	uint32_t old_saddr = encap->ip.saddr;
+	encap->ip.saddr = encap->ip.daddr;
+	encap->ip.daddr = next_hop->s_addr;
+	if (encap->unigue.next_hop < encap->unigue.hop_count) {
+		encap->unigue.next_hop++;
+	}
+
+	/* Remove ip->saddr, add next_hop->s_addr */
+	const uint64_t off = offsetof(typeof(*encap), ip.check);
+	int ret = bpf_l3_csum_replace(skb, off, old_saddr, next_hop->s_addr, 4);
+	if (ret < 0) {
+		return TC_ACT_SHOT;
+	}
+
+	return bpf_redirect(skb->ifindex, 0);
+}
+
+static ret_t skip_next_hops(__u64 *offset, int n)
+{
+	switch (n) {
+	case 1:
+		*offset += sizeof(struct in_addr);
+		/* fallthrough */
+	case 0:
+		return CONTINUE_PROCESSING;
+
+	default:
+		return TC_ACT_SHOT;
+	}
+}
+
+/* Get the next hop from the GLB header.
+ *
+ * Sets next_hop->s_addr to 0 if there are no more hops left.
+ * pkt is positioned just after the variable length GLB header
+ * iff the call is successful.
+ */
+static ret_t get_next_hop(struct bpf_dynptr *dynptr, __u64 *offset, encap_headers_t *encap,
+			  struct in_addr *next_hop)
+{
+	if (encap->unigue.next_hop > encap->unigue.hop_count)
+		return TC_ACT_SHOT;
+
+	/* Skip "used" next hops. */
+	MAYBE_RETURN(skip_next_hops(offset, encap->unigue.next_hop));
+
+	if (encap->unigue.next_hop == encap->unigue.hop_count) {
+		/* No more next hops, we are at the end of the GLB header. */
+		next_hop->s_addr = 0;
+		return CONTINUE_PROCESSING;
+	}
+
+	if (bpf_dynptr_read(next_hop, sizeof(*next_hop), dynptr, *offset, 0))
+		return TC_ACT_SHOT;
+
+	*offset += sizeof(*next_hop);
+
+	/* Skip the remaining next hops (may be zero). */
+	return skip_next_hops(offset, encap->unigue.hop_count - encap->unigue.next_hop - 1);
+}
+
+/* Fill a bpf_sock_tuple to be used with the socket lookup functions.
+ * This is a kludge that lets us work around verifier limitations:
+ *
+ *    fill_tuple(&t, foo, sizeof(struct iphdr), 123, 321)
+ *
+ * clang will substitute a constant for sizeof, which allows the verifier
+ * to track its value. Based on this, it can figure out the constant
+ * return value, and calling code works while still being "generic" to
+ * IPv4 and IPv6.
+ */
+static uint64_t fill_tuple(struct bpf_sock_tuple *tuple, void *iph,
+				    uint64_t iphlen, uint16_t sport, uint16_t dport)
+{
+	switch (iphlen) {
+	case sizeof(struct iphdr): {
+		struct iphdr *ipv4 = (struct iphdr *)iph;
+		tuple->ipv4.daddr = ipv4->daddr;
+		tuple->ipv4.saddr = ipv4->saddr;
+		tuple->ipv4.sport = sport;
+		tuple->ipv4.dport = dport;
+		return sizeof(tuple->ipv4);
+	}
+
+	case sizeof(struct ipv6hdr): {
+		struct ipv6hdr *ipv6 = (struct ipv6hdr *)iph;
+		memcpy(&tuple->ipv6.daddr, &ipv6->daddr,
+		       sizeof(tuple->ipv6.daddr));
+		memcpy(&tuple->ipv6.saddr, &ipv6->saddr,
+		       sizeof(tuple->ipv6.saddr));
+		tuple->ipv6.sport = sport;
+		tuple->ipv6.dport = dport;
+		return sizeof(tuple->ipv6);
+	}
+
+	default:
+		return 0;
+	}
+}
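+
+/* e.g. process_tcp() calls
+ *
+ *	fill_tuple(&tuple, info->hdr, info->len, tcp.source, tcp.dest);
+ *
+ * where info->len is a compile-time sizeof, so the verifier sees the
+ * returned tuple length as a known constant.
+ */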
+
+static verdict_t classify_tcp(struct __sk_buff *skb, struct bpf_sock_tuple *tuple,
+			      uint64_t tuplen, void *iph, struct tcphdr *tcp)
+{
+	struct bpf_sock *sk =
+		bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
+
+	if (sk == NULL)
+		return UNKNOWN;
+
+	if (sk->state != BPF_TCP_LISTEN) {
+		bpf_sk_release(sk);
+		return ESTABLISHED;
+	}
+
+	if (iph != NULL && tcp != NULL) {
+		/* Kludge: we've run out of arguments, but need the length of the ip header. */
+		uint64_t iphlen = sizeof(struct iphdr);
+
+		if (tuplen == sizeof(tuple->ipv6))
+			iphlen = sizeof(struct ipv6hdr);
+
+		if (bpf_tcp_check_syncookie(sk, iph, iphlen, tcp,
+					    sizeof(*tcp)) == 0) {
+			bpf_sk_release(sk);
+			return SYN_COOKIE;
+		}
+	}
+
+	bpf_sk_release(sk);
+	return UNKNOWN;
+}
+
+static verdict_t classify_udp(struct __sk_buff *skb, struct bpf_sock_tuple *tuple, uint64_t tuplen)
+{
+	struct bpf_sock *sk =
+		bpf_sk_lookup_udp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
+
+	if (sk == NULL)
+		return UNKNOWN;
+
+	if (sk->state == BPF_TCP_ESTABLISHED) {
+		bpf_sk_release(sk);
+		return ESTABLISHED;
+	}
+
+	bpf_sk_release(sk);
+	return UNKNOWN;
+}
+
+static verdict_t classify_icmp(struct __sk_buff *skb, uint8_t proto, struct bpf_sock_tuple *tuple,
+			       uint64_t tuplen, metrics_t *metrics)
+{
+	switch (proto) {
+	case IPPROTO_TCP:
+		return classify_tcp(skb, tuple, tuplen, NULL, NULL);
+
+	case IPPROTO_UDP:
+		return classify_udp(skb, tuple, tuplen);
+
+	default:
+		metrics->errors_total_malformed_icmp++;
+		return INVALID;
+	}
+}
+
+static verdict_t process_icmpv4(struct __sk_buff *skb, struct bpf_dynptr *dynptr, __u64 *offset,
+				metrics_t *metrics)
+{
+	struct icmphdr icmp;
+	struct iphdr ipv4;
+
+	if (bpf_dynptr_read(&icmp, sizeof(icmp), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_icmp++;
+		return INVALID;
+	}
+
+	*offset += sizeof(icmp);
+
+	/* We should never receive encapsulated echo replies. */
+	if (icmp.type == ICMP_ECHOREPLY) {
+		metrics->errors_total_icmp_echo_replies++;
+		return INVALID;
+	}
+
+	if (icmp.type == ICMP_ECHO)
+		return ECHO_REQUEST;
+
+	if (icmp.type != ICMP_DEST_UNREACH || icmp.code != ICMP_FRAG_NEEDED) {
+		metrics->errors_total_unwanted_icmp++;
+		return INVALID;
+	}
+
+	if (pkt_parse_ipv4(dynptr, offset, &ipv4)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	/* The source address in the outer IP header is from the entity that
+	 * originated the ICMP message. Use the original IP header to restore
+	 * the correct flow tuple.
+	 */
+	struct bpf_sock_tuple tuple;
+	tuple.ipv4.saddr = ipv4.daddr;
+	tuple.ipv4.daddr = ipv4.saddr;
+
+	if (!pkt_parse_icmp_l4_ports(dynptr, offset, (flow_ports_t *)&tuple.ipv4.sport)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	return classify_icmp(skb, ipv4.protocol, &tuple,
+			     sizeof(tuple.ipv4), metrics);
+}
+
+static verdict_t process_icmpv6(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb,
+				metrics_t *metrics)
+{
+	struct bpf_sock_tuple tuple;
+	struct ipv6hdr ipv6;
+	struct icmp6hdr icmp6;
+	bool is_fragment;
+	uint8_t l4_proto;
+
+	if (bpf_dynptr_read(&icmp6, sizeof(icmp6), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_icmp++;
+		return INVALID;
+	}
+
+	*offset += sizeof(icmp6);
+
+	/* We should never receive encapsulated echo replies. */
+	if (icmp6.icmp6_type == ICMPV6_ECHO_REPLY) {
+		metrics->errors_total_icmp_echo_replies++;
+		return INVALID;
+	}
+
+	if (icmp6.icmp6_type == ICMPV6_ECHO_REQUEST) {
+		return ECHO_REQUEST;
+	}
+
+	if (icmp6.icmp6_type != ICMPV6_PKT_TOOBIG) {
+		metrics->errors_total_unwanted_icmp++;
+		return INVALID;
+	}
+
+	if (pkt_parse_ipv6(dynptr, offset, &ipv6, &l4_proto, &is_fragment)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	if (is_fragment) {
+		metrics->errors_total_fragmented_ip++;
+		return INVALID;
+	}
+
+	/* Swap source and dest addresses. */
+	memcpy(&tuple.ipv6.saddr, &ipv6.daddr, sizeof(tuple.ipv6.saddr));
+	memcpy(&tuple.ipv6.daddr, &ipv6.saddr, sizeof(tuple.ipv6.daddr));
+
+	if (!pkt_parse_icmp_l4_ports(dynptr, offset, (flow_ports_t *)&tuple.ipv6.sport)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	return classify_icmp(skb, l4_proto, &tuple, sizeof(tuple.ipv6),
+			     metrics);
+}
+
+static verdict_t process_tcp(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb,
+			     struct iphdr_info *info, metrics_t *metrics)
+{
+	struct bpf_sock_tuple tuple;
+	struct tcphdr tcp;
+	uint64_t tuplen;
+
+	metrics->l4_protocol_packets_total_tcp++;
+
+	if (bpf_dynptr_read(&tcp, sizeof(tcp), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_tcp++;
+		return INVALID;
+	}
+
+	*offset += sizeof(tcp);
+
+	if (tcp.syn)
+		return SYN;
+
+	tuplen = fill_tuple(&tuple, info->hdr, info->len, tcp.source, tcp.dest);
+	return classify_tcp(skb, &tuple, tuplen, info->hdr, &tcp);
+}
+
+static verdict_t process_udp(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb,
+			     struct iphdr_info *info, metrics_t *metrics)
+{
+	struct bpf_sock_tuple tuple;
+	struct udphdr udph;
+	uint64_t tuplen;
+
+	metrics->l4_protocol_packets_total_udp++;
+
+	if (bpf_dynptr_read(&udph, sizeof(udph), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_udp++;
+		return INVALID;
+	}
+	*offset += sizeof(udph);
+
+	tuplen = fill_tuple(&tuple, info->hdr, info->len, udph.source, udph.dest);
+	return classify_udp(skb, &tuple, tuplen);
+}
+
+static verdict_t process_ipv4(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+			      __u64 *offset, metrics_t *metrics)
+{
+	struct iphdr ipv4;
+	struct iphdr_info info = {
+		.hdr = &ipv4,
+		.len = sizeof(ipv4),
+	};
+
+	metrics->l3_protocol_packets_total_ipv4++;
+
+	if (pkt_parse_ipv4(dynptr, offset, &ipv4)) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (ipv4.version != 4) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (ipv4_is_fragment(&ipv4)) {
+		metrics->errors_total_fragmented_ip++;
+		return INVALID;
+	}
+
+	switch (ipv4.protocol) {
+	case IPPROTO_ICMP:
+		return process_icmpv4(skb, dynptr, offset, metrics);
+
+	case IPPROTO_TCP:
+		return process_tcp(dynptr, offset, skb, &info, metrics);
+
+	case IPPROTO_UDP:
+		return process_udp(dynptr, offset, skb, &info, metrics);
+
+	default:
+		metrics->errors_total_unknown_l4_proto++;
+		return INVALID;
+	}
+}
+
+static verdict_t process_ipv6(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+			      __u64 *offset, metrics_t *metrics)
+{
+	struct ipv6hdr ipv6;
+	struct iphdr_info info = {
+		.hdr = &ipv6,
+		.len = sizeof(ipv6),
+	};
+	uint8_t l4_proto;
+	bool is_fragment;
+
+	metrics->l3_protocol_packets_total_ipv6++;
+
+	if (pkt_parse_ipv6(dynptr, offset, &ipv6, &l4_proto, &is_fragment)) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (ipv6.version != 6) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (is_fragment) {
+		metrics->errors_total_fragmented_ip++;
+		return INVALID;
+	}
+
+	switch (l4_proto) {
+	case IPPROTO_ICMPV6:
+		return process_icmpv6(dynptr, offset, skb, metrics);
+
+	case IPPROTO_TCP:
+		return process_tcp(dynptr, offset, skb, &info, metrics);
+
+	case IPPROTO_UDP:
+		return process_udp(dynptr, offset, skb, &info, metrics);
+
+	default:
+		metrics->errors_total_unknown_l4_proto++;
+		return INVALID;
+	}
+}
+
+SEC("tc")
+int cls_redirect(struct __sk_buff *skb)
+{
+	struct bpf_dynptr dynptr;
+	struct in_addr next_hop;
+	/* Tracks the current parse offset into the dynptr. This will be
+	 * unnecessary once bpf_dynptr_advance() is available.
+	 */
+	__u64 off = 0;
+
+	bpf_dynptr_from_skb(skb, 0, &dynptr);
+
+	metrics_t *metrics = get_global_metrics();
+	if (metrics == NULL)
+		return TC_ACT_SHOT;
+
+	metrics->processed_packets_total++;
+
+	/* Pass bogus packets as long as we're not sure they're
+	 * destined for us.
+	 */
+	if (skb->protocol != bpf_htons(ETH_P_IP))
+		return TC_ACT_OK;
+
+	encap_headers_t *encap;
+
+	/* Make sure that all encapsulation headers are available in
+	 * the linear portion of the skb. This makes it easy to manipulate them.
+	 */
+	if (bpf_skb_pull_data(skb, sizeof(*encap)))
+		return TC_ACT_OK;
+
+	encap = bpf_dynptr_data(&dynptr, 0, sizeof(*encap));
+	if (!encap)
+		return TC_ACT_OK;
+
+	off += sizeof(*encap);
+
+	if (encap->ip.ihl != 5)
+		/* We never have any options. */
+		return TC_ACT_OK;
+
+	if (encap->ip.daddr != ENCAPSULATION_IP ||
+	    encap->ip.protocol != IPPROTO_UDP)
+		return TC_ACT_OK;
+
+	/* TODO Check UDP length? */
+	if (encap->udp.dest != ENCAPSULATION_PORT)
+		return TC_ACT_OK;
+
+	/* We now know that the packet is destined to us, so we can
+	 * drop bogus ones.
+	 */
+	if (ipv4_is_fragment((void *)&encap->ip)) {
+		metrics->errors_total_fragmented_ip++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.variant != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.control != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.flags != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.hlen !=
+	    sizeof(encap->unigue) / 4 + encap->unigue.hop_count) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->unigue.version != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->unigue.reserved != 0)
+		return TC_ACT_SHOT;
+
+	MAYBE_RETURN(get_next_hop(&dynptr, &off, encap, &next_hop));
+
+	if (next_hop.s_addr == 0) {
+		metrics->accepted_packets_total_last_hop++;
+		return accept_locally(skb, encap);
+	}
+
+	verdict_t verdict;
+	switch (encap->gue.proto_ctype) {
+	case IPPROTO_IPIP:
+		verdict = process_ipv4(skb, &dynptr, &off, metrics);
+		break;
+
+	case IPPROTO_IPV6:
+		verdict = process_ipv6(skb, &dynptr, &off, metrics);
+		break;
+
+	default:
+		metrics->errors_total_unknown_l3_proto++;
+		return TC_ACT_SHOT;
+	}
+
+	switch (verdict) {
+	case INVALID:
+		/* metrics have already been bumped */
+		return TC_ACT_SHOT;
+
+	case UNKNOWN:
+		return forward_to_next_hop(skb, &dynptr, encap, &next_hop, metrics);
+
+	case ECHO_REQUEST:
+		metrics->accepted_packets_total_icmp_echo_request++;
+		break;
+
+	case SYN:
+		if (encap->unigue.forward_syn) {
+			return forward_to_next_hop(skb, &dynptr, encap, &next_hop,
+						   metrics);
+		}
+
+		metrics->accepted_packets_total_syn++;
+		break;
+
+	case SYN_COOKIE:
+		metrics->accepted_packets_total_syn_cookies++;
+		break;
+
+	case ESTABLISHED:
+		metrics->accepted_packets_total_established++;
+		break;
+	}
+
+	return accept_locally(skb, encap);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
new file mode 100644
index 000000000000..714b99e2d8b6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
@@ -0,0 +1,468 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Facebook
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/pkt_cls.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/icmp.h>
+#include <linux/icmpv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <bpf/bpf_helpers.h>
+#include "test_iptunnel_common.h"
+#include <bpf/bpf_endian.h>
+
+static __always_inline __u32 rol32(__u32 word, unsigned int shift)
+{
+	return (word << shift) | (word >> ((-shift) & 31));
+}
+
+/* copy paste of jhash from kernel sources to make sure llvm
+ * can compile it into valid sequence of bpf instructions
+ */
+#define __jhash_mix(a, b, c)			\
+{						\
+	a -= c;  a ^= rol32(c, 4);  c += b;	\
+	b -= a;  b ^= rol32(a, 6);  a += c;	\
+	c -= b;  c ^= rol32(b, 8);  b += a;	\
+	a -= c;  a ^= rol32(c, 16); c += b;	\
+	b -= a;  b ^= rol32(a, 19); a += c;	\
+	c -= b;  c ^= rol32(b, 4);  b += a;	\
+}
+
+#define __jhash_final(a, b, c)			\
+{						\
+	c ^= b; c -= rol32(b, 14);		\
+	a ^= c; a -= rol32(c, 11);		\
+	b ^= a; b -= rol32(a, 25);		\
+	c ^= b; c -= rol32(b, 16);		\
+	a ^= c; a -= rol32(c, 4);		\
+	b ^= a; b -= rol32(a, 14);		\
+	c ^= b; c -= rol32(b, 24);		\
+}
+
+#define JHASH_INITVAL		0xdeadbeef
+
+typedef unsigned int u32;
+
+static __noinline u32 jhash(const void *key, u32 length, u32 initval)
+{
+	u32 a, b, c;
+	const unsigned char *k = key;
+
+	a = b = c = JHASH_INITVAL + length + initval;
+
+	while (length > 12) {
+		a += *(u32 *)(k);
+		b += *(u32 *)(k + 4);
+		c += *(u32 *)(k + 8);
+		__jhash_mix(a, b, c);
+		length -= 12;
+		k += 12;
+	}
+	switch (length) {
+	case 12: c += (u32)k[11]<<24;
+	case 11: c += (u32)k[10]<<16;
+	case 10: c += (u32)k[9]<<8;
+	case 9:  c += k[8];
+	case 8:  b += (u32)k[7]<<24;
+	case 7:  b += (u32)k[6]<<16;
+	case 6:  b += (u32)k[5]<<8;
+	case 5:  b += k[4];
+	case 4:  a += (u32)k[3]<<24;
+	case 3:  a += (u32)k[2]<<16;
+	case 2:  a += (u32)k[1]<<8;
+	case 1:  a += k[0];
+		 __jhash_final(a, b, c);
+	case 0: /* Nothing left to add */
+		break;
+	}
+
+	return c;
+}
+
+static __noinline u32 __jhash_nwords(u32 a, u32 b, u32 c, u32 initval)
+{
+	a += initval;
+	b += initval;
+	c += initval;
+	__jhash_final(a, b, c);
+	return c;
+}
+
+static __noinline u32 jhash_2words(u32 a, u32 b, u32 initval)
+{
+	return __jhash_nwords(a, b, 0, initval + JHASH_INITVAL + (2 << 2));
+}
+
+#define PCKT_FRAGMENTED 65343
+#define IPV4_HDR_LEN_NO_OPT 20
+#define IPV4_PLUS_ICMP_HDR 28
+#define IPV6_PLUS_ICMP_HDR 48
+#define RING_SIZE 2
+#define MAX_VIPS 12
+#define MAX_REALS 5
+#define CTL_MAP_SIZE 16
+#define CH_RINGS_SIZE (MAX_VIPS * RING_SIZE)
+#define F_IPV6 (1 << 0)
+#define F_HASH_NO_SRC_PORT (1 << 0)
+#define F_ICMP (1 << 0)
+#define F_SYN_SET (1 << 1)
+
+struct packet_description {
+	union {
+		__be32 src;
+		__be32 srcv6[4];
+	};
+	union {
+		__be32 dst;
+		__be32 dstv6[4];
+	};
+	union {
+		__u32 ports;
+		__u16 port16[2];
+	};
+	__u8 proto;
+	__u8 flags;
+};
+
+struct ctl_value {
+	union {
+		__u64 value;
+		__u32 ifindex;
+		__u8 mac[6];
+	};
+};
+
+struct vip_meta {
+	__u32 flags;
+	__u32 vip_num;
+};
+
+struct real_definition {
+	union {
+		__be32 dst;
+		__be32 dstv6[4];
+	};
+	__u8 flags;
+};
+
+struct vip_stats {
+	__u64 bytes;
+	__u64 pkts;
+};
+
+struct eth_hdr {
+	unsigned char eth_dest[ETH_ALEN];
+	unsigned char eth_source[ETH_ALEN];
+	unsigned short eth_proto;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_VIPS);
+	__type(key, struct vip);
+	__type(value, struct vip_meta);
+} vip_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, CH_RINGS_SIZE);
+	__type(key, __u32);
+	__type(value, __u32);
+} ch_rings SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, MAX_REALS);
+	__type(key, __u32);
+	__type(value, struct real_definition);
+} reals SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, MAX_VIPS);
+	__type(key, __u32);
+	__type(value, struct vip_stats);
+} stats SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, CTL_MAP_SIZE);
+	__type(key, __u32);
+	__type(value, struct ctl_value);
+} ctl_array SEC(".maps");
+
+static __noinline __u32 get_packet_hash(struct packet_description *pckt, bool ipv6)
+{
+	if (ipv6)
+		return jhash_2words(jhash(pckt->srcv6, 16, MAX_VIPS),
+				    pckt->ports, CH_RINGS_SIZE);
+	else
+		return jhash_2words(pckt->src, pckt->ports, CH_RINGS_SIZE);
+}
+
+static __noinline bool get_packet_dst(struct real_definition **real,
+				      struct packet_description *pckt,
+				      struct vip_meta *vip_info,
+				      bool is_ipv6)
+{
+	__u32 hash = get_packet_hash(pckt, is_ipv6);
+	__u32 key = RING_SIZE * vip_info->vip_num + hash % RING_SIZE;
+	__u32 *real_pos;
+
+	if (hash != 0x358459b7 /* jhash of ipv4 packet */  &&
+	    hash != 0x2f4bc6bb /* jhash of ipv6 packet */)
+		return false;
+
+	real_pos = bpf_map_lookup_elem(&ch_rings, &key);
+	if (!real_pos)
+		return false;
+	key = *real_pos;
+	*real = bpf_map_lookup_elem(&reals, &key);
+	if (!(*real))
+		return false;
+	return true;
+}
+
+static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off,
+				   struct packet_description *pckt)
+{
+	struct icmp6hdr *icmp_hdr;
+	struct ipv6hdr *ip6h;
+
+	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr));
+	if (!icmp_hdr)
+		return TC_ACT_SHOT;
+
+	if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
+		return TC_ACT_OK;
+	off += sizeof(struct icmp6hdr);
+	ip6h = (struct ipv6hdr *)bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h));
+	if (!ip6h)
+		return TC_ACT_SHOT;
+	pckt->proto = ip6h->nexthdr;
+	pckt->flags |= F_ICMP;
+	memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16);
+	memcpy(pckt->dstv6, ip6h->saddr.s6_addr32, 16);
+	return TC_ACT_UNSPEC;
+}
+
+static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off,
+				 struct packet_description *pckt)
+{
+	struct icmphdr *icmp_hdr;
+	struct iphdr *iph;
+
+	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr));
+	if (!icmp_hdr)
+		return TC_ACT_SHOT;
+	if (icmp_hdr->type != ICMP_DEST_UNREACH ||
+	    icmp_hdr->code != ICMP_FRAG_NEEDED)
+		return TC_ACT_OK;
+	off += sizeof(struct icmphdr);
+	iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph));
+	if (!iph || iph->ihl != 5)
+		return TC_ACT_SHOT;
+	pckt->proto = iph->protocol;
+	pckt->flags |= F_ICMP;
+	pckt->src = iph->daddr;
+	pckt->dst = iph->saddr;
+	return TC_ACT_UNSPEC;
+}
+
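+/* For packets embedded in ICMP errors (F_ICMP is set), the inner L4 header
+ * describes the original egress packet, so parse_udp() and parse_tcp()
+ * below store the ports swapped.
+ */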
+static __noinline bool parse_udp(struct bpf_dynptr *skb_ptr, __u64 off,
+				 struct packet_description *pckt)
+{
+	struct udphdr *udp;
+
+	udp = bpf_dynptr_data(skb_ptr, off, sizeof(*udp));
+	if (!udp)
+		return false;
+
+	if (!(pckt->flags & F_ICMP)) {
+		pckt->port16[0] = udp->source;
+		pckt->port16[1] = udp->dest;
+	} else {
+		pckt->port16[0] = udp->dest;
+		pckt->port16[1] = udp->source;
+	}
+	return true;
+}
+
+static __noinline bool parse_tcp(struct bpf_dynptr *skb_ptr, __u64 off,
+				 struct packet_description *pckt)
+{
+	struct tcphdr *tcp;
+
+	tcp = bpf_dynptr_data(skb_ptr, off, sizeof(*tcp));
+	if (!tcp)
+		return false;
+
+	if (tcp->syn)
+		pckt->flags |= F_SYN_SET;
+
+	if (!(pckt->flags & F_ICMP)) {
+		pckt->port16[0] = tcp->source;
+		pckt->port16[1] = tcp->dest;
+	} else {
+		pckt->port16[0] = tcp->dest;
+		pckt->port16[1] = tcp->source;
+	}
+	return true;
+}
+
+static __noinline int process_packet(struct bpf_dynptr *skb_ptr,
+				     struct eth_hdr *eth, __u64 off,
+				     bool is_ipv6, struct __sk_buff *skb)
+{
+	struct packet_description pckt = {};
+	struct bpf_tunnel_key tkey = {};
+	struct vip_stats *data_stats;
+	struct real_definition *dst;
+	struct vip_meta *vip_info;
+	struct ctl_value *cval;
+	__u32 v4_intf_pos = 1;
+	__u32 v6_intf_pos = 2;
+	struct ipv6hdr *ip6h;
+	struct vip vip = {};
+	struct iphdr *iph;
+	int tun_flag = 0;
+	__u16 pkt_bytes;
+	__u64 iph_len;
+	__u32 ifindex;
+	__u8 protocol;
+	__u32 vip_num;
+	int action;
+
+	tkey.tunnel_ttl = 64;
+	if (is_ipv6) {
+		ip6h = bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h));
+		if (!ip6h)
+			return TC_ACT_SHOT;
+
+		iph_len = sizeof(struct ipv6hdr);
+		protocol = ip6h->nexthdr;
+		pckt.proto = protocol;
+		pkt_bytes = bpf_ntohs(ip6h->payload_len);
+		off += iph_len;
+		if (protocol == IPPROTO_FRAGMENT) {
+			return TC_ACT_SHOT;
+		} else if (protocol == IPPROTO_ICMPV6) {
+			action = parse_icmpv6(skb_ptr, off, &pckt);
+			if (action >= 0)
+				return action;
+			off += IPV6_PLUS_ICMP_HDR;
+		} else {
+			memcpy(pckt.srcv6, ip6h->saddr.s6_addr32, 16);
+			memcpy(pckt.dstv6, ip6h->daddr.s6_addr32, 16);
+		}
+	} else {
+		iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph));
+		if (!iph || iph->ihl != 5)
+			return TC_ACT_SHOT;
+
+		protocol = iph->protocol;
+		pckt.proto = protocol;
+		pkt_bytes = bpf_ntohs(iph->tot_len);
+		off += IPV4_HDR_LEN_NO_OPT;
+
+		if (iph->frag_off & PCKT_FRAGMENTED)
+			return TC_ACT_SHOT;
+		if (protocol == IPPROTO_ICMP) {
+			action = parse_icmp(skb_ptr, off, &pckt);
+			if (action >= 0)
+				return action;
+			off += IPV4_PLUS_ICMP_HDR;
+		} else {
+			pckt.src = iph->saddr;
+			pckt.dst = iph->daddr;
+		}
+	}
+	protocol = pckt.proto;
+
+	if (protocol == IPPROTO_TCP) {
+		if (!parse_tcp(skb_ptr, off, &pckt))
+			return TC_ACT_SHOT;
+	} else if (protocol == IPPROTO_UDP) {
+		if (!parse_udp(skb_ptr, off, &pckt))
+			return TC_ACT_SHOT;
+	} else {
+		return TC_ACT_SHOT;
+	}
+
+	if (is_ipv6)
+		memcpy(vip.daddr.v6, pckt.dstv6, 16);
+	else
+		vip.daddr.v4 = pckt.dst;
+
+	vip.dport = pckt.port16[1];
+	vip.protocol = pckt.proto;
+	vip_info = bpf_map_lookup_elem(&vip_map, &vip);
+	if (!vip_info) {
+		vip.dport = 0;
+		vip_info = bpf_map_lookup_elem(&vip_map, &vip);
+		if (!vip_info)
+			return TC_ACT_SHOT;
+		pckt.port16[1] = 0;
+	}
+
+	if (vip_info->flags & F_HASH_NO_SRC_PORT)
+		pckt.port16[0] = 0;
+
+	if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
+		return TC_ACT_SHOT;
+
+	if (dst->flags & F_IPV6) {
+		cval = bpf_map_lookup_elem(&ctl_array, &v6_intf_pos);
+		if (!cval)
+			return TC_ACT_SHOT;
+		ifindex = cval->ifindex;
+		memcpy(tkey.remote_ipv6, dst->dstv6, 16);
+		tun_flag = BPF_F_TUNINFO_IPV6;
+	} else {
+		cval = bpf_map_lookup_elem(&ctl_array, &v4_intf_pos);
+		if (!cval)
+			return TC_ACT_SHOT;
+		ifindex = cval->ifindex;
+		tkey.remote_ipv4 = dst->dst;
+	}
+	vip_num = vip_info->vip_num;
+	data_stats = bpf_map_lookup_elem(&stats, &vip_num);
+	if (!data_stats)
+		return TC_ACT_SHOT;
+	data_stats->pkts++;
+	data_stats->bytes += pkt_bytes;
+	bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), tun_flag);
+	*(u32 *)eth->eth_dest = tkey.remote_ipv4;
+	return bpf_redirect(ifindex, 0);
+}
+
+SEC("tc")
+int balancer_ingress(struct __sk_buff *ctx)
+{
+	struct bpf_dynptr ptr;
+	struct eth_hdr *eth;
+	__u32 eth_proto;
+	__u32 nh_off;
+
+	nh_off = sizeof(struct eth_hdr);
+
+	bpf_dynptr_from_skb(ctx, 0, &ptr);
+	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
+	if (!eth)
+		return TC_ACT_SHOT;
+	eth_proto = eth->eth_proto;
+	if (eth_proto == bpf_htons(ETH_P_IP))
+		return process_packet(&ptr, eth, nh_off, false, ctx);
+	else if (eth_proto == bpf_htons(ETH_P_IPV6))
+		return process_packet(&ptr, eth, nh_off, true, ctx);
+	else
+		return TC_ACT_SHOT;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
new file mode 100644
index 000000000000..79bab9b50e9e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* This parsing logic is taken from the open source library katran, a layer 4
+ * load balancer.
+ *
+ * The equivalent logic using dynptrs can be found in test_parse_tcp_hdr_opt_dynptr.c.
+ *
+ * https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/tcp.h>
+#include <stdbool.h>
+#include <linux/ipv6.h>
+#include <linux/if_ether.h>
+#include "test_tcp_hdr_options.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* Kind number used for experiments */
+const __u32 tcp_hdr_opt_kind_tpr = 0xFD;
+/* Length of the tcp header option */
+const __u32 tcp_hdr_opt_len_tpr = 6;
+/* maximum number of header options to check to look up server_id */
+const __u32 tcp_hdr_opt_max_opt_checks = 15;
+
+__u32 server_id;
+
+struct hdr_opt_state {
+	__u32 server_id;
+	__u8 byte_offset;
+	__u8 hdr_bytes_remaining;
+};
+
+static int parse_hdr_opt(const struct xdp_md *xdp, struct hdr_opt_state *state)
+{
+	const void *data = (void *)(long)xdp->data;
+	const void *data_end = (void *)(long)xdp->data_end;
+	__u8 *tcp_opt, kind, hdr_len;
+
+	tcp_opt = (__u8 *)(data + state->byte_offset);
+	if (tcp_opt + 1 > data_end)
+		return -1;
+
+	kind = tcp_opt[0];
+
+	if (kind == TCPOPT_EOL)
+		return -1;
+
+	if (kind == TCPOPT_NOP) {
+		state->hdr_bytes_remaining--;
+		state->byte_offset++;
+		return 0;
+	}
+
+	if (state->hdr_bytes_remaining < 2 ||
+	    tcp_opt + sizeof(__u8) + sizeof(__u8) > data_end)
+		return -1;
+
+	hdr_len = tcp_opt[1];
+	if (hdr_len > state->hdr_bytes_remaining)
+		return -1;
+
+	if (kind == tcp_hdr_opt_kind_tpr) {
+		if (hdr_len != tcp_hdr_opt_len_tpr)
+			return -1;
+
+		if (tcp_opt + tcp_hdr_opt_len_tpr > data_end)
+			return -1;
+
+		state->server_id = *(__u32 *)&tcp_opt[2];
+		return 1;
+	}
+
+	state->hdr_bytes_remaining -= hdr_len;
+	state->byte_offset += hdr_len;
+	return 0;
+}
+
+SEC("xdp")
+int xdp_ingress_v6(struct xdp_md *xdp)
+{
+	const void *data = (void *)(long)xdp->data;
+	const void *data_end = (void *)(long)xdp->data_end;
+	struct hdr_opt_state opt_state = {};
+	__u8 tcp_hdr_opt_len = 0;
+	struct tcphdr *tcp_hdr;
+	__u64 tcp_offset = 0;
+	__u32 off;
+	int err;
+
+	tcp_offset = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
+	tcp_hdr = (struct tcphdr *)(data + tcp_offset);
+	if (tcp_hdr + 1 > data_end)
+		return XDP_DROP;
+
+	tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
+	if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
+		return XDP_DROP;
+
+	opt_state.hdr_bytes_remaining = tcp_hdr_opt_len;
+	opt_state.byte_offset = sizeof(struct tcphdr) + tcp_offset;
+
+	/* TCP header options occupy at most 40 bytes */
+	for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
+		err = parse_hdr_opt(xdp, &opt_state);
+
+		if (err || !opt_state.hdr_bytes_remaining)
+			break;
+	}
+
+	if (!opt_state.server_id)
+		return XDP_DROP;
+
+	server_id = opt_state.server_id;
+
+	return XDP_PASS;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
new file mode 100644
index 000000000000..0b97a6708fc5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* This packet-parsing logic is lifted from a real-world use case: the open
+ * source library katran, a layer 4 load balancer.
+ *
+ * This test demonstrates how to parse packet contents using dynptrs. The
+ * original code (parsing without dynptrs) can be found in test_parse_tcp_hdr_opt.c
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/tcp.h>
+#include <stdbool.h>
+#include <linux/ipv6.h>
+#include <linux/if_ether.h>
+#include "test_tcp_hdr_options.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* Kind number used for experiments */
+const __u32 tcp_hdr_opt_kind_tpr = 0xFD;
+/* Length of the tcp header option */
+const __u32 tcp_hdr_opt_len_tpr = 6;
+/* maximum number of header options to check to look up server_id */
+const __u32 tcp_hdr_opt_max_opt_checks = 15;
+
+__u32 server_id;
+
+static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining,
+			 __u32 *server_id)
+{
+	__u8 *tcp_opt, kind, hdr_len;
+	__u8 *data;
+
+	data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) +
+			       sizeof(*server_id));
+	if (!data)
+		return -1;
+
+	kind = data[0];
+
+	if (kind == TCPOPT_EOL)
+		return -1;
+
+	if (kind == TCPOPT_NOP) {
+		*off += 1;
+		*hdr_bytes_remaining -= 1;
+		return 0;
+	}
+
+	if (*hdr_bytes_remaining < 2)
+		return -1;
+
+	hdr_len = data[1];
+	if (hdr_len > *hdr_bytes_remaining)
+		return -1;
+
+	if (kind == tcp_hdr_opt_kind_tpr) {
+		if (hdr_len != tcp_hdr_opt_len_tpr)
+			return -1;
+
+		__builtin_memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
+		return 1;
+	}
+
+	*off += hdr_len;
+	*hdr_bytes_remaining -= hdr_len;
+	return 0;
+}
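+
+/* Note that the single 6-byte bpf_dynptr_data() slice above (kind + length
+ * + server id) replaces the three separate data_end bounds checks that the
+ * non-dynptr version of this parser needs.
+ */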
+
+SEC("xdp")
+int xdp_ingress_v6(struct xdp_md *xdp)
+{
+	__u8 hdr_bytes_remaining;
+	struct tcphdr *tcp_hdr;
+	__u8 tcp_hdr_opt_len;
+	int err = 0;
+	__u32 off;
+
+	struct bpf_dynptr ptr;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+
+	off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
+
+	tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr));
+	if (!tcp_hdr)
+		return XDP_DROP;
+
+	tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
+	if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
+		return XDP_DROP;
+
+	hdr_bytes_remaining = tcp_hdr_opt_len;
+
+	off += sizeof(struct tcphdr);
+
+	/* TCP header options occupy at most 40 bytes */
+	for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
+		err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id);
+
+		if (err || !hdr_bytes_remaining)
+			break;
+	}
+
+	if (!server_id)
+		return XDP_DROP;
+
+	return XDP_PASS;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c b/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c
new file mode 100644
index 000000000000..f9e1c36f9471
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c
@@ -0,0 +1,240 @@
+/* Copyright (c) 2022 Meta
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/udp.h>
+#include <linux/tcp.h>
+#include <linux/pkt_cls.h>
+#include <sys/socket.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "test_iptunnel_common.h"
+
+const size_t tcphdr_sz = sizeof(struct tcphdr);
+const size_t udphdr_sz = sizeof(struct udphdr);
+const size_t ethhdr_sz = sizeof(struct ethhdr);
+const size_t iphdr_sz = sizeof(struct iphdr);
+const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, 256);
+	__type(key, __u32);
+	__type(value, __u64);
+} rxcnt SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_IPTNL_ENTRIES);
+	__type(key, struct vip);
+	__type(value, struct iptnl_info);
+} vip2tnl SEC(".maps");
+
+static __always_inline void count_tx(__u32 protocol)
+{
+	__u64 *rxcnt_count;
+
+	rxcnt_count = bpf_map_lookup_elem(&rxcnt, &protocol);
+	if (rxcnt_count)
+		*rxcnt_count += 1;
+}
+
+static __always_inline int get_dport(void *trans_data, __u8 protocol)
+{
+	struct tcphdr *th;
+	struct udphdr *uh;
+
+	switch (protocol) {
+	case IPPROTO_TCP:
+		th = (struct tcphdr *)trans_data;
+		return th->dest;
+	case IPPROTO_UDP:
+		uh = (struct udphdr *)trans_data;
+		return uh->dest;
+	default:
+		return 0;
+	}
+}
+
+static __always_inline void set_ethhdr(struct ethhdr *new_eth,
+				       const struct ethhdr *old_eth,
+				       const struct iptnl_info *tnl,
+				       __be16 h_proto)
+{
+	memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source));
+	memcpy(new_eth->h_dest, tnl->dmac, sizeof(new_eth->h_dest));
+	new_eth->h_proto = h_proto;
+}
+
+static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
+{
+	struct bpf_dynptr new_xdp_ptr;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	__u32 transport_hdr_sz;
+	struct iphdr *iph;
+	__u16 *next_iph;
+	__u16 payload_len;
+	struct vip vip = {};
+	int dport;
+	__u32 csum = 0;
+	int i;
+
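+	/* Request only a UDP-sized slice when the packet is too short to
+	 * hold a TCP header, so that the bpf_dynptr_data() call below also
+	 * succeeds for minimal UDP packets.
+	 */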
+	if (ethhdr_sz + iphdr_sz + tcphdr_sz > xdp->data_end - xdp->data)
+		transport_hdr_sz = udphdr_sz;
+	else
+		transport_hdr_sz = tcphdr_sz;
+
+	iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz, iphdr_sz + transport_hdr_sz);
+	if (!iph)
+		return XDP_DROP;
+
+	dport = get_dport(iph + 1, iph->protocol);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = iph->protocol;
+	vip.family = AF_INET;
+	vip.daddr.v4 = iph->daddr;
+	vip.dport = dport;
+	payload_len = bpf_ntohs(iph->tot_len);
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v4-in-v4 */
+	if (!tnl || tnl->family != AF_INET)
+		return XDP_PASS;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)iphdr_sz))
+		return XDP_DROP;
+
+	bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
+	new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + iphdr_sz + ethhdr_sz);
+	if (!new_eth)
+		return XDP_DROP;
+
+	iph = (struct iphdr *)(new_eth + 1);
+	old_eth = (struct ethhdr *)(iph + 1);
+
+	set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IP));
+
+	iph->version = 4;
+	iph->ihl = iphdr_sz >> 2;
+	iph->frag_off = 0;
+	iph->protocol = IPPROTO_IPIP;
+	iph->check = 0;
+	iph->tos = 0;
+	iph->tot_len = bpf_htons(payload_len + iphdr_sz);
+	iph->daddr = tnl->daddr.v4;
+	iph->saddr = tnl->saddr.v4;
+	iph->ttl = 8;
+
+	next_iph = (__u16 *)iph;
+#pragma clang loop unroll(full)
+	for (i = 0; i < iphdr_sz >> 1; i++)
+		csum += *next_iph++;
+
+	iph->check = ~((csum & 0xffff) + (csum >> 16));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
+{
+	struct bpf_dynptr new_xdp_ptr;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	__u32 transport_hdr_sz;
+	struct ipv6hdr *ip6h;
+	__u16 payload_len;
+	struct vip vip = {};
+	int dport;
+
+	if (ethhdr_sz + ipv6hdr_sz + tcphdr_sz > xdp->data_end - xdp->data)
+		transport_hdr_sz = udphdr_sz;
+	else
+		transport_hdr_sz = tcphdr_sz;
+
+	ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz, ipv6hdr_sz + transport_hdr_sz);
+	if (!ip6h)
+		return XDP_DROP;
+
+	dport = get_dport(ip6h + 1, ip6h->nexthdr);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = ip6h->nexthdr;
+	vip.family = AF_INET6;
+	memcpy(vip.daddr.v6, ip6h->daddr.s6_addr32, sizeof(vip.daddr));
+	vip.dport = dport;
+	payload_len = ip6h->payload_len;
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v6-in-v6 */
+	if (!tnl || tnl->family != AF_INET6)
+		return XDP_PASS;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)ipv6hdr_sz))
+		return XDP_DROP;
+
+	bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
+	new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + ipv6hdr_sz + ethhdr_sz);
+	if (!new_eth)
+		return XDP_DROP;
+
+	ip6h = (struct ipv6hdr *)(new_eth + 1);
+	old_eth = (struct ethhdr *)(ip6h + 1);
+
+	set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IPV6));
+
+	ip6h->version = 6;
+	ip6h->priority = 0;
+	memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl));
+	ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + ipv6hdr_sz);
+	ip6h->nexthdr = IPPROTO_IPV6;
+	ip6h->hop_limit = 8;
+	memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6));
+	memcpy(ip6h->daddr.s6_addr32, tnl->daddr.v6, sizeof(tnl->daddr.v6));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+SEC("xdp")
+int _xdp_tx_iptunnel(struct xdp_md *xdp)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *eth;
+	__u16 h_proto;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+	eth = bpf_dynptr_data(&ptr, 0, ethhdr_sz);
+	if (!eth)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == bpf_htons(ETH_P_IP))
+		return handle_ipv4(xdp, &ptr);
+	else if (h_proto == bpf_htons(ETH_P_IPV6))
+		return handle_ipv6(xdp, &ptr);
+	else
+		return XDP_DROP;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
index 6118e3ab61fc..56c9f8a3ad3d 100644
--- a/tools/testing/selftests/bpf/test_tcp_hdr_options.h
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -50,6 +50,7 @@ struct linum_err {
 
 #define TCPOPT_EOL		0
 #define TCPOPT_NOP		1
+#define TCPOPT_MSS		2
 #define TCPOPT_WINDOW		3
 #define TCPOPT_EXP		254
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-22 23:56 ` [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs Joanne Koong
@ 2022-08-23  2:30   ` Kumar Kartikeya Dwivedi
  2022-08-23 22:26     ` Joanne Koong
  2022-08-24  1:15   ` kernel test robot
  1 sibling, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-23  2:30 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

+Cc XDP folks

On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add xdp dynptrs, which are dynptrs whose underlying pointer points
> to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of xdp->data and xdp->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).
>
> For reads and writes on the dynptr, this includes reading/writing
> from/to and across fragments. For data slices, direct access to

It's a bit awkward to have such a difference between xdp and skb
dynptr's read/write. I understand why it is the way it is, but it
still doesn't feel right. I'm not sure if we can reconcile the
differences, but it makes writing common code for both xdp and tc
harder as it needs to be aware of the differences (and then the flags
for dynptr_write would differ too). So we're 90% there but not the
whole way...
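
E.g. a common write helper already has to know which flavor of dynptr
it was handed, since the write flags only mean something for skb
(rough sketch, not tested):

static __always_inline int overwrite_hdr(struct bpf_dynptr *ptr, bool is_skb)
{
	__u8 buf[2] = {};
	__u8 *data;

	/* flags are only meaningful for the skb flavor */
	if (bpf_dynptr_write(ptr, 0, buf, sizeof(buf),
			     is_skb ? BPF_F_RECOMPUTE_CSUM : 0))
		return -1;

	/* skb: the write invalidated any old slices; xdp: it didn't.
	 * Portable code has to re-fetch either way.
	 */
	data = bpf_dynptr_data(ptr, 0, sizeof(buf));
	return data ? 0 : -1;
}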

> data in fragments is also permitted, but access across fragments
> is not. The returned data slice is reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
>
> Any helper calls that change the underlying packet buffer (eg
> bpf_xdp_adjust_head) invalidates any data slices of the associated
> dynptr. Whenever such a helper call is made, the verifier marks any
> PTR_TO_PACKET reg type (which includes xdp dynptr slices since they are
> PTR_TO_PACKETs) as unknown. The stack trace for this is
> check_helper_call() -> clear_all_pkt_pointers() ->
> __clear_all_pkt_pointers() -> mark_reg_unknown()
>
> For examples of how xdp dynptrs can be used, please see the attached
> selftests.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            |  6 ++++-
>  include/linux/filter.h         |  3 +++
>  include/uapi/linux/bpf.h       | 25 +++++++++++++++---
>  kernel/bpf/helpers.c           | 14 ++++++++++-
>  kernel/bpf/verifier.c          |  8 +++++-
>  net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
>  tools/include/uapi/linux/bpf.h | 25 +++++++++++++++---
>  7 files changed, 112 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 30615d1a0c13..455a215b6c57 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -410,11 +410,15 @@ enum bpf_type_flag {
>         /* DYNPTR points to sk_buff */
>         DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
>
> +       /* DYNPTR points to xdp_buff */
> +       DYNPTR_TYPE_XDP         = BIT(12 + BPF_BASE_TYPE_BITS),
> +
>         __BPF_TYPE_FLAG_MAX,
>         __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
>  };
>
> -#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
> +#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
> +                                | DYNPTR_TYPE_XDP)
>
>  /* Max number of base types. */
>  #define BPF_BASE_TYPE_LIMIT    (1UL << BPF_BASE_TYPE_BITS)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 649063d9cbfd..80f030239877 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1535,5 +1535,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>  int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
>  int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
>                           u32 len, u64 flags);
> +int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
> +int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
> +void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
>
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 320e6b95d95c..9feea29eebcd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5283,13 +5283,18 @@ union bpf_attr {
>   *                   and try again.
>   *
>   *                 * The data slice is automatically invalidated anytime
> - *                   **bpf_dynptr_write**\ () or a helper call that changes
> - *                   the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
> + *                   **bpf_dynptr_write**\ () is called.
> + *
> + *             For skb-type and xdp-type dynptrs:
> + *                 * The data slice is automatically invalidated anytime a
> + *                   helper call that changes the underlying packet buffer
> + *                   (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
>   *                   is called.
>   *     Return
>   *             Pointer to the underlying dynptr data, NULL if the dynptr is
>   *             read-only, if the dynptr is invalid, or if the offset and length
> - *             is out of bounds or in a paged buffer for skb-type dynptrs.
> + *             is out of bounds or in a paged buffer for skb-type dynptrs or
> + *             across fragments for xdp-type dynptrs.
>   *
>   * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>   *     Description
> @@ -5388,6 +5393,19 @@ union bpf_attr {
>   *             *flags* is currently unused, it must be 0 for now.
>   *     Return
>   *             0 on success or -EINVAL if flags is not 0.
> + *
> + * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
> + *     Description
> + *             Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
> + *             context.
> + *
> + *             Calls that change the *xdp_md*'s underlying packet buffer
> + *             (eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
> + *             they do invalidate any data slices associated with the dynptr.
> + *
> + *             *flags* is currently unused, it must be 0 for now.
> + *     Return
> + *             0 on success, -EINVAL if flags is not 0.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -5600,6 +5618,7 @@ union bpf_attr {
>         FN(tcp_raw_check_syncookie_ipv6),       \
>         FN(ktime_get_tai_ns),           \
>         FN(dynptr_from_skb),            \
> +       FN(dynptr_from_xdp),            \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 471a01a9b6ae..2b9dc4c6de04 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1541,6 +1541,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
>                 return 0;
>         case BPF_DYNPTR_TYPE_SKB:
>                 return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> +       case BPF_DYNPTR_TYPE_XDP:
> +               return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
>         default:
>                 WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
>                 return -EFAULT;
> @@ -1583,6 +1585,10 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
>         case BPF_DYNPTR_TYPE_SKB:
>                 return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
>                                              flags);
> +       case BPF_DYNPTR_TYPE_XDP:
> +               if (flags)
> +                       return -EINVAL;
> +               return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
>         default:
>                 WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
>                 return -EFAULT;
> @@ -1616,7 +1622,7 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>
>         type = bpf_dynptr_get_type(ptr);
>
> -       /* Only skb dynptrs can get read-only data slices, because the
> +       /* Only skb and xdp dynptrs can get read-only data slices, because the
>          * verifier enforces PTR_TO_PACKET accesses
>          */
>         is_rdonly = bpf_dynptr_is_rdonly(ptr);
> @@ -1640,6 +1646,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>                 data = skb->data;
>                 break;
>         }
> +       case BPF_DYNPTR_TYPE_XDP:
> +               /* if the requested data is across fragments, then it cannot
> +                * be accessed directly - bpf_xdp_pointer will return NULL
> +                */
> +               return (unsigned long)bpf_xdp_pointer(ptr->data,
> +                                                     ptr->offset + offset, len);
>         default:
>                 WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
>                 return 0;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1ea295f47525..d33648eee188 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -686,6 +686,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
>                 return BPF_DYNPTR_TYPE_RINGBUF;
>         case DYNPTR_TYPE_SKB:
>                 return BPF_DYNPTR_TYPE_SKB;
> +       case DYNPTR_TYPE_XDP:
> +               return BPF_DYNPTR_TYPE_XDP;
>         default:
>                 return BPF_DYNPTR_TYPE_INVALID;
>         }
> @@ -6078,6 +6080,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>                         case DYNPTR_TYPE_SKB:
>                                 err_extra = "skb ";
>                                 break;
> +                       case DYNPTR_TYPE_XDP:
> +                               err_extra = "xdp ";
> +                               break;
>                         }
>
>                         verbose(env, "Expected an initialized %sdynptr as arg #%d\n",
> @@ -7439,7 +7444,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                 mark_reg_known_zero(env, regs, BPF_REG_0);
>
>                 if (func_id == BPF_FUNC_dynptr_data &&
> -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
>                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
>                         regs[BPF_REG_0].range = meta.mem_size;

It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
modified by comparisons with packet pointers loaded from the xdp/skb
ctx, how do we distinguish e.g. between a pkt slice obtained from some
frag in a multi-buff XDP vs pkt pointer from a linear area?

Someone can compare data_meta from ctx with PTR_TO_PACKET from
bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
access for the linear area. reg_is_init_pkt_pointer will return true
as modified range is not considered for it. Same kind of issues when
doing comparison with data_end from ctx (though maybe you won't be
able to do incorrect data access at runtime using that).
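
Roughly, the pattern I'm worried about is something like this
(hypothetical sketch, not tested against your set):

SEC("xdp")
int oob_sketch(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	struct bpf_dynptr ptr;
	__u8 *slice;

	bpf_dynptr_from_xdp(ctx, 0, &ptr);
	slice = bpf_dynptr_data(&ptr, 0, 8);
	if (!slice)
		return XDP_DROP;

	/* slice is a plain PTR_TO_PACKET even though it may point
	 * into a frag, so this comparison convinces the verifier
	 * that 100 bytes are available behind it...
	 */
	if ((void *)(slice + 100) <= data_end)
		/* ...but at runtime slice can be in a small frag
		 * unrelated to data_end, so this may read OOB
		 */
		return slice[90];

	return XDP_PASS;
}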

I had a pkt_uid field in my patch [0] which disallowed comparisons
among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
and that disabled comparisons for them. reg->id is used for var_off
range propagation so it cannot be reused.

Coming back to this: What we really want here is a PTR_TO_MEM with a
mem_size, so maybe you should go that route instead of PTR_TO_PACKET
(and add a type tag to maybe pretty print it also as a packet pointer
in verifier log), or add some way to distinguish slice vs non-slice
pkt pointers like I did in my patch. You might also want to add some
tests for this corner case (there are some later in [0] if you want to
reuse them).

So TBH, I kinda dislike my own solution in [0] :). The complexity does
not seem worth it. The pkt_uid distinction is more useful (and
actually would be needed) in Toke's xdp queueing series, where in a
dequeue program you have multiple xdp_mds and want scoped slice
invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
slices of some other xdp_md). Here we can just get away with normal
PTR_TO_MEM.

... Or just let me know if you handle this correctly already, or if
this won't be an actual problem :).

  [0]: https://lore.kernel.org/bpf/20220306234311.452206-3-memxor@gmail.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs
  2022-08-22 23:56 [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs Joanne Koong
                   ` (2 preceding siblings ...)
  2022-08-22 23:56 ` [PATCH bpf-next v4 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
@ 2022-08-23  2:31 ` Kumar Kartikeya Dwivedi
  2022-08-23 18:52   ` Joanne Koong
  3 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-23  2:31 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev

On Tue, 23 Aug 2022 at 02:06, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> This patchset is the 2nd in the dynptr series. The 1st can be found here [0].
>
> This patchset adds skb and xdp type dynptrs, which have two main benefits for
> packet parsing:
>     * allowing operations on sizes that are not statically known at
>       compile-time (eg variable-sized accesses).
>     * more ergonomic and less brittle iteration through data (eg does not need
>       manual if checking for being within bounds of data_end)
>

Just curious: so would you be adding a dynptr interface for obtaining
data_meta slices as well in the future? Since the same manual bounds
checking is needed for data_meta vs data. How would that look in the
generic dynptr interface of data/read/write this set is trying to fit
things in?
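
(For context, the manual data_meta check today mirrors the
data/data_end one - sketch:

SEC("xdp")
int read_meta(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	__u32 *meta = data_meta;

	/* the metadata area sits directly in front of the packet
	 * data, so the bounds check is against data, not data_end
	 */
	if ((void *)(meta + 1) > data)
		return XDP_PASS;

	return *meta ? XDP_PASS : XDP_DROP;
}

so presumably a dynptr-based version would absorb that if check the
same way it does for data.)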



> When comparing the differences in runtime for packet parsing without dynptrs
> vs. with dynptrs for the more simple cases, there is no noticeable difference.
> For the more complex cases where lengths are non-statically known at compile
> time, there can be a significant speed-up when using dynptrs (eg a 2x speed up
> for cls redirection). Patch 3 contains more details as well as examples of how
> to use skb and xdp dynptrs.
>
> [0] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
>
> --

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs
  2022-08-23  2:31 ` [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs Kumar Kartikeya Dwivedi
@ 2022-08-23 18:52   ` Joanne Koong
  2022-08-24 18:01     ` Andrii Nakryiko
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-23 18:52 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev

On Mon, Aug 22, 2022 at 7:32 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 23 Aug 2022 at 02:06, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > This patchset is the 2nd in the dynptr series. The 1st can be found here [0].
> >
> > This patchset adds skb and xdp type dynptrs, which have two main benefits for
> > packet parsing:
> >     * allowing operations on sizes that are not statically known at
> >       compile-time (eg variable-sized accesses).
> >     * more ergonomic and less brittle iteration through data (eg does not need
> >       manual if checking for being within bounds of data_end)
> >
>
> Just curious: so would you be adding a dynptr interface for obtaining
> data_meta slices as well in the future? Since the same manual bounds
> checking is needed for data_meta vs data. How would that look in the
> generic dynptr interface of data/read/write this set is trying to fit
> things in?

Oh cool, I didn't realize there is also a data_meta used in packet
parsing - thanks for bringing this up. I think there are 2 options for
how data_meta can be incorporated into the dynptr interface:

1) have a separate api "bpf_dynptr_from_{skb/xdp}_meta. We'll have to
have a function in the verifier that does something similar to
'may_access_direct_pkt_data' but for pkt data meta, since skb progs
can have different access restrictions for data vs. data_meta.

2) ideally, the flags arg would be used to indicate whether the
parsing should be for data_meta. To support this though, I think we'd
need to do access type checking within the helper instead of at the
verifier level. One idea is to pass in the env->ops ptr as a 4th arg
(manually patching it from the verifier) to the helper, which can be
used to determine if data_meta access is permitted.

In both options, there'll be a new BPF_DYNPTR_{SKB/XDP}_META dynptr
type and data/read/write will be supported for it.
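
For option 1, usage would look something like this
(bpf_dynptr_from_xdp_meta is the hypothetical new api, sketch only):

SEC("xdp")
int parse_meta(struct xdp_md *ctx)
{
	struct bpf_dynptr meta_ptr;
	__u32 *meta;

	/* hypothetical helper from option 1 */
	if (bpf_dynptr_from_xdp_meta(ctx, 0, &meta_ptr))
		return XDP_PASS;

	meta = bpf_dynptr_data(&meta_ptr, 0, sizeof(*meta));
	if (!meta)
		return XDP_PASS;

	/* data/read/write on meta_ptr would then behave the same
	 * way they do on a data dynptr
	 */
	return *meta ? XDP_PASS : XDP_DROP;
}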

What are your thoughts?

>
>
>
> > When comparing the differences in runtime for packet parsing without dynptrs
> > vs. with dynptrs for the more simple cases, there is no noticeable difference.
> > For the more complex cases where lengths are non-statically known at compile
> > time, there can be a significant speed-up when using dynptrs (eg a 2x speed up
> > for cls redirection). Patch 3 contains more details as well as examples of how
> > to use skb and xdp dynptrs.
> >
> > [0] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
> >
> > --

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-23  2:30   ` Kumar Kartikeya Dwivedi
@ 2022-08-23 22:26     ` Joanne Koong
  2022-08-24 10:39       ` Toke Høiland-Jørgensen
  2022-08-24 21:10       ` Kumar Kartikeya Dwivedi
  0 siblings, 2 replies; 43+ messages in thread
From: Joanne Koong @ 2022-08-23 22:26 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> +Cc XDP folks
>
> On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add xdp dynptrs, which are dynptrs whose underlying pointer points
> > to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of xdp->data and xdp->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
> >
> > For reads and writes on the dynptr, this includes reading/writing
> > from/to and across fragments. For data slices, direct access to
>
> It's a bit awkward to have such a difference between xdp and skb
> dynptr's read/write. I understand why it is the way it is, but it
> still doesn't feel right. I'm not sure if we can reconcile the
> differences, but it makes writing common code for both xdp and tc
> harder as it needs to be aware of the differences (and then the flags
> for dynptr_write would differ too). So we're 90% there but not the
> whole way...

Yeah, it'd be great if the behavior for skb/xdp progs could be the
same, but I'm not seeing a better solution here (unless we invalidate
data slices on writes in xdp progs, just to make it match more :P).

Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
not convinced this is much of a problem - xdp and skb programs already
have different interfaces for doing things (eg
bpf_{skb/xdp}_{store/load}_bytes).

>
> > data in fragments is also permitted, but access across fragments
> > is not. The returned data slice is reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
> >
> > Any helper calls that change the underlying packet buffer (eg
> > bpf_xdp_adjust_head) invalidates any data slices of the associated
> > dynptr. Whenever such a helper call is made, the verifier marks any
> > PTR_TO_PACKET reg type (which includes xdp dynptr slices since they are
> > PTR_TO_PACKETs) as unknown. The stack trace for this is
> > check_helper_call() -> clear_all_pkt_pointers() ->
> > __clear_all_pkt_pointers() -> mark_reg_unknown()
> >
> > For examples of how xdp dynptrs can be used, please see the attached
> > selftests.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/bpf.h            |  6 ++++-
> >  include/linux/filter.h         |  3 +++
> >  include/uapi/linux/bpf.h       | 25 +++++++++++++++---
> >  kernel/bpf/helpers.c           | 14 ++++++++++-
> >  kernel/bpf/verifier.c          |  8 +++++-
> >  net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
> >  tools/include/uapi/linux/bpf.h | 25 +++++++++++++++---
> >  7 files changed, 112 insertions(+), 15 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 30615d1a0c13..455a215b6c57 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -410,11 +410,15 @@ enum bpf_type_flag {
> >         /* DYNPTR points to sk_buff */
> >         DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
> >
> > +       /* DYNPTR points to xdp_buff */
> > +       DYNPTR_TYPE_XDP         = BIT(12 + BPF_BASE_TYPE_BITS),
> > +
> >         __BPF_TYPE_FLAG_MAX,
> >         __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
> >  };
> >
> > -#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
> > +#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
> > +                                | DYNPTR_TYPE_XDP)
> >
> >  /* Max number of base types. */
> >  #define BPF_BASE_TYPE_LIMIT    (1UL << BPF_BASE_TYPE_BITS)
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 649063d9cbfd..80f030239877 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1535,5 +1535,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
> >  int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
> >  int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> >                           u32 len, u64 flags);
> > +int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
> > +int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
> > +void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
> >
> >  #endif /* __LINUX_FILTER_H__ */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 320e6b95d95c..9feea29eebcd 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5283,13 +5283,18 @@ union bpf_attr {
> >   *                   and try again.
> >   *
> >   *                 * The data slice is automatically invalidated anytime
> > - *                   **bpf_dynptr_write**\ () or a helper call that changes
> > - *                   the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
> > + *                   **bpf_dynptr_write**\ () is called.
> > + *
> > + *             For skb-type and xdp-type dynptrs:
> > + *                 * The data slice is automatically invalidated anytime a
> > + *                   helper call that changes the underlying packet buffer
> > + *                   (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
> >   *                   is called.
> >   *     Return
> >   *             Pointer to the underlying dynptr data, NULL if the dynptr is
> >   *             read-only, if the dynptr is invalid, or if the offset and length
> > - *             is out of bounds or in a paged buffer for skb-type dynptrs.
> > + *             is out of bounds or in a paged buffer for skb-type dynptrs or
> > + *             across fragments for xdp-type dynptrs.
> >   *
> >   * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
> >   *     Description
> > @@ -5388,6 +5393,19 @@ union bpf_attr {
> >   *             *flags* is currently unused, it must be 0 for now.
> >   *     Return
> >   *             0 on success or -EINVAL if flags is not 0.
> > + *
> > + * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
> > + *     Description
> > + *             Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
> > + *             context.
> > + *
> > + *             Calls that change the *xdp_md*'s underlying packet buffer
> > + *             (eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
> > + *             they do invalidate any data slices associated with the dynptr.
> > + *
> > + *             *flags* is currently unused, it must be 0 for now.
> > + *     Return
> > + *             0 on success, -EINVAL if flags is not 0.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)          \
> >         FN(unspec),                     \
> > @@ -5600,6 +5618,7 @@ union bpf_attr {
> >         FN(tcp_raw_check_syncookie_ipv6),       \
> >         FN(ktime_get_tai_ns),           \
> >         FN(dynptr_from_skb),            \
> > +       FN(dynptr_from_xdp),            \
> >         /* */
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 471a01a9b6ae..2b9dc4c6de04 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1541,6 +1541,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
> >                 return 0;
> >         case BPF_DYNPTR_TYPE_SKB:
> >                 return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> > +       case BPF_DYNPTR_TYPE_XDP:
> > +               return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
> >         default:
> >                 WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
> >                 return -EFAULT;
> > @@ -1583,6 +1585,10 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
> >         case BPF_DYNPTR_TYPE_SKB:
> >                 return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
> >                                              flags);
> > +       case BPF_DYNPTR_TYPE_XDP:
> > +               if (flags)
> > +                       return -EINVAL;
> > +               return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
> >         default:
> >                 WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
> >                 return -EFAULT;
> > @@ -1616,7 +1622,7 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> >
> >         type = bpf_dynptr_get_type(ptr);
> >
> > -       /* Only skb dynptrs can get read-only data slices, because the
> > +       /* Only skb and xdp dynptrs can get read-only data slices, because the
> >          * verifier enforces PTR_TO_PACKET accesses
> >          */
> >         is_rdonly = bpf_dynptr_is_rdonly(ptr);
> > @@ -1640,6 +1646,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> >                 data = skb->data;
> >                 break;
> >         }
> > +       case BPF_DYNPTR_TYPE_XDP:
> > +               /* if the requested data is across fragments, then it cannot
> > +                * be accessed directly - bpf_xdp_pointer will return NULL
> > +                */
> > +               return (unsigned long)bpf_xdp_pointer(ptr->data,
> > +                                                     ptr->offset + offset, len);
> >         default:
> >                 WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
> >                 return 0;
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 1ea295f47525..d33648eee188 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -686,6 +686,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
> >                 return BPF_DYNPTR_TYPE_RINGBUF;
> >         case DYNPTR_TYPE_SKB:
> >                 return BPF_DYNPTR_TYPE_SKB;
> > +       case DYNPTR_TYPE_XDP:
> > +               return BPF_DYNPTR_TYPE_XDP;
> >         default:
> >                 return BPF_DYNPTR_TYPE_INVALID;
> >         }
> > @@ -6078,6 +6080,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >                         case DYNPTR_TYPE_SKB:
> >                                 err_extra = "skb ";
> >                                 break;
> > +                       case DYNPTR_TYPE_XDP:
> > +                               err_extra = "xdp ";
> > +                               break;
> >                         }
> >
> >                         verbose(env, "Expected an initialized %sdynptr as arg #%d\n",
> > @@ -7439,7 +7444,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                 mark_reg_known_zero(env, regs, BPF_REG_0);
> >
> >                 if (func_id == BPF_FUNC_dynptr_data &&
> > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> >                         regs[BPF_REG_0].range = meta.mem_size;
>
> It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> modified by comparisons with packet pointers loaded from the xdp/skb
> ctx, how do we distinguish e.g. between a pkt slice obtained from some
> frag in a multi-buff XDP vs pkt pointer from a linear area?
>
> Someone can compare data_meta from ctx with PTR_TO_PACKET from
> bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> access for the linear area. reg_is_init_pkt_pointer will return true
> as modified range is not considered for it. Same kind of issues when
> doing comparison with data_end from ctx (though maybe you won't be
> able to do incorrect data access at runtime using that).
>
> I had a pkt_uid field in my patch [0] which disallowed comparisons
> among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> and that disabled comparisons for them. reg->id is used for var_off
> range propagation so it cannot be reused.
>
> Coming back to this: What we really want here is a PTR_TO_MEM with a
> mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> (and add a type tag to maybe pretty print it also as a packet pointer
> in verifier log), or add some way to distinguish slice vs non-slice
> pkt pointers like I did in my patch. You might also want to add some
> tests for this corner case (there are some later in [0] if you want to
> reuse them).
>
> So TBH, I kinda dislike my own solution in [0] :). The complexity does
> not seem worth it. The pkt_uid distinction is more useful (and
> actually would be needed) in Toke's xdp queueing series, where in a
> dequeue program you have multiple xdp_mds and want scoped slice
> invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> slices of some other xdp_md). Here we can just get away with normal
> PTR_TO_MEM.
>
> ... Or just let me know if you handle this correctly already, or if
> this won't be an actual problem :).

Ooh interesting, I hadn't previously taken a look at
try_match_pkt_pointers(), thanks for mentioning it :)

The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
returning PTR_TO_MEM because it seems more robust (eg if in the future
we reject x behavior on the packet data reg types, this will
automatically apply to the data slices), and because it'll keep the
logic more efficient/simpler for the case when the pkt pointer has to
be cleared after any helper that changes pkt data is called (aka the
case where the data slice gets invalidated).
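
Roughly (sketch, modeled on the shape of the current helper):

static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
				    enum bpf_reg_type which)
{
	/* dynptr data slices would carry DYNPTR_TYPE_{SKB,XDP} in
	 * reg->type, so they would never match a plain ctx pkt
	 * pointer and comparisons against them wouldn't propagate
	 * pkt ranges
	 */
	if (reg->type & (DYNPTR_TYPE_SKB | DYNPTR_TYPE_XDP))
		return false;

	return reg->type == which &&
	       reg->off == 0 &&
	       tnum_equals_const(reg->var_off, 0);
}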

What are your thoughts?

>
>   [0]: https://lore.kernel.org/bpf/20220306234311.452206-3-memxor@gmail.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-22 23:56 ` [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs Joanne Koong
@ 2022-08-23 23:22   ` kernel test robot
  2022-08-23 23:53   ` kernel test robot
  2022-08-24 18:27   ` Andrii Nakryiko
  2 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2022-08-23 23:22 UTC (permalink / raw)
  To: Joanne Koong, bpf
  Cc: kbuild-all, andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

Hi Joanne,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: arm-buildonly-randconfig-r002-20220823 (https://download.01.org/0day-ci/archive/20220824/202208240728.59W00MTW-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/a2c8a74d8f0b7fd0b0008dc9bc5ccf9887317f36
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
        git checkout a2c8a74d8f0b7fd0b0008dc9bc5ccf9887317f36
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: kernel/bpf/helpers.o: in function `bpf_dynptr_read':
>> helpers.c:(.text+0xb1c): undefined reference to `__bpf_skb_load_bytes'
   arm-linux-gnueabi-ld: kernel/bpf/helpers.o: in function `bpf_dynptr_write':
>> helpers.c:(.text+0xc30): undefined reference to `__bpf_skb_store_bytes'

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-22 23:56 ` [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs Joanne Koong
  2022-08-23 23:22   ` kernel test robot
@ 2022-08-23 23:53   ` kernel test robot
  2022-08-24 18:27   ` Andrii Nakryiko
  2 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2022-08-23 23:53 UTC (permalink / raw)
  To: Joanne Koong, bpf
  Cc: kbuild-all, andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

Hi Joanne,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: csky-randconfig-r022-20220823 (https://download.01.org/0day-ci/archive/20220824/202208240751.BRPS1SoF-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/a2c8a74d8f0b7fd0b0008dc9bc5ccf9887317f36
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
        git checkout a2c8a74d8f0b7fd0b0008dc9bc5ccf9887317f36
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=csky SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   csky-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_read':
>> kernel/bpf/helpers.c:1543: undefined reference to `__bpf_skb_load_bytes'
   csky-linux-ld: kernel/bpf/helpers.o: in function `bpf_dynptr_read':
   kernel/bpf/helpers.c:1522: undefined reference to `__bpf_skb_load_bytes'
   csky-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_write':
>> kernel/bpf/helpers.c:1584: undefined reference to `__bpf_skb_store_bytes'
   csky-linux-ld: kernel/bpf/helpers.o: in function `bpf_dynptr_write':
   kernel/bpf/helpers.c:1561: undefined reference to `__bpf_skb_store_bytes'


vim +1543 kernel/bpf/helpers.c

  1521	
  1522	BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src,
  1523		   u32, offset, u64, flags)
  1524	{
  1525		enum bpf_dynptr_type type;
  1526		int err;
  1527	
  1528		if (!src->data || flags)
  1529			return -EINVAL;
  1530	
  1531		err = bpf_dynptr_check_off_len(src, offset, len);
  1532		if (err)
  1533			return err;
  1534	
  1535		type = bpf_dynptr_get_type(src);
  1536	
  1537		switch (type) {
  1538		case BPF_DYNPTR_TYPE_LOCAL:
  1539		case BPF_DYNPTR_TYPE_RINGBUF:
  1540			memcpy(dst, src->data + src->offset + offset, len);
  1541			return 0;
  1542		case BPF_DYNPTR_TYPE_SKB:
> 1543			return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
  1544		default:
  1545			WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
  1546			return -EFAULT;
  1547		}
  1548	}
  1549	
  1550	static const struct bpf_func_proto bpf_dynptr_read_proto = {
  1551		.func		= bpf_dynptr_read,
  1552		.gpl_only	= false,
  1553		.ret_type	= RET_INTEGER,
  1554		.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
  1555		.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
  1556		.arg3_type	= ARG_PTR_TO_DYNPTR,
  1557		.arg4_type	= ARG_ANYTHING,
  1558		.arg5_type	= ARG_ANYTHING,
  1559	};
  1560	
  1561	BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
  1562		   u32, len, u64, flags)
  1563	{
  1564		enum bpf_dynptr_type type;
  1565		int err;
  1566	
  1567		if (!dst->data || bpf_dynptr_is_rdonly(dst))
  1568			return -EINVAL;
  1569	
  1570		err = bpf_dynptr_check_off_len(dst, offset, len);
  1571		if (err)
  1572			return err;
  1573	
  1574		type = bpf_dynptr_get_type(dst);
  1575	
  1576		switch (type) {
  1577		case BPF_DYNPTR_TYPE_LOCAL:
  1578		case BPF_DYNPTR_TYPE_RINGBUF:
  1579			if (flags)
  1580				return -EINVAL;
  1581			memcpy(dst->data + dst->offset + offset, src, len);
  1582			return 0;
  1583		case BPF_DYNPTR_TYPE_SKB:
> 1584			return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
  1585						     flags);
  1586		default:
  1587			WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
  1588			return -EFAULT;
  1589		}
  1590	}
  1591	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-22 23:56 ` [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs Joanne Koong
  2022-08-23  2:30   ` Kumar Kartikeya Dwivedi
@ 2022-08-24  1:15   ` kernel test robot
  1 sibling, 0 replies; 43+ messages in thread
From: kernel test robot @ 2022-08-24  1:15 UTC (permalink / raw)
  To: Joanne Koong, bpf
  Cc: kbuild-all, andrii, daniel, ast, kafai, kuba, netdev, Joanne Koong

Hi Joanne,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: microblaze-buildonly-randconfig-r003-20220823 (https://download.01.org/0day-ci/archive/20220824/202208240923.X2tgBh6Y-lkp@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/6d832ed17cea3ede1cd48f885a73144d9c4d800a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
        git checkout 6d832ed17cea3ede1cd48f885a73144d9c4d800a
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=microblaze SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   microblaze-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_read':
   kernel/bpf/helpers.c:1543: undefined reference to `__bpf_skb_load_bytes'
>> microblaze-linux-ld: kernel/bpf/helpers.c:1545: undefined reference to `__bpf_xdp_load_bytes'
   microblaze-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_write':
   kernel/bpf/helpers.c:1586: undefined reference to `__bpf_skb_store_bytes'
>> microblaze-linux-ld: kernel/bpf/helpers.c:1591: undefined reference to `__bpf_xdp_store_bytes'
   microblaze-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_data':
   kernel/bpf/helpers.c:1653: undefined reference to `bpf_xdp_pointer'
   pahole: .tmp_vmlinux.btf: Invalid argument
   .btf.vmlinux.bin.o: file not recognized: file format not recognized

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-23 22:26     ` Joanne Koong
@ 2022-08-24 10:39       ` Toke Høiland-Jørgensen
  2022-08-24 18:10         ` Joanne Koong
  2022-08-24 21:10       ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 43+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-08-24 10:39 UTC (permalink / raw)
  To: Joanne Koong, Kumar Kartikeya Dwivedi
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, brouer, lorenzo

Joanne Koong <joannelkoong@gmail.com> writes:

> On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>>
>> +Cc XDP folks
>>
>> On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
>> >
>> > Add xdp dynptrs, which are dynptrs whose underlying pointer points
>> > to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
>> > benefits. One is that they allow operations on sizes that are not
>> > statically known at compile-time (eg variable-sized accesses).
>> > Another is that parsing the packet data through dynptrs (instead of
>> > through direct access of xdp->data and xdp->data_end) can be more
>> > ergonomic and less brittle (eg does not need manual if checking for
>> > being within bounds of data_end).
>> >
>> > For reads and writes on the dynptr, this includes reading/writing
>> > from/to and across fragments. For data slices, direct access to
>>
>> It's a bit awkward to have such a difference between xdp and skb
>> dynptr's read/write. I understand why it is the way it is, but it
>> still doesn't feel right. I'm not sure if we can reconcile the
>> differences, but it makes writing common code for both xdp and tc
>> harder as it needs to be aware of the differences (and then the flags
>> for dynptr_write would differ too). So we're 90% there but not the
>> whole way...
>
> Yeah, it'd be great if the behavior for skb/xdp progs could be the
> same, but I'm not seeing a better solution here (unless we invalidate
> data slices on writes in xdp progs, just to make it match more :P).
>
> Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> not convinced this is much of a problem - xdp and skb programs already
> have different interfaces for doing things (eg
> bpf_{skb/xdp}_{store/load}_bytes).

This is true, but it's quite possible to paper over these differences
and write BPF code that works for both TC and XDP. Subtle semantic
differences in otherwise identical functions make this harder.

Today you can write a function like:

static inline int parse_pkt(void *data, void* data_end)
{
        /* parse data */
}

And call it like:

SEC("xdp")
int parse_xdp(struct xdp_md *ctx)
{
        return parse_pkt(ctx->data, ctx->data_end);
}

SEC("tc")
int parse_tc(struct __sk_buff *skb)
{
        return parse_pkt(skb->data, skb->data_end);
}


IMO the goal should be to be able to do the equivalent for dynptrs, like:

static inline int parse_pkt(struct bpf_dynptr *ptr)
{
        __u64 *data;
        
	data = bpf_dynptr_data(ptr, 0, sizeof(*data));
	if (!data)
		return 0;
        /* parse data */
}

SEC("xdp")
int parse_xdp(struct xdp_md *ctx)
{
	struct bpf_dynptr ptr;

	bpf_dynptr_from_xdp(ctx, 0, &ptr);
        return parse_pkt(&ptr);
}

SEC("tc")
int parse_tc(struct __sk_buff *skb)
{
	struct bpf_dynptr ptr;

	bpf_dynptr_from_skb(skb, 0, &ptr);
        return parse_pkt(&ptr);
}


If the dynptr-based parse_pkt() function has to take special care to
figure out where the dynptr comes from, it makes it a lot more difficult
to write reusable packet parsing functions. So I'd be in favour of
restricting the dynptr interface to the lowest common denominator of the
skb and xdp interfaces even if that makes things slightly more awkward
in the specialised cases...

-Toke


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs
  2022-08-23 18:52   ` Joanne Koong
@ 2022-08-24 18:01     ` Andrii Nakryiko
  2022-08-24 23:18       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-24 18:01 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Kumar Kartikeya Dwivedi, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Tue, Aug 23, 2022 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 7:32 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Tue, 23 Aug 2022 at 02:06, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > This patchset is the 2nd in the dynptr series. The 1st can be found here [0].
> > >
> > > This patchset adds skb and xdp type dynptrs, which have two main benefits for
> > > packet parsing:
> > >     * allowing operations on sizes that are not statically known at
> > >       compile-time (eg variable-sized accesses).
> > >     * more ergonomic and less brittle iteration through data (eg does not need
> > >       manual if checking for being within bounds of data_end)
> > >
> >
> > Just curious: so would you be adding a dynptr interface for obtaining
> > data_meta slices as well in the future? Since the same manual bounds
> > checking is needed for data_meta vs data. How would that look in the
> > generic dynptr interface of data/read/write this set is trying to fit
> > things in?
>
> Oh cool, I didn't realize there is also a data_meta used in packet
> parsing - thanks for bringing this up. I think there are 2 options for
> how data_meta can be incorporated into the dynptr interface:
>
> 1) have a separate api "bpf_dynptr_from_{skb/xdp}_meta. We'll have to
> have a function in the verifier that does something similar to
> 'may_access_direct_pkt_data' but for pkt data meta, since skb progs
> can have different access restrictions for data vs. data_meta.
>
> 2) ideally, the flags arg would be used to indicate whether the
> parsing should be for data_meta. To support this though, I think we'd
> need to do access type checking within the helper instead of at the
> verifier level. One idea is to pass in the env->ops ptr as a 4th arg
> (manually patching it from the verifier) to the helper,  which can be
> used to determine if data_meta access is permitted.
>
> In both options, there'll be a new BPF_DYNPTR_{SKB/XDP}_META dynptr
> type and data/read/write will be supported for it.
>
> What are your thoughts?

I think separate bpf_dynptr_from_skb_meta() and
bpf_dynptr_from_xdp_meta() is cleaner than a flag. Also having a
separate helper would make it easier to disable this helper for
program types that don't have access to ctx->data_meta, right?

>
> >
> >
> >
> > > When comparing the differences in runtime for packet parsing without dynptrs
> > > vs. with dynptrs for the more simple cases, there is no noticeable difference.
> > > For the more complex cases where lengths are non-statically known at compile
> > > time, there can be a significant speed-up when using dynptrs (eg a 2x speed up
> > > for cls redirection). Patch 3 contains more details as well as examples of how
> > > to use skb and xdp dynptrs.
> > >
> > > [0] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/
> > >
> > > --

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-24 10:39       ` Toke Høiland-Jørgensen
@ 2022-08-24 18:10         ` Joanne Koong
  2022-08-24 23:04           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-24 18:10 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Kumar Kartikeya Dwivedi, bpf, andrii, daniel, ast, kafai, kuba,
	netdev, brouer, lorenzo

On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Joanne Koong <joannelkoong@gmail.com> writes:
>
> > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> >>
> >> +Cc XDP folks
> >>
> >> On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
> >> >
> >> > Add xdp dynptrs, which are dynptrs whose underlying pointer points
> >> > to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
> >> > benefits. One is that they allow operations on sizes that are not
> >> > statically known at compile-time (eg variable-sized accesses).
> >> > Another is that parsing the packet data through dynptrs (instead of
> >> > through direct access of xdp->data and xdp->data_end) can be more
> >> > ergonomic and less brittle (eg does not need manual if checking for
> >> > being within bounds of data_end).
> >> >
> >> > For reads and writes on the dynptr, this includes reading/writing
> >> > from/to and across fragments. For data slices, direct access to
> >>
> >> It's a bit awkward to have such a difference between xdp and skb
> >> dynptr's read/write. I understand why it is the way it is, but it
> >> still doesn't feel right. I'm not sure if we can reconcile the
> >> differences, but it makes writing common code for both xdp and tc
> >> harder as it needs to be aware of the differences (and then the flags
> >> for dynptr_write would differ too). So we're 90% there but not the
> >> whole way...
> >
> > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > same, but I'm not seeing a better solution here (unless we invalidate
> > data slices on writes in xdp progs, just to make it match more :P).
> >
> > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > not convinced this is much of a problem - xdp and skb programs already
> > have different interfaces for doing things (eg
> > bpf_{skb/xdp}_{store/load}_bytes).
>
> This is true, but it's quite possible to paper over these differences
> and write BPF code that works for both TC and XDP. Subtle semantic
> differences in otherwise identical functions makes this harder.
>
> Today you can write a function like:
>
> static inline int parse_pkt(void *data, void* data_end)
> {
>         /* parse data */
> }
>
> And call it like:
>
> SEC("xdp")
> int parse_xdp(struct xdp_md *ctx)
> {
>         return parse_pkt(ctx->data, ctx->data_end);
> }
>
> SEC("tc")
> int parse_tc(struct __sk_buff *skb)
> {
>         return parse_pkt(skb->data, skb->data_end);
> }
>
>
> IMO the goal should be to be able to do the equivalent for dynptrs, like:
>
> static inline int parse_pkt(struct bpf_dynptr *ptr)
> {
>         __u64 *data;
>
>         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
>         if (!data)
>                 return 0;
>         /* parse data */
> }
>
> SEC("xdp")
> int parse_xdp(struct xdp_md *ctx)
> {
>         struct bpf_dynptr ptr;
>
>         bpf_dynptr_from_xdp(ctx, 0, &ptr);
>         return parse_pkt(&ptr);
> }
>
> SEC("tc")
> int parse_tc(struct __sk_buff *skb)
> {
>         struct bpf_dynptr ptr;
>
>         bpf_dynptr_from_skb(skb, 0, &ptr);
>         return parse_pkt(&ptr);
> }
>

To clarify, this is already possible when using data slices, since the
behavior for data slices is equivalent between xdp and tc programs for
non-fragmented accesses. From looking through the selftests, I
anticipate that data slices will be the main way programs interact
with dynptrs. For the cases where the program may write into frags,
then bpf_dynptr_write will be needed (which is where the functionality
between xdp and tc starts differing) - today, we're not able to write
common code that writes into the frags since tc uses
bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
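
To make that concrete, here's a rough sketch of the kind of shared
write helper a unified bpf_dynptr_write() would enable (the helper
name and offsets here are made up for illustration):

static inline int set_ip_tos(struct bpf_dynptr *ptr, __u8 tos)
{
        __u32 off = ETH_HLEN + offsetof(struct iphdr, tos);

        /* same call whether ptr came from bpf_dynptr_from_skb() or
         * bpf_dynptr_from_xdp(), including when the byte lives in a frag
         */
        return bpf_dynptr_write(ptr, off, &tos, sizeof(tos), 0);
}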

I'm more and more liking the idea of limiting xdp to match the
constraints of skb, given that you, Kumar, and Jakub have all
mentioned that portability between xdp and skb would be useful for
users :)

What are your thoughts on this API:

1) bpf_dynptr_data()

Before:
  for skb-type progs:
      - data slices in fragments are not supported

  for xdp-type progs:
      - data slices in fragments are supported as long as they are in a
contiguous frag (eg not across frags)

Now:
  for skb + xdp type progs:
      - data slices in fragments are not supported


2)  bpf_dynptr_write()

Before:
  for skb-type progs:
     - all data slices are invalidated after a write

  for xdp-type progs:
     - nothing

Now:
  for skb + xdp type progs:
     - all data slices are invalidated after a write

This will unite the functionality for skb and xdp programs across
bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
common bpf_dynptr_from_packet as Jakub brought up in [0], I'm leaning
towards no because 1) if in the future there's some irreconcilable
aspect between skb and xdp that gets added, that'll be hard to support
since the expectation is that there is just one overall "packet
dynptr" 2) the "packet dynptr" view is not completely accurate (eg
bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
this adds some additional hardcoding in the verifier since there's no
organic mapping between prog type -> prog ctx


[0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
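
For concreteness, under this proposal the same program shape would
behave identically for skb and xdp progs (a sketch, assuming the
helpers in this set):

SEC("tc")
int rewrite_proto(struct __sk_buff *skb)
{
        struct bpf_dynptr ptr;
        struct ethhdr *eth;
        __be16 proto = bpf_htons(ETH_P_IPV6);

        bpf_dynptr_from_skb(skb, 0, &ptr);

        eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth)); /* head only, per 1) */
        if (!eth)
                return TC_ACT_SHOT;

        bpf_dynptr_write(&ptr, offsetof(struct ethhdr, h_proto),
                         &proto, sizeof(proto), 0);

        /* per 2), eth is now invalidated for both skb and xdp progs
         * and must be re-fetched before the next access
         */
        eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
        if (!eth)
                return TC_ACT_SHOT;

        return TC_ACT_OK;
}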

>
> If the dynptr-based parse_pkt() function has to take special care to
> figure out where the dynptr comes from, it makes it a lot more difficult
> to write reusable packet parsing functions. So I'd be in favour of
> restricting the dynptr interface to the lowest common denominator of the
> skb and xdp interfaces even if that makes things slightly more awkward
> in the specialised cases...
>
> -Toke
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-22 23:56 ` [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs Joanne Koong
  2022-08-23 23:22   ` kernel test robot
  2022-08-23 23:53   ` kernel test robot
@ 2022-08-24 18:27   ` Andrii Nakryiko
  2022-08-24 23:25     ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-24 18:27 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev

On Mon, Aug 22, 2022 at 4:57 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).
>
> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only. For reads and writes through the bpf_dynptr_read() and
> bpf_dynptr_write() interfaces, this supports reading and writing into
> data in the non-linear paged buffers. For data slices (through the
> bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> must first call bpf_skb_pull_data() to pull the data into the linear
> portion. The returned data slice from a call to bpf_dynptr_data() is of
> reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
>
> Any bpf_dynptr_write() automatically invalidates any prior data slices
> to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> to data in a paged buffer, so it will need to pull the buffer first into
> the head. The reason it needs to be pulled instead of writing directly to
> the paged buffers is because they may be cloned (only the head of the skb
> is by default uncloned). As such, any bpf_dynptr_write() will
> automatically have its prior data slices invalidated, even if the write
> is to data in the skb head (the verifier has no way of differentiating
> whether the write is to the head or paged buffers during program load
> time). Please note as well that any other helper calls that change the
> underlying packet buffer (eg bpf_skb_pull_data()) invalidate any data
> slices of the skb dynptr as well. Whenever such a helper call is made,
> the verifier marks any PTR_TO_PACKET reg type (which includes skb dynptr
> slices since they are PTR_TO_PACKETs) as unknown. The stack trace for
> this is check_helper_call() -> clear_all_pkt_pointers() ->
> __clear_all_pkt_pointers() -> mark_reg_unknown()
>
> For examples of how skb dynptrs can be used, please see the attached
> selftests.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            | 83 +++++++++++++++++-----------
>  include/linux/filter.h         |  4 ++
>  include/uapi/linux/bpf.h       | 40 ++++++++++++--
>  kernel/bpf/helpers.c           | 81 +++++++++++++++++++++++++---
>  kernel/bpf/verifier.c          | 99 ++++++++++++++++++++++++++++------
>  net/core/filter.c              | 53 ++++++++++++++++--
>  tools/include/uapi/linux/bpf.h | 40 ++++++++++++--
>  7 files changed, 335 insertions(+), 65 deletions(-)
>

[...]

> @@ -1521,9 +1532,19 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
>         if (err)
>                 return err;
>
> -       memcpy(dst, src->data + src->offset + offset, len);
> +       type = bpf_dynptr_get_type(src);
>
> -       return 0;
> +       switch (type) {
> +       case BPF_DYNPTR_TYPE_LOCAL:
> +       case BPF_DYNPTR_TYPE_RINGBUF:
> +               memcpy(dst, src->data + src->offset + offset, len);
> +               return 0;
> +       case BPF_DYNPTR_TYPE_SKB:
> +               return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> +       default:
> +               WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);

nit: probably better to WARN_ONCE?
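
e.g.:

        WARN_ONCE(1, "bpf_dynptr_read: unknown dynptr type %d\n", type);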

> +               return -EFAULT;
> +       }
>  }
>
>  static const struct bpf_func_proto bpf_dynptr_read_proto = {
> @@ -1540,18 +1561,32 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
>  BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
>            u32, len, u64, flags)
>  {
> +       enum bpf_dynptr_type type;
>         int err;
>
> -       if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> +       if (!dst->data || bpf_dynptr_is_rdonly(dst))
>                 return -EINVAL;
>
>         err = bpf_dynptr_check_off_len(dst, offset, len);
>         if (err)
>                 return err;
>
> -       memcpy(dst->data + dst->offset + offset, src, len);
> +       type = bpf_dynptr_get_type(dst);
>
> -       return 0;
> +       switch (type) {
> +       case BPF_DYNPTR_TYPE_LOCAL:
> +       case BPF_DYNPTR_TYPE_RINGBUF:
> +               if (flags)
> +                       return -EINVAL;
> +               memcpy(dst->data + dst->offset + offset, src, len);
> +               return 0;
> +       case BPF_DYNPTR_TYPE_SKB:
> +               return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
> +                                            flags);
> +       default:
> +               WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);

ditto about WARN_ONCE

> +               return -EFAULT;
> +       }
>  }
>
>  static const struct bpf_func_proto bpf_dynptr_write_proto = {

[...]

> +static enum bpf_dynptr_type stack_slot_get_dynptr_info(struct bpf_verifier_env *env,
> +                                                      struct bpf_reg_state *reg,
> +                                                      int *ref_obj_id)
>  {
>         struct bpf_func_state *state = func(env, reg);
>         int spi = get_spi(reg->off);
>
> -       return state->stack[spi].spilled_ptr.id;
> +       if (ref_obj_id)
> +               *ref_obj_id = state->stack[spi].spilled_ptr.id;
> +
> +       return state->stack[spi].spilled_ptr.dynptr.type;
>  }
>
>  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> @@ -6056,7 +6075,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>                         case DYNPTR_TYPE_RINGBUF:
>                                 err_extra = "ringbuf ";
>                                 break;
> -                       default:
> +                       case DYNPTR_TYPE_SKB:
> +                               err_extra = "skb ";
>                                 break;
>                         }
>
> @@ -7149,6 +7169,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  {
>         enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
>         const struct bpf_func_proto *fn = NULL;
> +       enum bpf_dynptr_type dynptr_type;

compiler technically can complain about use of uninitialized
dynptr_type, maybe initialize it to BPF_DYNPTR_TYPE_INVALID ?

>         enum bpf_return_type ret_type;
>         enum bpf_type_flag ret_flag;
>         struct bpf_reg_state *regs;
> @@ -7320,24 +7341,43 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                         }
>                 }
>                 break;
> -       case BPF_FUNC_dynptr_data:
> -               for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
> -                       if (arg_type_is_dynptr(fn->arg_type[i])) {
> -                               if (meta.ref_obj_id) {
> -                                       verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> -                                       return -EFAULT;
> -                               }
> -                               /* Find the id of the dynptr we're tracking the reference of */
> -                               meta.ref_obj_id = stack_slot_get_id(env, &regs[BPF_REG_1 + i]);
> -                               break;
> -                       }
> +       case BPF_FUNC_dynptr_write:
> +       {
> +               struct bpf_reg_state *reg;
> +
> +               reg = get_dynptr_arg_reg(fn, regs);
> +               if (!reg) {
> +                       verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");

s/bpf_dynptr_data/bpf_dynptr_write/

> +                       return -EFAULT;
>                 }
> -               if (i == MAX_BPF_FUNC_REG_ARGS) {
> +
> +               /* bpf_dynptr_write() for skb-type dynptrs may pull the skb, so we must
> +                * invalidate all data slices associated with it
> +                */
> +               if (stack_slot_get_dynptr_info(env, reg, NULL) == BPF_DYNPTR_TYPE_SKB)
> +                       changes_data = true;
> +
> +               break;
> +       }
> +       case BPF_FUNC_dynptr_data:
> +       {
> +               struct bpf_reg_state *reg;
> +
> +               reg = get_dynptr_arg_reg(fn, regs);
> +               if (!reg) {
>                         verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
>                         return -EFAULT;
>                 }
> +
> +               if (meta.ref_obj_id) {
> +                       verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> +                       return -EFAULT;
> +               }
> +
> +               dynptr_type = stack_slot_get_dynptr_info(env, reg, &meta.ref_obj_id);
>                 break;
>         }
> +       }
>
>         if (err)
>                 return err;
> @@ -7397,8 +7437,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                 break;
>         case RET_PTR_TO_ALLOC_MEM:
>                 mark_reg_known_zero(env, regs, BPF_REG_0);
> -               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> -               regs[BPF_REG_0].mem_size = meta.mem_size;
> +
> +               if (func_id == BPF_FUNC_dynptr_data &&
> +                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> +                       regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> +                       regs[BPF_REG_0].range = meta.mem_size;
> +               } else {
> +                       regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> +                       regs[BPF_REG_0].mem_size = meta.mem_size;
> +               }
>                 break;
>         case RET_PTR_TO_MEM_OR_BTF_ID:
>         {
> @@ -14141,6 +14188,24 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         goto patch_call_imm;
>                 }
>
> +               if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> +                       bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> +
> +                       insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, is_rdonly);
> +                       insn_buf[1] = *insn;
> +                       cnt = 2;

This might have been discussed before, sorry if I missed it. But why
this special rewrite to provide read-only flag vs having two
implementations of bpf_dynptr_from_skb() depending on program types?
If program type allows read+write access, return
bpf_dynptr_from_skb_rdwr(), for those that have read-only access -
bpf_dynptr_from_skb_rdonly(), and for others return NULL proto
(disable helper)?

You can then use it very similarly for bpf_dynptr_from_skb_meta().
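
Roughly (hypothetical proto names, just to illustrate the idea):

/* in tc_cls_act_func_proto(), where pkt data is writable: */
case BPF_FUNC_dynptr_from_skb:
        return &bpf_dynptr_from_skb_rdwr_proto;

/* in lwt_out_func_proto(), where pkt data is read-only: */
case BPF_FUNC_dynptr_from_skb:
        return &bpf_dynptr_from_skb_rdonly_proto;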

> +
> +                       new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +                       if (!new_prog)
> +                               return -ENOMEM;
> +
> +                       delta += cnt - 1;
> +                       env->prog = new_prog;
> +                       prog = new_prog;
> +                       insn = new_prog->insnsi + i + delta;
> +                       goto patch_call_imm;
> +               }
> +

[...]

>  BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
>  {
>         return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> @@ -7726,6 +7763,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_get_socket_uid_proto;
>         case BPF_FUNC_perf_event_output:
>                 return &bpf_skb_event_output_proto;
> +       case BPF_FUNC_dynptr_from_skb:
> +               return &bpf_dynptr_from_skb_proto;
>         default:
>                 return bpf_sk_base_func_proto(func_id);
>         }
> @@ -7909,6 +7948,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_tcp_raw_check_syncookie_ipv6_proto;
>  #endif
>  #endif
> +       case BPF_FUNC_dynptr_from_skb:
> +               return &bpf_dynptr_from_skb_proto;
>         default:
>                 return bpf_sk_base_func_proto(func_id);
>         }
> @@ -8104,6 +8145,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>         case BPF_FUNC_skc_lookup_tcp:
>                 return &bpf_skc_lookup_tcp_proto;
>  #endif
> +       case BPF_FUNC_dynptr_from_skb:
> +               return &bpf_dynptr_from_skb_proto;
>         default:
>                 return bpf_sk_base_func_proto(func_id);
>         }
> @@ -8142,6 +8185,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>                 return &bpf_get_smp_processor_id_proto;
>         case BPF_FUNC_skb_under_cgroup:
>                 return &bpf_skb_under_cgroup_proto;
> +       case BPF_FUNC_dynptr_from_skb:
> +               return &bpf_dynptr_from_skb_proto;

so like here you'd return read-only prototype for dynptr_from_skb
(seems like LWT programs have only read-only access, according to
may_access_direct_pkt_data), but in cases where
may_access_direct_pkt_data() allows read-write access we'd choose rdwr
variant of dynptr_from_skb?

>         default:
>                 return bpf_sk_base_func_proto(func_id);
>         }

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-22 23:56 ` [PATCH bpf-next v4 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
@ 2022-08-24 18:47   ` Andrii Nakryiko
  0 siblings, 0 replies; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-24 18:47 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev

On Mon, Aug 22, 2022 at 4:57 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Test skb and xdp dynptr functionality in the following ways:
>
> 1) progs/test_cls_redirect_dynptr.c
>    * Rewrite "progs/test_cls_redirect.c" test to use dynptrs to parse
>      skb data
>
>    * This is a great example of how dynptrs can be used to simplify a
>      lot of the parsing logic for non-statically known values, and speed
>      up execution times.
>
>      When measuring the user + system time between the original version
>      vs. using dynptrs, and averaging the time for 10 runs (using
>      "time ./test_progs -t cls_redirect"), there was a 2x speed-up:
>          original version: 0.053 sec
>          with dynptrs: 0.025 sec
>
> 2) progs/test_xdp_dynptr.c
>    * Rewrite "progs/test_xdp.c" test to use dynptrs to parse xdp data
>
>      There were no noticeable differences in user + system time between
>      the original version vs. using dynptrs. Averaging the time for 10
>      runs (run using "time ./test_progs -t xdp_bpf2bpf"):
>          original version: 0.0449 sec
>          with dynptrs: 0.0429 sec
>
> 3) progs/test_l4lb_noinline_dynptr.c
>    * Rewrite "progs/test_l4lb_noinline.c" test to use dynptrs to parse
>      skb data
>
>      There were no noticeable differences in user + system time between
>      the original version vs. using dynptrs. Averaging the time for 10
>      runs (run using "time ./test_progs -t l4lb_all"):
>          original version: 0.0502 sec
>          with dynptrs: 0.055 sec
>
>      For number of processed verifier instructions:
>          original version: 6284 insns
>          with dynptrs: 2538 insns
>
> 4) progs/test_parse_tcp_hdr_opt_dynptr.c
>    * Add sample code for parsing tcp hdr opt lookup using dynptrs.
>      This logic is lifted from a real-world use case of packet parsing in
>      katran [0], a layer 4 load balancer. The original version
>      "progs/test_parse_tcp_hdr_opt.c" (not using dynptrs) is included
>      here as well, for comparison.
>
> 5) progs/dynptr_success.c
>    * Add test case "test_skb_readonly" for testing attempts at writes /
>      data slices on a prog type with read-only skb ctx.
>
> 6) progs/dynptr_fail.c
>    * Add test cases "skb_invalid_data_slice{1,2}" and
>      "xdp_invalid_data_slice" for testing that helpers that modify the
>      underlying packet buffer automatically invalidate the associated
>      data slice.
>    * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
>      that prog types that do not support bpf_dynptr_from_skb/xdp don't
>      have access to the API.
>    * Add test case "skb_invalid_write" for testing that read-only skb
>      dynptrs can't be written to through data slices.
>
> [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  .../selftests/bpf/prog_tests/cls_redirect.c   |  25 +
>  .../testing/selftests/bpf/prog_tests/dynptr.c | 132 ++-
>  .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +
>  .../bpf/prog_tests/parse_tcp_hdr_opt.c        |  85 ++
>  .../selftests/bpf/prog_tests/xdp_attach.c     |   9 +-
>  .../testing/selftests/bpf/progs/dynptr_fail.c | 111 ++
>  .../selftests/bpf/progs/dynptr_success.c      |  29 +
>  .../bpf/progs/test_cls_redirect_dynptr.c      | 968 ++++++++++++++++++
>  .../bpf/progs/test_l4lb_noinline_dynptr.c     | 468 +++++++++
>  .../bpf/progs/test_parse_tcp_hdr_opt.c        | 119 +++
>  .../bpf/progs/test_parse_tcp_hdr_opt_dynptr.c | 110 ++
>  .../selftests/bpf/progs/test_xdp_dynptr.c     | 240 +++++
>  .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
>  13 files changed, 2255 insertions(+), 44 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_dynptr.c
>

Massive work on adding lots of selftests, thanks! Left few nits, but
looks good anyways:

Acked-by: Andrii Nakryiko <andrii@kernel.org>


[...]

> -       /* success cases */
> -       {"test_read_write", NULL},
> -       {"test_data_slice", NULL},
> -       {"test_ringbuf", NULL},
> +               "Unsupported reg type fp for bpf_dynptr_from_mem data", SETUP_NONE},
> +       {"skb_invalid_data_slice1", "invalid mem access 'scalar'", SETUP_NONE},
> +       {"skb_invalid_data_slice2", "invalid mem access 'scalar'", SETUP_NONE},
> +       {"xdp_invalid_data_slice", "invalid mem access 'scalar'", SETUP_NONE},
> +       {"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb", SETUP_NONE},
> +       {"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp", SETUP_NONE},
> +       {"skb_invalid_write", "cannot write into packet", SETUP_NONE},

nit: given SETUP_NONE is zero, you can just leave it out to make this
table a bit cleaner; but no big deal having it explicitly as well

> +
> +       /* these tests should be run and should succeed */
> +       {"test_read_write", NULL, SETUP_SYSCALL_SLEEP},
> +       {"test_data_slice", NULL, SETUP_SYSCALL_SLEEP},
> +       {"test_ringbuf", NULL, SETUP_SYSCALL_SLEEP},
> +       {"test_skb_readonly", NULL, SETUP_SKB_PROG},
>  };
>

[...]

> +static void test_parsing(bool use_dynptr)
> +{
> +       char buf[128];
> +       struct bpf_program *prog;
> +       void *skel_ptr;
> +       int err;
> +
> +       LIBBPF_OPTS(bpf_test_run_opts, topts,
> +                   .data_in = &pkt,
> +                   .data_size_in = sizeof(pkt),
> +                   .data_out = buf,
> +                   .data_size_out = sizeof(buf),
> +                   .repeat = 3,
> +       );
> +
> +       if (use_dynptr) {
> +               struct test_parse_tcp_hdr_opt_dynptr *skel;
> +
> +               skel = test_parse_tcp_hdr_opt_dynptr__open_and_load();
> +               if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> +                       return;
> +
> +               pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr;
> +               prog = skel->progs.xdp_ingress_v6;
> +               skel_ptr = skel;
> +       } else {
> +               struct test_parse_tcp_hdr_opt *skel;
> +
> +               skel = test_parse_tcp_hdr_opt__open_and_load();
> +               if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> +                       return;
> +
> +               pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr;
> +               prog = skel->progs.xdp_ingress_v6;
> +               skel_ptr = skel;
> +       }
> +
> +       err = bpf_prog_test_run_opts(bpf_program__fd(prog), &topts);
> +       ASSERT_OK(err, "ipv6 test_run");
> +       ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval");
> +
> +       if (use_dynptr) {
> +               struct test_parse_tcp_hdr_opt_dynptr *skel = skel_ptr;
> +
> +               ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
> +               test_parse_tcp_hdr_opt_dynptr__destroy(skel);
> +       } else {
> +               struct test_parse_tcp_hdr_opt *skel = skel_ptr;
> +
> +               ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
> +               test_parse_tcp_hdr_opt__destroy(skel);
> +       }
> +}
> +
> +void test_parse_tcp_hdr_opt(void)
> +{
> +       test_parsing(false);
> +       test_parsing(true);

given this false/true argument is very non-descriptive and you
basically only share a few lines of code inside test_parsing, why not
have two dedicated test_parsing_wo_dynptr and
test_parsing_with_dynptr? And it probably makes sense to run those
as two subtests?
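
Something like this (sketch; subtest names made up):

void test_parse_tcp_hdr_opt(void)
{
        if (test__start_subtest("wo_dynptr"))
                test_parsing_wo_dynptr();
        if (test__start_subtest("with_dynptr"))
                test_parsing_with_dynptr();
}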

> +}
> diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
> index 62aa3edda5e6..40d0d61af9e6 100644
> --- a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
> +++ b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
> @@ -4,11 +4,10 @@
>  #define IFINDEX_LO 1
>  #define XDP_FLAGS_REPLACE              (1U << 4)
>
> -void serial_test_xdp_attach(void)
> +static void serial_test_xdp_attach(const char *file)
>  {
>         __u32 duration = 0, id1, id2, id0 = 0, len;
>         struct bpf_object *obj1, *obj2, *obj3;
> -       const char *file = "./test_xdp.o";
>         struct bpf_prog_info info = {};
>         int err, fd1, fd2, fd3;
>         LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
> @@ -85,3 +84,9 @@ void serial_test_xdp_attach(void)
>  out_1:
>         bpf_object__close(obj1);
>  }
> +
> +void test_xdp_attach(void)
> +{
> +       serial_test_xdp_attach("./test_xdp.o");
> +       serial_test_xdp_attach("./test_xdp_dynptr.o");

nit: make into subtests?

> +}

[...]

> +/* The data slice is invalidated whenever a helper changes packet data */
> +SEC("?xdp")
> +int xdp_invalid_data_slice(struct xdp_md *xdp)
> +{
> +       struct bpf_dynptr ptr;
> +       struct ethhdr *hdr;
> +
> +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> +       hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
> +       if (!hdr)
> +               return SK_DROP;
> +
> +       hdr->h_proto = 9;
> +
> +       if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr)))
> +               return XDP_DROP;
> +
> +       /* this should fail */
> +       hdr->h_proto = 1;
> +
> +       return XDP_PASS;
> +}
> +
> +/* Only supported prog type can create skb-type dynptrs */
> +SEC("?raw_tp/sys_nanosleep")

nit: there is no sys_nanosleep raw tracepoint, is there? Just
SEC("?raw_tp") maybe, like you did in recent refactoring?

> +int skb_invalid_ctx(void *ctx)
> +{
> +       struct bpf_dynptr ptr;
> +
> +       /* this should fail */
> +       bpf_dynptr_from_skb(ctx, 0, &ptr);
> +
> +       return 0;
> +}
> +
> +/* Only supported prog type can create xdp-type dynptrs */
> +SEC("?raw_tp/sys_nanosleep")
> +int xdp_invalid_ctx(void *ctx)
> +{
> +       struct bpf_dynptr ptr;
> +
> +       /* this should fail */
> +       bpf_dynptr_from_xdp(ctx, 0, &ptr);
> +
> +       return 0;
> +}
> +

[...]

> +
> +/* Global metrics, per CPU
> + */
> +struct {
> +       __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +       __uint(max_entries, 1);
> +       __type(key, unsigned int);
> +       __type(value, metrics_t);
> +} metrics_map SEC(".maps");
> +
> +static metrics_t *get_global_metrics(void)
> +{
> +       uint64_t key = 0;
> +       return bpf_map_lookup_elem(&metrics_map, &key);
> +}
> +
> +static ret_t accept_locally(struct __sk_buff *skb, encap_headers_t *encap)
> +{
> +       const int payload_off =
> +               sizeof(*encap) +
> +               sizeof(struct in_addr) * encap->unigue.hop_count;
> +       int32_t encap_overhead = payload_off - sizeof(struct ethhdr);
> +
> +       // Changing the ethertype if the encapsulated packet is ipv6

nit: could be copy/paste from original, but let's not add C++ comments?

> +       if (encap->gue.proto_ctype == IPPROTO_IPV6)
> +               encap->eth.h_proto = bpf_htons(ETH_P_IPV6);
> +
> +       if (bpf_skb_adjust_room(skb, -encap_overhead, BPF_ADJ_ROOM_MAC,
> +                               BPF_F_ADJ_ROOM_FIXED_GSO |
> +                               BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
> +           bpf_csum_level(skb, BPF_CSUM_LEVEL_DEC))
> +               return TC_ACT_SHOT;
> +
> +       return bpf_redirect(skb->ifindex, BPF_F_INGRESS);
> +}
> +

[...]

> +       iph->version = 4;
> +       iph->ihl = iphdr_sz >> 2;
> +       iph->frag_off = 0;
> +       iph->protocol = IPPROTO_IPIP;
> +       iph->check = 0;
> +       iph->tos = 0;
> +       iph->tot_len = bpf_htons(payload_len + iphdr_sz);
> +       iph->daddr = tnl->daddr.v4;
> +       iph->saddr = tnl->saddr.v4;
> +       iph->ttl = 8;
> +
> +       next_iph = (__u16 *)iph;
> +#pragma clang loop unroll(full)

nit: probably don't need unroll?

> +       for (i = 0; i < iphdr_sz >> 1; i++)
> +               csum += *next_iph++;
> +
> +       iph->check = ~((csum & 0xffff) + (csum >> 16));
> +
> +       count_tx(vip.protocol);
> +
> +       return XDP_TX;
> +}

[...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-23 22:26     ` Joanne Koong
  2022-08-24 10:39       ` Toke Høiland-Jørgensen
@ 2022-08-24 21:10       ` Kumar Kartikeya Dwivedi
  2022-08-25 20:39         ` Joanne Koong
  1 sibling, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 21:10 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> > > [...]
> > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > >                         regs[BPF_REG_0].range = meta.mem_size;
> >
> > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > modified by comparisons with packet pointers loaded from the xdp/skb
> > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > frag in a multi-buff XDP vs pkt pointer from a linear area?
> >
> > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > access for the linear area. reg_is_init_pkt_pointer will return true
> > as modified range is not considered for it. Same kind of issues when
> > doing comparison with data_end from ctx (though maybe you won't be
> > able to do incorrect data access at runtime using that).
> >
> > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > and that disabled comparisons for them. reg->id is used for var_off
> > range propagation so it cannot be reused.
> >
> > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > (and add a type tag to maybe pretty print it also as a packet pointer
> > in verifier log), or add some way to distinguish slice vs non-slice
> > pkt pointers like I did in my patch. You might also want to add some
> > tests for this corner case (there are some later in [0] if you want to
> > reuse them).
> >
> > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > not seem worth it. The pkt_uid distinction is more useful (and
> > actually would be needed) in Toke's xdp queueing series, where in a
> > dequeue program you have multiple xdp_mds and want scoped slice
> > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > slices of some other xdp_md). Here we can just get away with normal
> > PTR_TO_MEM.
> >
> > ... Or just let me know if you handle this correctly already, or if
> > this won't be an actual problem :).
>
> Ooh interesting, I hadn't previously taken a look at
> try_match_pkt_pointers(), thanks for mentioning it :)
>
> The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> returning PTR_TO_MEM because it seems more robust (eg if in the future
> we reject x behavior on the packet data reg types, this will
> automatically apply to the data slices), and because it'll keep the
> logic more efficient/simpler for the case when the pkt pointer has to
> be cleared after any helper that changes pkt data is called (aka the
> case where the data slice gets invalidated).
>
> What are your thoughts?
>

Thinking more deeply about it, probably not, we need more work here. I
remember _now_ why I chose the pkt_uid approach (and this tells us my
commit log lacks all the details about the motivation :( ).

Consider how equivalency checking for packet pointers works in
regsafe. It is checking type, then if old range > cur range, then
offs, etc.

The problem is, while we now don't prune on access to ptr_to_pkt vs
ptr_to_pkt | dynptr_pkt types in the same reg (since the types differ
we return false), we still prune when the regsafe range check passes
for two ptr_to_pkt | dynptr_pkt regs. Both could be pointing into
separate frags, so that assumption would be incorrect. I would be able
to trick the verifier into accessing data beyond the length of a
different frag, by first making sure one line of exploration is
verified, and then changing the register in another branch reaching
the same branch target. Helpers can take packet pointers so the access
can become a pruning point. It would think the rest of the stuff is
safe, while they are not equivalent at all. It is ok if they are bit
by bit equivalent (same type, range, off, etc.).

If you start returning false whenever you see this type tag set, it
will become too conservative (it considers reg copies of the same
dynptr_data lookup as distinct). So you need some kind of id assigned
during dynptr_data lookup to distinguish them.
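
FWIW, a rough sketch of the program shape I mean (assumes the helpers
in this set; offsets are arbitrary, and whether pruning actually
triggers depends on the regsafe details above):

SEC("xdp")
int prune_trick(struct xdp_md *ctx)
{
        struct bpf_dynptr ptr;
        __u8 *p;

        bpf_dynptr_from_xdp(ctx, 0, &ptr);

        if (bpf_get_prandom_u32() & 1)
                p = bpf_dynptr_data(&ptr, 0, 8);    /* slice in linear area */
        else
                p = bpf_dynptr_data(&ptr, 4096, 8); /* slice into a frag */
        if (!p)
                return XDP_DROP;

        /* both paths reach here with a pkt slice in p; if one
         * exploration is judged to subsume the other, bounds proven
         * for a slice into one frag get applied to a slice into a
         * different frag
         */
        return p[0] ? XDP_PASS : XDP_DROP;
}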

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-24 18:10         ` Joanne Koong
@ 2022-08-24 23:04           ` Kumar Kartikeya Dwivedi
  2022-08-25 20:14             ` Joanne Koong
                               ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 23:04 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Toke Høiland-Jørgensen, bpf, andrii, daniel, ast,
	kafai, kuba, netdev, brouer, lorenzo

On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Joanne Koong <joannelkoong@gmail.com> writes:
> > >> [...]
> > >> It's a bit awkward to have such a difference between xdp and skb
> > >> dynptr's read/write. I understand why it is the way it is, but it
> > >> still doesn't feel right. I'm not sure if we can reconcile the
> > >> differences, but it makes writing common code for both xdp and tc
> > >> harder as it needs to be aware of the differences (and then the flags
> > >> for dynptr_write would differ too). So we're 90% there but not the
> > >> whole way...
> > >
> > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > same, but I'm not seeing a better solution here (unless we invalidate
> > > data slices on writes in xdp progs, just to make it match more :P).
> > >
> > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > not convinced this is much of a problem - xdp and skb programs already
> > > have different interfaces for doing things (eg
> > > bpf_{skb/xdp}_{store/load}_bytes).
> >
> > This is true, but it's quite possible to paper over these differences
> > and write BPF code that works for both TC and XDP. Subtle semantic
> > differences in otherwise identical functions makes this harder.
> >
> > Today you can write a function like:
> >
> > static inline int parse_pkt(void *data, void* data_end)
> > {
> >         /* parse data */
> > }
> >
> > And call it like:
> >
> > SEC("xdp")
> > int parse_xdp(struct xdp_md *ctx)
> > {
> >         return parse_pkt(ctx->data, ctx->data_end);
> > }
> >
> > SEC("tc")
> > int parse_tc(struct __sk_buff *skb)
> > {
> >         return parse_pkt(skb->data, skb->data_end);
> > }
> >
> >
> > IMO the goal should be to be able to do the equivalent for dynptrs, like:
> >
> > static inline int parse_pkt(struct bpf_dynptr *ptr)
> > {
> >         __u64 *data;
> >
> >         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
> >         if (!data)
> >                 return 0;
> >         /* parse data */
> > }
> >
> > SEC("xdp")
> > int parse_xdp(struct xdp_md *ctx)
> > {
> >         struct bpf_dynptr ptr;
> >
> >         bpf_dynptr_from_xdp(ctx, 0, &ptr);
> >         return parse_pkt(&ptr);
> > }
> >
> > SEC("tc")
> > int parse_tc(struct __sk_buff *skb)
> > {
> >         struct bpf_dynptr ptr;
> >
> >         bpf_dynptr_from_skb(skb, 0, &ptr);
> >         return parse_pkt(&ptr);
> > }
> >
>
> To clarify, this is already possible when using data slices, since the
> behavior for data slices is equivalent between xdp and tc programs for
> non-fragmented accesses. From looking through the selftests, I
> anticipate that data slices will be the main way programs interact
> with dynptrs. For the cases where the program may write into frags,
> then bpf_dynptr_write will be needed (which is where the functionality
> between xdp and tc starts differing) - today, we're not able to write
> common code that writes into the frags since tc uses
> bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.

Yes, we cannot write code that just swaps those two calls right now.
The key difference is, the two above are separate functions already
and have different semantics. Here in this set you have the same
function, but with different semantics depending on ctx, which is the
point of contention.

>
> I'm more and more liking the idea of limiting xdp to match the
> constraints of skb, given that you, Kumar, and Jakub have all
> mentioned that portability between xdp and skb would be useful for
> users :)
>
> What are your thoughts on this API:
>
> 1) bpf_dynptr_data()
>
> Before:
>   for skb-type progs:
>       - data slices in fragments are not supported
>
>   for xdp-type progs:
>       - data slices in fragments are supported as long as they are in a
> contiguous frag (eg not across frags)
>
> Now:
>   for skb + xdp type progs:
>       - data slices in fragments are not supported
>
>
> 2)  bpf_dynptr_write()
>
> Before:
>   for skb-type progs:
>      - all data slices are invalidated after a write
>
>   for xdp-type progs:
>      - nothing
>
> Now:
>   for skb + xdp type progs:
>      - all data slices are invalidated after a write
>

There is also the other option: failing the write until you pull the
skb, which looks a lot better to me if we are adding this helper anyway...
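
i.e. something like (sketch; assumes a tc prog, and the error code
here is made up):

        err = bpf_dynptr_write(&ptr, off, buf, len, 0);
        if (err == -EAGAIN) {
                /* data lies in a frag: make the program pull
                 * explicitly instead of the helper pulling (and
                 * invalidating slices) behind its back
                 */
                if (bpf_skb_pull_data(skb, off + len))
                        return TC_ACT_SHOT;
                err = bpf_dynptr_write(&ptr, off, buf, len, 0);
        }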

> This will unite the functionality for skb and xdp programs across
> bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> common bpf_dynptr_from_packet as Jakub brought up in [0], I'm leaning
> towards no because 1) if in the future there's some irreconcilable
> aspect between skb and xdp that gets added, that'll be hard to support
> since the expectation is that there is just one overall "packet
> dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> this adds some additional hardcoding in the verifier since there's no
> organic mapping between prog type -> prog ctx
>

There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
misunderstood what you meant).

I also had a different tangential question (unrelated to
'reconciliation'). Let's forget about that for a moment. I was
listening to the talk here [0]. At 12:00, I can hear someone
mentioning that you'd have a dynptr for each segment/frag.

Right now, you have a skb/xdp dynptr, which is representing
discontiguous memory, i.e. you represent all linear page + fragments
using one dynptr. This seems a bit inconsistent with the idea of a
dynptr, since it's conceptually a slice of variable length memory.
dynptr_data gives you a constant length slice into dynptr's variable
length memory (which the verifier can track bounds of statically).

So thinking about the idea in [0], one way would be to change
from_xdp/from_skb helpers to take an index (into the frags array), and
return a dynptr for each frag. Or determine the frag from offset + len.
For the skb case it might require kmap_atomic to access the frag
dynptr, which would need a paired kunmap_atomic call as well, and it
may not be able to return a dynptr for frags in cloned skbs, but for
XDP multi-buff none of that applies.
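
Roughly the shape I have in mind (hypothetical helper, not in this set):

/* returns a dynptr over one contiguous frag (say frag_idx == 0 for
 * the linear area), instead of one dynptr over all the discontiguous
 * memory
 */
long bpf_dynptr_from_xdp_frag(struct xdp_md *ctx, u32 frag_idx,
                              u64 flags, struct bpf_dynptr *ptr);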

Was this approach/line of thinking considered and/or rejected? What
were the reasons? Just trying to understand the background here.

What I'm wondering is that we already have helpers to do the reads and
writes that you are wrapping in dynptr interfaces, but what we're
missing is a way to get direct access to frags (or the 'next' ctx,
essentially). When we know we can avoid pulls, it might be cheaper to
access the skb this way, right? Especially for a case where just a
part of the header lies in the next frag?

But I'm no expert here, so I might be missing some subtleties.

  [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M

>
> [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
>
> >
> > If the dynptr-based parse_pkt() function has to take special care to
> > figure out where the dynptr comes from, it makes it a lot more difficult
> > to write reusable packet parsing functions. So I'd be in favour of
> > restricting the dynptr interface to the lowest common denominator of the
> > skb and xdp interfaces even if that makes things slightly more awkward
> > in the specialised cases...
> >
> > -Toke
> >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs
  2022-08-24 18:01     ` Andrii Nakryiko
@ 2022-08-24 23:18       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 23:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Wed, 24 Aug 2022 at 20:02, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 11:52 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 7:32 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Tue, 23 Aug 2022 at 02:06, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > This patchset is the 2nd in the dynptr series. The 1st can be found here [0].
> > > >
> > > > This patchset adds skb and xdp type dynptrs, which have two main benefits for
> > > > packet parsing:
> > > >     * allowing operations on sizes that are not statically known at
> > > >       compile-time (eg variable-sized accesses).
> > > >     * more ergonomic and less brittle iteration through data (eg does not need
> > > >       manual if checking for being within bounds of data_end)
> > > >
> > >
> > > Just curious: so would you be adding a dynptr interface for obtaining
> > > data_meta slices as well in the future? Since the same manual bounds
> > > checking is needed for data_meta vs data. How would that look in the
> > > generic dynptr interface of data/read/write this set is trying to fit
> > > things in?
> >
> > Oh cool, I didn't realize there is also a data_meta used in packet
> > parsing - thanks for bringing this up. I think there are 2 options for
> > how data_meta can be incorporated into the dynptr interface:
> >
> > 1) have a separate api "bpf_dynptr_from_{skb/xdp}_meta. We'll have to
> > have a function in the verifier that does something similar to
> > 'may_access_direct_pkt_data' but for pkt data meta, since skb progs
> > can have different access restrictions for data vs. data_meta.
> >
> > 2) ideally, the flags arg would be used to indicate whether the
> > parsing should be for data_meta. To support this though, I think we'd
> > need to do access type checking within the helper instead of at the
> > verifier level. One idea is to pass in the env->ops ptr as a 4th arg
> > (manually patching it from the verifier) to the helper, which can be
> > used to determine if data_meta access is permitted.
> >
> > In both options, there'll be a new BPF_DYNPTR_{SKB/XDP}_META dynptr
> > type and data/read/write will be supported for it.
> >
> > What are your thoughts?
>
> I think separate bpf_dynptr_from_skb_meta() and
> bpf_dynptr_from_xdp_meta() is cleaner than a flag. Also having a
> separate helper would make it easier to disable this helper for
> program types that don't have access to ctx->data_meta, right?
>

Agreed, and with flags then you also need to force them to be constant
(to be able to distinguish the return type from the flag's value),
which might be too restrictive.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-24 18:27   ` Andrii Nakryiko
@ 2022-08-24 23:25     ` Kumar Kartikeya Dwivedi
  2022-08-25 21:02       ` Joanne Koong
  0 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-24 23:25 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Wed, 24 Aug 2022 at 20:42, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 4:57 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
> >
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only. For reads and writes through the bpf_dynptr_read() and
> > bpf_dynptr_write() interfaces, this supports reading and writing into
> > data in the non-linear paged buffers. For data slices (through the
> > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > must first call bpf_skb_pull_data() to pull the data into the linear
> > portion. The returned data slice from a call to bpf_dynptr_data() is of
> > reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
> >
> > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > to data in a paged buffer, so it will need to pull the buffer first into
> > the head. The reason it needs to be pulled instead of writing directly to
> > the paged buffers is because they may be cloned (only the head of the skb
> > is by default uncloned). As such, any bpf_dynptr_write() will
> > automatically have its prior data slices invalidated, even if the write
> > is to data in the skb head (the verifier has no way of differentiating
> > whether the write is to the head or paged buffers during program load
> > time). Please note as well that any other helper calls that change the
> > underlying packet buffer (eg bpf_skb_pull_data()) invalidate any data
> > slices of the skb dynptr as well. Whenever such a helper call is made,
> > the verifier marks any PTR_TO_PACKET reg type (which includes skb dynptr
> > slices since they are PTR_TO_PACKETs) as unknown. The stack trace for
> > this is check_helper_call() -> clear_all_pkt_pointers() ->
> > __clear_all_pkt_pointers() -> mark_reg_unknown()
> >
> > For examples of how skb dynptrs can be used, please see the attached
> > selftests.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/bpf.h            | 83 +++++++++++++++++-----------
> >  include/linux/filter.h         |  4 ++
> >  include/uapi/linux/bpf.h       | 40 ++++++++++++--
> >  kernel/bpf/helpers.c           | 81 +++++++++++++++++++++++++---
> >  kernel/bpf/verifier.c          | 99 ++++++++++++++++++++++++++++------
> >  net/core/filter.c              | 53 ++++++++++++++++--
> >  tools/include/uapi/linux/bpf.h | 40 ++++++++++++--
> >  7 files changed, 335 insertions(+), 65 deletions(-)
> >
>
> [...]
>
> > @@ -1521,9 +1532,19 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
> >         if (err)
> >                 return err;
> >
> > -       memcpy(dst, src->data + src->offset + offset, len);
> > +       type = bpf_dynptr_get_type(src);
> >
> > -       return 0;
> > +       switch (type) {
> > +       case BPF_DYNPTR_TYPE_LOCAL:
> > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > +               memcpy(dst, src->data + src->offset + offset, len);
> > +               return 0;
> > +       case BPF_DYNPTR_TYPE_SKB:
> > +               return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> > +       default:
> > +               WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
>
> nit: probably better to WARN_ONCE?
>
> > +               return -EFAULT;
> > +       }
> >  }
> >
> >  static const struct bpf_func_proto bpf_dynptr_read_proto = {
> > @@ -1540,18 +1561,32 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
> >  BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
> >            u32, len, u64, flags)
> >  {
> > +       enum bpf_dynptr_type type;
> >         int err;
> >
> > -       if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> > +       if (!dst->data || bpf_dynptr_is_rdonly(dst))
> >                 return -EINVAL;
> >
> >         err = bpf_dynptr_check_off_len(dst, offset, len);
> >         if (err)
> >                 return err;
> >
> > -       memcpy(dst->data + dst->offset + offset, src, len);
> > +       type = bpf_dynptr_get_type(dst);
> >
> > -       return 0;
> > +       switch (type) {
> > +       case BPF_DYNPTR_TYPE_LOCAL:
> > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > +               if (flags)
> > +                       return -EINVAL;
> > +               memcpy(dst->data + dst->offset + offset, src, len);
> > +               return 0;
> > +       case BPF_DYNPTR_TYPE_SKB:
> > +               return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
> > +                                            flags);
> > +       default:
> > +               WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
>
> ditto about WARN_ONCE
>
> > +               return -EFAULT;
> > +       }
> >  }
> >
> >  static const struct bpf_func_proto bpf_dynptr_write_proto = {
>
> [...]
>
> > +static enum bpf_dynptr_type stack_slot_get_dynptr_info(struct bpf_verifier_env *env,
> > +                                                      struct bpf_reg_state *reg,
> > +                                                      int *ref_obj_id)
> >  {
> >         struct bpf_func_state *state = func(env, reg);
> >         int spi = get_spi(reg->off);
> >
> > -       return state->stack[spi].spilled_ptr.id;
> > +       if (ref_obj_id)
> > +               *ref_obj_id = state->stack[spi].spilled_ptr.id;
> > +
> > +       return state->stack[spi].spilled_ptr.dynptr.type;
> >  }
> >
> >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > @@ -6056,7 +6075,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >                         case DYNPTR_TYPE_RINGBUF:
> >                                 err_extra = "ringbuf ";
> >                                 break;
> > -                       default:
> > +                       case DYNPTR_TYPE_SKB:
> > +                               err_extra = "skb ";
> >                                 break;
> >                         }
> >
> > @@ -7149,6 +7169,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >  {
> >         enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> >         const struct bpf_func_proto *fn = NULL;
> > +       enum bpf_dynptr_type dynptr_type;
>
> compiler technically can complain about use of uninitialized
> dynptr_type, maybe initialize it to BPF_DYNPTR_TYPE_INVALID ?
>
> >         enum bpf_return_type ret_type;
> >         enum bpf_type_flag ret_flag;
> >         struct bpf_reg_state *regs;
> > @@ -7320,24 +7341,43 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                         }
> >                 }
> >                 break;
> > -       case BPF_FUNC_dynptr_data:
> > -               for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
> > -                       if (arg_type_is_dynptr(fn->arg_type[i])) {
> > -                               if (meta.ref_obj_id) {
> > -                                       verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> > -                                       return -EFAULT;
> > -                               }
> > -                               /* Find the id of the dynptr we're tracking the reference of */
> > -                               meta.ref_obj_id = stack_slot_get_id(env, &regs[BPF_REG_1 + i]);
> > -                               break;
> > -                       }
> > +       case BPF_FUNC_dynptr_write:
> > +       {
> > +               struct bpf_reg_state *reg;
> > +
> > +               reg = get_dynptr_arg_reg(fn, regs);
> > +               if (!reg) {
> > +                       verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
>
> s/bpf_dynptr_data/bpf_dynptr_write/
>
> > +                       return -EFAULT;
> >                 }
> > -               if (i == MAX_BPF_FUNC_REG_ARGS) {
> > +
> > +               /* bpf_dynptr_write() for skb-type dynptrs may pull the skb, so we must
> > +                * invalidate all data slices associated with it
> > +                */
> > +               if (stack_slot_get_dynptr_info(env, reg, NULL) == BPF_DYNPTR_TYPE_SKB)
> > +                       changes_data = true;
> > +
> > +               break;
> > +       }
> > +       case BPF_FUNC_dynptr_data:
> > +       {
> > +               struct bpf_reg_state *reg;
> > +
> > +               reg = get_dynptr_arg_reg(fn, regs);
> > +               if (!reg) {
> >                         verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
> >                         return -EFAULT;
> >                 }
> > +
> > +               if (meta.ref_obj_id) {
> > +                       verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> > +                       return -EFAULT;
> > +               }
> > +
> > +               dynptr_type = stack_slot_get_dynptr_info(env, reg, &meta.ref_obj_id);
> >                 break;
> >         }
> > +       }
> >
> >         if (err)
> >                 return err;
> > @@ -7397,8 +7437,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                 break;
> >         case RET_PTR_TO_ALLOC_MEM:
> >                 mark_reg_known_zero(env, regs, BPF_REG_0);
> > -               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > -               regs[BPF_REG_0].mem_size = meta.mem_size;
> > +
> > +               if (func_id == BPF_FUNC_dynptr_data &&
> > +                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +                       regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > +                       regs[BPF_REG_0].range = meta.mem_size;
> > +               } else {
> > +                       regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > +                       regs[BPF_REG_0].mem_size = meta.mem_size;
> > +               }
> >                 break;
> >         case RET_PTR_TO_MEM_OR_BTF_ID:
> >         {
> > @@ -14141,6 +14188,24 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >                         goto patch_call_imm;
> >                 }
> >
> > +               if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > +                       bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > +
> > +                       insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, is_rdonly);
> > +                       insn_buf[1] = *insn;
> > +                       cnt = 2;
>
> This might have been discussed before, sorry if I missed it. But why
> this special rewrite to provide read-only flag vs having two
> implementations of bpf_dynptr_from_skb() depending on program types?
> If program type allows read+write access, return
> bpf_dynptr_from_skb_rdwr(), for those that have read-only access -
> bpf_dynptr_from_skb_rdonly(), and for others return NULL proto
> (disable helper)?
>
> You can then use it very similarly for bpf_dynptr_from_skb_meta().
>

Related question, it seems we know statically if dynptr is read only
or not, so why even do all this hidden parameter passing and instead
just reject writes directly? You only need to be able to set
MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
seems simpler than checking it at runtime. Verifier already handles
MEM_RDONLY generically, you only need to add the guard for
check_packet_access (and check_helper_mem_access for meta->raw_mode
under the pkt case), and rejecting dynptr_write seems like an if
condition.
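A rough sketch of the static approach (schematic and untested, just
reusing names from the patch):

if (func_id == BPF_FUNC_dynptr_data &&
    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
	regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
	/* statically known: this prog type cannot write pkt data, so
	 * mark the slice read-only and let the verifier reject stores
	 * through it generically
	 */
	if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
		regs[BPF_REG_0].type |= MEM_RDONLY;
	regs[BPF_REG_0].range = meta.mem_size;
}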

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-24 23:04           ` Kumar Kartikeya Dwivedi
@ 2022-08-25 20:14             ` Joanne Koong
  2022-08-25 21:53             ` Andrii Nakryiko
  2022-08-26  6:37             ` Martin KaFai Lau
  2 siblings, 0 replies; 43+ messages in thread
From: Joanne Koong @ 2022-08-25 20:14 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Toke Høiland-Jørgensen, bpf, andrii, daniel, ast,
	kafai, kuba, netdev, brouer, lorenzo

On Wed, Aug 24, 2022 at 4:04 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Joanne Koong <joannelkoong@gmail.com> writes:
> > > >> [...]
> > > >> It's a bit awkward to have such a difference between xdp and skb
> > > >> dynptr's read/write. I understand why it is the way it is, but it
> > > >> still doesn't feel right. I'm not sure if we can reconcile the
> > > >> differences, but it makes writing common code for both xdp and tc
> > > >> harder as it needs to be aware of the differences (and then the flags
> > > >> for dynptr_write would differ too). So we're 90% there but not the
> > > >> whole way...
> > > >
> > > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > > same, but I'm not seeing a better solution here (unless we invalidate
> > > > data slices on writes in xdp progs, just to make it match more :P).
> > > >
> > > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > > not convinced this is much of a problem - xdp and skb programs already
> > > > have different interfaces for doing things (eg
> > > > bpf_{skb/xdp}_{store/load}_bytes).
> > >
> > > This is true, but it's quite possible to paper over these differences
> > > and write BPF code that works for both TC and XDP. Subtle semantic
> > > differences in otherwise identical functions makes this harder.
> > >
> > > Today you can write a function like:
> > >
> > > static inline int parse_pkt(void *data, void* data_end)
> > > {
> > >         /* parse data */
> > > }
> > >
> > > And call it like:
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         return parse_pkt(ctx->data, ctx->data_end);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         return parse_pkt(skb->data, skb->data_end);
> > > }
> > >
> > >
> > > IMO the goal should be to be able to do the equivalent for dynptrs, like:
> > >
> > > static inline int parse_pkt(struct bpf_dynptr *ptr)
> > > {
> > >         __u64 *data;
> > >
> > >         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
> > >         if (!data)
> > >                 return 0;
> > >         /* parse data */
> > > }
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_xdp(ctx, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_skb(skb, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> >
> > To clarify, this is already possible when using data slices, since the
> > behavior for data slices is equivalent between xdp and tc programs for
> > non-fragmented accesses. From looking through the selftests, I
> > anticipate that data slices will be the main way programs interact
> > with dynptrs. For the cases where the program may write into frags,
> > bpf_dynptr_write will be needed (which is where the functionality
> > between xdp and tc starts differing) - today, we're not able to write
> > common code that writes into the frags since tc uses
> > bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
>
> Yes, we cannot write code that just swaps those two calls right now.
> The key difference is, the two above are separate functions already
> and have different semantics. Here in this set you have the same
> function, but with different semantics depending on ctx, which is the
> point of contention.
>
> >
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
> >
>
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...

This was also previously discussed in [0].

After using skb dynptrs for the test_cls_redirect_dynptr.c selftest
[1], I'm convinced that allowing bpf_dynptr_write into frags is the
correct way to go. There are instances where the data is probably in
the head, with a slight off-chance that it's in a fragment - being
able to call bpf_dynptr_write makes that case trivial (see the sketch
below), whereas the alternative is either 1) always pulling, which in
most cases will be unnecessary, or 2) special-casing the path where
bpf_dynptr_write fails. Additionally, failing writes until the skb is
pulled hurts the ability to write common code that both xdp and skb
progs can use.


[0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#md6c17d9916f5937a9ae9dfca11e815e4b89009fb

[1] https://lore.kernel.org/bpf/20220822235649.2218031-4-joannelkoong@gmail.com/
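For illustration, the kind of common helper this enables (a sketch;
the function name and the idea of the offset are made up, not taken
from the selftest):

/* patch a 2-byte field that usually sits in the head but may
 * occasionally land in a frag - no bpf_skb_pull_data() and no
 * special-casing needed, for both skb and xdp dynptrs
 */
static __always_inline int patch_field16(struct bpf_dynptr *ptr,
					 __u32 off, __u16 val)
{
	return bpf_dynptr_write(ptr, off, &val, sizeof(val), 0);
}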

>
> > This will unite the functionality for skb and xdp programs across
> > bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> > we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> > common bpf_dynptr_from_packet as Jakub brought up in [0], I'm leaning
> > towards no because 1) if in the future there's some irreconcilable
> > aspect between skb and xdp that gets added, that'll be hard to support
> > since the expectation is that there is just one overall "packet
> > dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> > bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> > this adds some additional hardcoding in the verifier since there's no
> > organic mapping between prog type -> prog ctx
> >
>
> There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
> misunderstood what you meant).

Cool, that should be pretty straightforward then; from what it looks
like, btf_get_prog_ctx_type() + 1 will give the btf id, and from that
we can use btf_type_by_id() to get the corresponding struct btf_type,
then use btf_type->name_off to check whether the name corresponds to
"__sk_buff" or "xdp_md".

>
> I also had a different tangential question (unrelated to
> 'reconciliation'). Let's forget about that for a moment. I was
> listening to the talk here [0]. At 12:00, I can hear someone
> mentioning that you'd have a dynptr for each segment/frag.
>
> Right now, you have a skb/xdp dynptr, which is representing
> discontiguous memory, i.e. you represent all linear page + fragments
> using one dynptr. This seems a bit inconsistent with the idea of a
> dynptr, since it's conceptually a slice of variable length memory.
> dynptr_data gives you a constant length slice into dynptr's variable
> length memory (which the verifier can track bounds of statically).
>
> So thinking about the idea in [0], one way would be to change
> from_xdp/from_skb helpers to take an index (into the frags array), and
> return dynptr for each frag. Or determine frag from offset + len. For
> the skb case it might require kmap_atomic to access the frag dynptr,
> which might need a paired kunmap_atomic call as well, and it may not
> return dynptr for frags in cloned skbs, but for XDP multi-buff all
> that doesn't count.
>
> Was this approach/line of thinking considered and/or rejected? What
> were the reasons? Just trying to understand the background here.
>
> What I'm wondering is that we already have helpers to do reads and
> writes that you are wrapping over in dynptr interfaces, but what we're
> missing is to be able to get direct access to frags (or 'next' ctx,
> essentially). When we know we can avoid pulls, it might be cheaper to
> then do access this way to skb, right? Especially for a case where
> just a part of the header lies in the next frag?

Imo, having a separate dynptr for each frag is confusing for users. My
intuition is that in the majority of cases, the user doesn't care how
the data is laid out at the frag level; they just want to write some
data at some offset and length. If every frag is a separate dynptr,
then the user needs to be aware of whether they're crossing frag (and
thus dynptr) boundaries. In cases where the user doesn't know if the
data is across head/frag or across frags, their prog code will need to
be more complex to handle these different scenarios. There's the
additional issue that this doesn't work for cloned skb frags - users
would need to know whether the skb is cloned or uncloned to know
whether to pull first. Imo it makes the interface more confusing. I'm
still not sure what is gained by having separate dynptrs for separate
frags.

If the goal is to avoid pulls, then we're able to do that without
having separate dynptrs. If a write is to an uncloned frag, then
bpf_dynptr_write() can kmap, write the data, and kunmap (though it's
still unclear whether this should be done, since it may be faster to
pull and do a large number of writes instead of multiple kmap/kunmap
calls).
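
(For reference, roughly the kind of kernel-side write that would avoid
the pull - a schematic sketch, not code from this series, and it
assumes off/len stay within the single uncloned frag i:)

skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
void *addr = kmap_atomic(skb_frag_page(frag));

memcpy(addr + skb_frag_off(frag) + off_in_frag, src, len);
kunmap_atomic(addr);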

>
> But I'm no expert here, so I might be missing some subtleties.
>
>   [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M
>
> >
> > [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
> >
> > >
> > > If the dynptr-based parse_pkt() function has to take special care to
> > > figure out where the dynptr comes from, it makes it a lot more difficult
> > > to write reusable packet parsing functions. So I'd be in favour of
> > > restricting the dynptr interface to the lowest common denominator of the
> > > skb and xdp interfaces even if that makes things slightly more awkward
> > > in the specialised cases...
> > >
> > > -Toke
> > >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-24 21:10       ` Kumar Kartikeya Dwivedi
@ 2022-08-25 20:39         ` Joanne Koong
  2022-08-25 23:18           ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-25 20:39 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > > > [...]
> > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > >
> > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > >
> > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > as modified range is not considered for it. Same kind of issues when
> > > doing comparison with data_end from ctx (though maybe you won't be
> > > able to do incorrect data access at runtime using that).
> > >
> > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > and that disabled comparisons for them. reg->id is used for var_off
> > > range propagation so it cannot be reused.
> > >
> > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > in verifier log), or add some way to distinguish slice vs non-slice
> > > pkt pointers like I did in my patch. You might also want to add some
> > > tests for this corner case (there are some later in [0] if you want to
> > > reuse them).
> > >
> > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > not seem worth it. The pkt_uid distinction is more useful (and
> > > actually would be needed) in Toke's xdp queueing series, where in a
> > > dequeue program you have multiple xdp_mds and want scoped slice
> > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > slices of some other xdp_md). Here we can just get away with normal
> > > PTR_TO_MEM.
> > >
> > > ... Or just let me know if you handle this correctly already, or if
> > > this won't be an actual problem :).
> >
> > Ooh interesting, I hadn't previously taken a look at
> > try_match_pkt_pointers(), thanks for mentioning it :)
> >
> > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > we reject x behavior on the packet data reg types, this will
> > automatically apply to the data slices), and because it'll keep the
> > logic more efficient/simpler for the case when the pkt pointer has to
> > be cleared after any helper that changes pkt data is called (aka the
> > case where the data slice gets invalidated).
> >
> > What are your thoughts?
> >
>
> Thinking more deeply about it, probably not, we need more work here. I
> remember _now_ why I chose the pkt_uid approach (and this tells us my
> commit log lacks all the details about the motivation :( ).
>
> Consider how equivalency checking for packet pointers works in
> regsafe. It is checking type, then if old range > cur range, then
> offs, etc.
>
> The problem is, while we now don't prune on access to ptr_to_pkt vs
> ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> separate frags, so this assumption would be incorrect. I would be able
> to trick the verifier into accessing data beyond the length of a
> different frag, by first making sure one line of exploration is
> verified, and then changing the register in another branch reaching
> the same branch target. Helpers can take packet pointers so the access
> can become a pruning point. It would think the rest of the stuff is
> safe, while they are not equivalent at all. It is ok if they are bit
> by bit equivalent (same type, range, off, etc.).

Thanks for the explanation. To clarify, if the old range of ptr_to_pkt >
the cur range of ptr_to_pkt, what gets pruned? Is it the access to the
cur range of ptr_to_pkt, on the reasoning that if the old range is
acceptable then the cur range must definitely be acceptable?

>
> If you start returning false whenever you see this type tag set, it
> will become too conservative (it considers reg copies of the same
> dynptr_data lookup as distinct). So you need some kind of id assigned
> during dynptr_data lookup to distinguish them.

What about this: if the dynptr_pkt type tag is set, then we compare the
ranges as well - if the ranges are the same, we return true, else
false. Does that work?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-24 23:25     ` Kumar Kartikeya Dwivedi
@ 2022-08-25 21:02       ` Joanne Koong
  2022-08-26  0:18         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-25 21:02 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Andrii Nakryiko, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Wed, Aug 24, 2022 at 4:26 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 20:42, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 4:57 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > benefits. One is that they allow operations on sizes that are not
> > > statically known at compile-time (eg variable-sized accesses).
> > > Another is that parsing the packet data through dynptrs (instead of
> > > through direct access of skb->data and skb->data_end) can be more
> > > ergonomic and less brittle (eg does not need manual if checking for
> > > being within bounds of data_end).
> > >
> > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > read-only. For reads and writes through the bpf_dynptr_read() and
> > > bpf_dynptr_write() interfaces, this supports reading and writing into
> > > data in the non-linear paged buffers. For data slices (through the
> > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > portion. The returned data slice from a call to bpf_dynptr_data() is of
> > > reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
> > >
> > > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > > to data in a paged buffer, so it will need to pull the buffer first into
> > > the head. The reason it needs to be pulled instead of writing directly to
> > > the paged buffers is because they may be cloned (only the head of the skb
> > > is by default uncloned). As such, any bpf_dynptr_write() will
> > > automatically have its prior data slices invalidated, even if the write
> > > is to data in the skb head (the verifier has no way of differentiating
> > > whether the write is to the head or paged buffers during program load
> > > time). Please note as well that any other helper calls that change the
> > > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > > slices of the skb dynptr as well. Whenever such a helper call is made,
> > > the verifier marks any PTR_TO_PACKET reg type (which includes skb dynptr
> > > slices since they are PTR_TO_PACKETs) as unknown. The stack trace for
> > > this is check_helper_call() -> clear_all_pkt_pointers() ->
> > > __clear_all_pkt_pointers() -> mark_reg_unknown()
> > >
> > > For examples of how skb dynptrs can be used, please see the attached
> > > selftests.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > >  include/linux/bpf.h            | 83 +++++++++++++++++-----------
> > >  include/linux/filter.h         |  4 ++
> > >  include/uapi/linux/bpf.h       | 40 ++++++++++++--
> > >  kernel/bpf/helpers.c           | 81 +++++++++++++++++++++++++---
> > >  kernel/bpf/verifier.c          | 99 ++++++++++++++++++++++++++++------
> > >  net/core/filter.c              | 53 ++++++++++++++++--
> > >  tools/include/uapi/linux/bpf.h | 40 ++++++++++++--
> > >  7 files changed, 335 insertions(+), 65 deletions(-)
> > >
> >
> > [...]
> >
> > > @@ -1521,9 +1532,19 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
> > >         if (err)
> > >                 return err;
> > >
> > > -       memcpy(dst, src->data + src->offset + offset, len);
> > > +       type = bpf_dynptr_get_type(src);
> > >
> > > -       return 0;
> > > +       switch (type) {
> > > +       case BPF_DYNPTR_TYPE_LOCAL:
> > > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > > +               memcpy(dst, src->data + src->offset + offset, len);
> > > +               return 0;
> > > +       case BPF_DYNPTR_TYPE_SKB:
> > > +               return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> > > +       default:
> > > +               WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
> >
> > nit: probably better to WARN_ONCE?
> >
> > > +               return -EFAULT;
> > > +       }
> > >  }
> > >
> > >  static const struct bpf_func_proto bpf_dynptr_read_proto = {
> > > @@ -1540,18 +1561,32 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
> > >  BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
> > >            u32, len, u64, flags)
> > >  {
> > > +       enum bpf_dynptr_type type;
> > >         int err;
> > >
> > > -       if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> > > +       if (!dst->data || bpf_dynptr_is_rdonly(dst))
> > >                 return -EINVAL;
> > >
> > >         err = bpf_dynptr_check_off_len(dst, offset, len);
> > >         if (err)
> > >                 return err;
> > >
> > > -       memcpy(dst->data + dst->offset + offset, src, len);
> > > +       type = bpf_dynptr_get_type(dst);
> > >
> > > -       return 0;
> > > +       switch (type) {
> > > +       case BPF_DYNPTR_TYPE_LOCAL:
> > > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > > +               if (flags)
> > > +                       return -EINVAL;
> > > +               memcpy(dst->data + dst->offset + offset, src, len);
> > > +               return 0;
> > > +       case BPF_DYNPTR_TYPE_SKB:
> > > +               return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
> > > +                                            flags);
> > > +       default:
> > > +               WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
> >
> > ditto about WARN_ONCE
> >
> > > +               return -EFAULT;
> > > +       }
> > >  }
> > >
> > >  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> >
> > [...]
> >
> > > +static enum bpf_dynptr_type stack_slot_get_dynptr_info(struct bpf_verifier_env *env,
> > > +                                                      struct bpf_reg_state *reg,
> > > +                                                      int *ref_obj_id)
> > >  {
> > >         struct bpf_func_state *state = func(env, reg);
> > >         int spi = get_spi(reg->off);
> > >
> > > -       return state->stack[spi].spilled_ptr.id;
> > > +       if (ref_obj_id)
> > > +               *ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > +
> > > +       return state->stack[spi].spilled_ptr.dynptr.type;
> > >  }
> > >
> > >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > @@ -6056,7 +6075,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                         case DYNPTR_TYPE_RINGBUF:
> > >                                 err_extra = "ringbuf ";
> > >                                 break;
> > > -                       default:
> > > +                       case DYNPTR_TYPE_SKB:
> > > +                               err_extra = "skb ";
> > >                                 break;
> > >                         }
> > >
> > > @@ -7149,6 +7169,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >  {
> > >         enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> > >         const struct bpf_func_proto *fn = NULL;
> > > +       enum bpf_dynptr_type dynptr_type;
> >
> > compiler technically can complain about use of uninitialized
> > dynptr_type, maybe initialize it to BPF_DYNPTR_TYPE_INVALID ?
> >
> > >         enum bpf_return_type ret_type;
> > >         enum bpf_type_flag ret_flag;
> > >         struct bpf_reg_state *regs;
> > > @@ -7320,24 +7341,43 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >                         }
> > >                 }
> > >                 break;
> > > -       case BPF_FUNC_dynptr_data:
> > > -               for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
> > > -                       if (arg_type_is_dynptr(fn->arg_type[i])) {
> > > -                               if (meta.ref_obj_id) {
> > > -                                       verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> > > -                                       return -EFAULT;
> > > -                               }
> > > -                               /* Find the id of the dynptr we're tracking the reference of */
> > > -                               meta.ref_obj_id = stack_slot_get_id(env, &regs[BPF_REG_1 + i]);
> > > -                               break;
> > > -                       }
> > > +       case BPF_FUNC_dynptr_write:
> > > +       {
> > > +               struct bpf_reg_state *reg;
> > > +
> > > +               reg = get_dynptr_arg_reg(fn, regs);
> > > +               if (!reg) {
> > > +                       verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
> >
> > s/bpf_dynptr_data/bpf_dynptr_write/
> >
> > > +                       return -EFAULT;
> > >                 }
> > > -               if (i == MAX_BPF_FUNC_REG_ARGS) {
> > > +
> > > +               /* bpf_dynptr_write() for skb-type dynptrs may pull the skb, so we must
> > > +                * invalidate all data slices associated with it
> > > +                */
> > > +               if (stack_slot_get_dynptr_info(env, reg, NULL) == BPF_DYNPTR_TYPE_SKB)
> > > +                       changes_data = true;
> > > +
> > > +               break;
> > > +       }
> > > +       case BPF_FUNC_dynptr_data:
> > > +       {
> > > +               struct bpf_reg_state *reg;
> > > +
> > > +               reg = get_dynptr_arg_reg(fn, regs);
> > > +               if (!reg) {
> > >                         verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
> > >                         return -EFAULT;
> > >                 }
> > > +
> > > +               if (meta.ref_obj_id) {
> > > +                       verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> > > +                       return -EFAULT;
> > > +               }
> > > +
> > > +               dynptr_type = stack_slot_get_dynptr_info(env, reg, &meta.ref_obj_id);
> > >                 break;
> > >         }
> > > +       }
> > >
> > >         if (err)
> > >                 return err;
> > > @@ -7397,8 +7437,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >                 break;
> > >         case RET_PTR_TO_ALLOC_MEM:
> > >                 mark_reg_known_zero(env, regs, BPF_REG_0);
> > > -               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > -               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > +
> > > +               if (func_id == BPF_FUNC_dynptr_data &&
> > > +                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > +                       regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > +                       regs[BPF_REG_0].range = meta.mem_size;
> > > +               } else {
> > > +                       regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > +                       regs[BPF_REG_0].mem_size = meta.mem_size;
> > > +               }
> > >                 break;
> > >         case RET_PTR_TO_MEM_OR_BTF_ID:
> > >         {
> > > @@ -14141,6 +14188,24 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >                         goto patch_call_imm;
> > >                 }
> > >
> > > +               if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > +                       bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > +
> > > +                       insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, is_rdonly);
> > > +                       insn_buf[1] = *insn;
> > > +                       cnt = 2;
> >
> > This might have been discussed before, sorry if I missed it. But why
> > this special rewrite to provide read-only flag vs having two
> > implementations of bpf_dynptr_from_skb() depending on program types?
> > If program type allows read+write access, return
> > bpf_dynptr_from_skb_rdwr(), for those that have read-only access -
> > bpf_dynptr_from_skb_rdonly(), and for others return NULL proto
> > (disable helper)?

Ooh I love this idea, thanks Andrii! I'll add this in v5.

> >
> > You can then use it very similarly for bpf_dynptr_from_skb_meta().
> >
>
> Related question, it seems we know statically if dynptr is read only
> or not, so why even do all this hidden parameter passing and instead
> just reject writes directly? You only need to be able to set
> MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> seems simpler than checking it at runtime. Verifier already handles
> MEM_RDONLY generically, you only need to add the guard for
> check_packet_access (and check_helper_mem_access for meta->raw_mode
> under the pkt case), and rejecting dynptr_write seems like an if
> condition.

There will be other helper functions that do writes (eg memcpy to
dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
dynptrs, ...) so it's more scalable if we reject these at runtime
rather than enforce these at the verifier level. I also think it's
cleaner to keep the verifier logic as simple as possible and do the
checking in the helper.

There's a prior discussion about this in v1 [0] as well.

[0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf3fe3965bc1852b07b8f2d306d09818b35acf3c1
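
For example, a future write-style helper would just reuse the same
runtime guard that bpf_dynptr_write() already has, with no new verifier
logic per helper. A sketch (bpf_dynptr_memcpy is hypothetical; the
local/ringbuf case is shown, skb would go through
__bpf_skb_store_bytes()):

BPF_CALL_4(bpf_dynptr_memcpy, struct bpf_dynptr_kern *, dst, u32, offset,
	   void *, src, u32, len)
{
	int err;

	/* same runtime checks bpf_dynptr_write() does today */
	if (!dst->data || bpf_dynptr_is_rdonly(dst))
		return -EINVAL;

	err = bpf_dynptr_check_off_len(dst, offset, len);
	if (err)
		return err;

	memcpy(dst->data + dst->offset + offset, src, len);
	return 0;
}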

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-24 23:04           ` Kumar Kartikeya Dwivedi
  2022-08-25 20:14             ` Joanne Koong
@ 2022-08-25 21:53             ` Andrii Nakryiko
  2022-08-26  6:37             ` Martin KaFai Lau
  2 siblings, 0 replies; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-25 21:53 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Joanne Koong, Toke Høiland-Jørgensen, bpf, andrii,
	daniel, ast, kafai, kuba, netdev, brouer, lorenzo

On Wed, Aug 24, 2022 at 4:04 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Joanne Koong <joannelkoong@gmail.com> writes:
> > > >> [...]
> > > >> It's a bit awkward to have such a difference between xdp and skb
> > > >> dynptr's read/write. I understand why it is the way it is, but it
> > > >> still doesn't feel right. I'm not sure if we can reconcile the
> > > >> differences, but it makes writing common code for both xdp and tc
> > > >> harder as it needs to be aware of the differences (and then the flags
> > > >> for dynptr_write would differ too). So we're 90% there but not the
> > > >> whole way...
> > > >
> > > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > > same, but I'm not seeing a better solution here (unless we invalidate
> > > > data slices on writes in xdp progs, just to make it match more :P).
> > > >
> > > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > > not convinced this is much of a problem - xdp and skb programs already
> > > > have different interfaces for doing things (eg
> > > > bpf_{skb/xdp}_{store/load}_bytes).
> > >
> > > This is true, but it's quite possible to paper over these differences
> > > and write BPF code that works for both TC and XDP. Subtle semantic
> > > differences in otherwise identical functions makes this harder.
> > >
> > > Today you can write a function like:
> > >
> > > static inline int parse_pkt(void *data, void* data_end)
> > > {
> > >         /* parse data */
> > > }
> > >
> > > And call it like:
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         return parse_pkt(ctx->data, ctx->data_end);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         return parse_pkt(skb->data, skb->data_end);
> > > }
> > >
> > >
> > > IMO the goal should be to be able to do the equivalent for dynptrs, like:
> > >
> > > static inline int parse_pkt(struct bpf_dynptr *ptr)
> > > {
> > >         __u64 *data;
> > >
> > >         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
> > >         if (!data)
> > >                 return 0;
> > >         /* parse data */
> > > }
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_xdp(ctx, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_skb(skb, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> >
> > To clarify, this is already possible when using data slices, since the
> > behavior for data slices is equivalent between xdp and tc programs for
> > non-fragmented accesses. From looking through the selftests, I
> > anticipate that data slices will be the main way programs interact
> > with dynptrs. For the cases where the program may write into frags,
> > bpf_dynptr_write will be needed (which is where the functionality
> > between xdp and tc starts differing) - today, we're not able to write
> > common code that writes into the frags since tc uses
> > bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
>
> Yes, we cannot write code that just swaps those two calls right now.
> The key difference is, the two above are separate functions already
> and have different semantics. Here in this set you have the same
> function, but with different semantics depending on ctx, which is the
> point of contention.

bpf_dynptr_{read,write}() are effectively virtual methods of a dynptr,
and we have a few different implementations of dynptr depending on what
data they are wrapping, so having different semantics depending on ctx
makes sense, if they are dictated by good reasons (like good
performance for skb and xdp).

>
> >
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
> >
>
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...

Wouldn't this kill performance for typical cases when writes don't go
into frags?

>
> > This will unite the functionality for skb and xdp programs across
> > bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> > we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> > common bpf_dynptr_from_packet as Jakub brought up in [0], I'm leaning
> > towards no because 1) if in the future there's some irreconcilable
> > aspect between skb and xdp that gets added, that'll be hard to support
> > since the expectation is that there is just one overall "packet
> > dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> > bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> > this adds some additional hardcoding in the verifier since there's no
> > organic mapping between prog type -> prog ctx
> >
>
> There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
> misunderstood what you meant).
>
> I also had a different tangential question (unrelated to
> 'reconciliation'). Let's forget about that for a moment. I was
> listening to the talk here [0]. At 12:00, I can hear someone
> mentioning that you'd have a dynptr for each segment/frag.

One of those people was me :) I think what you are referring to was an
idea that for frags we might add a special iterator function that will
call a callback for each linear frag. And in such a case the linear
frag will be presented to the callback as a dynptr (just as an
interface to a statically unknown memory region; it will be
DYNPTR_LOCAL at that point, probably).

It's different from DYNPTR_SKB that Joanne is adding here, though.

>
> Right now, you have a skb/xdp dynptr, which is representing
> discontiguous memory, i.e. you represent all linear page + fragments
> using one dynptr. This seems a bit inconsistent with the idea of a
> dynptr, since it's conceptually a slice of variable length memory.
> dynptr_data gives you a constant length slice into dynptr's variable
> length memory (which the verifier can track bounds of statically).

All the data in skb is conceptually a slice of variable length memory
with no gaps in between. It's just physically discontiguous, but there
is still linear offset addressing (conceptually). So I think it makes
sense to represent the skb data as a whole as one SKB dynptr. There
are, inevitably, technical restrictions on direct memory access
through bpf_dynptr_data().


>
> So thinking about the idea in [0], one way would be to change
> from_xdp/from_skb helpers to take an index (into the frags array), and
> return dynptr for each frag. Or determine frag from offset + len. For
> the skb case it might require kmap_atomic to access the frag dynptr,
> which might need a paired kunmap_atomic call as well, and it may not
> return dynptr for frags in cloned skbs, but for XDP multi-buff all
> that doesn't count.

We could do something like bpf_dynptr_from_skb_frag(skb, N) and get
access to frag #N (where we can special-case 0 being the linear skb
data), but that would be DYNPTR_LOCAL at that point, probably? And it's
a separate extension in addition to DYNPTR_SKB. Having generic
bpf_dynptr_{read,write}() working over the entire skb range is very
useful, and a per-frag dynptr would prevent that. So let's keep those
two separate.

>
> Was this approach/line of thinking considered and/or rejected? What
> were the reasons? Just trying to understand the background here.
>
> What I'm wondering is that we already have helpers to do reads and
> writes that you are wrapping over in dynptr interfaces, but what we're
> missing is to be able to get direct access to frags (or 'next' ctx,
> essentially). When we know we can avoid pulls, it might be cheaper to
> then do access this way to skb, right? Especially for a case where
> just a part of the header lies in the next frag?
>
> But I'm no expert here, so I might be missing some subtleties.
>
>   [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M
>
> >
> > [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
> >
> > >
> > > If the dynptr-based parse_pkt() function has to take special care to
> > > figure out where the dynptr comes from, it makes it a lot more difficult
> > > to write reusable packet parsing functions. So I'd be in favour of
> > > restricting the dynptr interface to the lowest common denominator of the
> > > skb and xdp interfaces even if that makes things slightly more awkward
> > > in the specialised cases...
> > >
> > > -Toke
> > >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-25 20:39         ` Joanne Koong
@ 2022-08-25 23:18           ` Kumar Kartikeya Dwivedi
  2022-08-26 18:23             ` Joanne Koong
  0 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-25 23:18 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

On Thu, 25 Aug 2022 at 22:39, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > > > [...]
> > > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > > >
> > > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > > >
> > > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > > as modified range is not considered for it. Same kind of issues when
> > > > doing comparison with data_end from ctx (though maybe you won't be
> > > > able to do incorrect data access at runtime using that).
> > > >
> > > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > > and that disabled comparisons for them. reg->id is used for var_off
> > > > range propagation so it cannot be reused.
> > > >
> > > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > > in verifier log), or add some way to distinguish slice vs non-slice
> > > > pkt pointers like I did in my patch. You might also want to add some
> > > > tests for this corner case (there are some later in [0] if you want to
> > > > reuse them).
> > > >
> > > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > > not seem worth it. The pkt_uid distinction is more useful (and
> > > > actually would be needed) in Toke's xdp queueing series, where in a
> > > > dequeue program you have multiple xdp_mds and want scoped slice
> > > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > > slices of some other xdp_md). Here we can just get away with normal
> > > > PTR_TO_MEM.
> > > >
> > > > ... Or just let me know if you handle this correctly already, or if
> > > > this won't be an actual problem :).
> > >
> > > Ooh interesting, I hadn't previously taken a look at
> > > try_match_pkt_pointers(), thanks for mentioning it :)
> > >
> > > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > > we reject x behavior on the packet data reg types, this will
> > > automatically apply to the data slices), and because it'll keep the
> > > logic more efficient/simpler for the case when the pkt pointer has to
> > > be cleared after any helper that changes pkt data is called (aka the
> > > case where the data slice gets invalidated).
> > >
> > > What are your thoughts?
> > >
> >
> > Thinking more deeply about it, probably not, we need more work here. I
> > remember _now_ why I chose the pkt_uid approach (and this tells us my
> > commit log lacks all the details about the motivation :( ).
> >
> > Consider how equivalency checking for packet pointers works in
> > regsafe. It is checking type, then if old range > cur range, then
> > offs, etc.
> >
> > The problem is, while we now don't prune on access to ptr_to_pkt vs
> > ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> > return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > > cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> > separate frags, so this assumption would be incorrect. I would be able
> > to trick the verifier into accessing data beyond the length of a
> > different frag, by first making sure one line of exploration is
> > verified, and then changing the register in another branch reaching
> > the same branch target. Helpers can take packet pointers so the access
> > can become a pruning point. It would think the rest of the stuff is
> > safe, while they are not equivalent at all. It is ok if they are bit
> > by bit equivalent (same type, range, off, etc.).
>
> Thanks for the explanation. To clarify, if the old range of ptr_to_pkt >
> the cur range of ptr_to_pkt, what gets pruned? Is it the access to the
> cur range of ptr_to_pkt, on the reasoning that if the old range is
> acceptable then the cur range must definitely be acceptable?

No, my description was bad :).
We return false when old_range > cur_range, i.e. the path is
considered safe and not explored again when old_range <= cur_range
(pruning); otherwise we continue verifying.
Consider if it was doing a pkt[cur_range + 1] access in the path we are
about to explore again (already verified for old_range). That is <=
old_range, but > cur_range, so it would be problematic if we pruned
the search when old_range > cur_range.
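
Concretely, with made-up numbers:

  verified path:  old_range = 100, and a pkt[60] access was accepted
  new path:       cur_range = 50
  old_range (100) > cur_range (50) -> regsafe() returns false, and the
  new path is re-verified instead of being pruned

If we pruned in that case instead, the pkt[60] access would be assumed
safe for cur_range = 50, which is out of bounds.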

So maybe it won't be a problem here, and the current range checks
for pkt pointer slices are fine even if they belong to different frags.
I didn't craft any test case when writing my previous reply.
Especially since you will disable comparisons, one cannot relearn the
range again using var_off + comparison, which closes another loophole.

It just seems simpler to me to be a bit more conservative, since it is
only an optimization. There might be some corner case lurking we can't
think of right now. But I leave the judgement up to you if you can
reason about it. In either case it would be good to include some
comments in the commit log about all this.

Meanwhile, looking at the current code, I'm more inclined to suggest
PTR_TO_MEM (and handle invalidation specially), but again, I will
leave it up to you to decide.

When we do += var_off to a pkt reg, its range is reset to zero,
compared to PTR_TO_MEM, where off + var_off (smin/umax) is used to
check against the actual size for an access, which is a bit more
flexible. The reason to reset the range is that it will be relearned
using comparisons and transferred to copies (reg->id is assigned for
each += var_off), which doesn't apply to slice pointers (essentially
the only reason to keep them as pkt pointers is being able to pick
them for invalidation); we try to disable the rest of the pkt pointer
magic in the verifier anyway.

pkt_reg->umax_value influences prog->aux->max_pkt_offset (and iiuc
it can reach that point with range == 0 after += var_off, and
zero_size_allowed == true). That only seems to be used by Netronome's
eBPF offload for now, but it's still a bit confusing if slice pkt
pointers cause it to change.

>
> >
> > If you start returning false whenever you see this type tag set, it
> > will become too conservative (it considers reg copies of the same
> > dynptr_data lookup as distinct). So you need some kind of id assigned
> > during dynptr_data lookup to distinguish them.
>
> What about this: if the dynptr_pkt type tag is set, then we compare the
> ranges as well - if the ranges are the same, we return true, else
> false. Does that work?

Seems like it, and the 'return true' part is already covered by the
memcmp at the start of the function, I think.
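
So something like (schematic; the DYNPTR_PKT flag name is made up):

case PTR_TO_PACKET:
	if (rold->type & DYNPTR_PKT) {
		/* dynptr slices: bit-identical states were already
		 * accepted by the memcmp() fast path at the top of
		 * regsafe(); treat anything else as distinct
		 */
		return false;
	}
	/* existing range/off checks for ctx pkt pointers follow */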

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-25 21:02       ` Joanne Koong
@ 2022-08-26  0:18         ` Kumar Kartikeya Dwivedi
  2022-08-26 18:44           ` Joanne Koong
  0 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-26  0:18 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Andrii Nakryiko, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> [...]
> >
> > Related question, it seems we know statically if dynptr is read only
> > or not, so why even do all this hidden parameter passing and instead
> > just reject writes directly? You only need to be able to set
> > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > seems simpler than checking it at runtime. Verifier already handles
> > MEM_RDONLY generically, you only need to add the guard for
> > check_packet_acces (and check_helper_mem_access for meta->raw_mode
> > under pkt case), and rejecting dynptr_write seems like a if condition.
>
> There will be other helper functions that do writes (eg memcpy to
> dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> dynptrs, ...) so it's more scalable if we reject these at runtime
> rather than enforce these at the verifier level. I also think it's
> cleaner to keep the verifier logic as simple as possible and do the
> checking in the helper.

I won't be pushing this further, since you know what you plan to add
in the future better, but I still disagree.

I'm guessing there might be dynptrs where this read only property is
set dynamically at runtime, which is why you want to go this route?
I.e. you might not know statically whether dynptr is read only or not?

My main confusion is the inconsistency here.

Right now the patch implicitly relies on may_access_direct_pkt_data to
protect slices returned from dynptr_data, instead of setting
MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
needed. So indirectly, you are relying on knowing statically whether
the dynptr is read only or not. But then you also set this bit at
runtime.

So you reject some cases at load time, and the rest of them only at
runtime. Direct writes to dynptr slice fails load, writes through
helper does not (only fails at runtime).

Also, dynptr_data needs to know whether dynptr is read only
statically, to protect writes to its returned pointer, unless you
decide to introduce another helper for the dynamic rdonly bit case
(like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
works for some rdonly dynptrs (known to be rdonly statically, like
this skb one), but not for others.

I also don't agree about the complexity or scalability part; all the
infra and precedent is already there. We already have similar checks
for meta->raw_mode where we reject writes to read-only pointers in
check_helper_mem_access.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-24 23:04           ` Kumar Kartikeya Dwivedi
  2022-08-25 20:14             ` Joanne Koong
  2022-08-25 21:53             ` Andrii Nakryiko
@ 2022-08-26  6:37             ` Martin KaFai Lau
  2022-08-26  6:50               ` Martin KaFai Lau
  2022-08-26 19:09               ` Kumar Kartikeya Dwivedi
  2 siblings, 2 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2022-08-26  6:37 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Kumar Kartikeya Dwivedi, Toke Høiland-Jørgensen, bpf,
	andrii, daniel, ast, kuba, netdev, brouer, lorenzo

On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
I don't think it is necessary (or helps) to restrict the xdp slice from getting
a fragment.  In any case, the xdp prog has to deal with the case
that bpf_dynptr_data(xdp_dynptr, offset, len) is across two fragments.
Although unlikely, it still needs to handle it (dynptr_data returns NULL)
properly by using bpf_dynptr_read().  The same applies to the skb case,
which also needs to handle dynptr_data returning NULL.

Something like Kumar's sample in [0] should work for both
xdp and skb dynptr but replace the helpers with
bpf_dynptr_{data,read,write}().

[0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c
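
i.e. the usual fallback in the prog (a sketch; the tcp header and the
offset choice are just for illustration):

struct tcphdr *th, thbuf;

th = bpf_dynptr_data(&ptr, off, sizeof(*th));
if (!th) {
	/* slice unavailable (e.g. crosses frags): copy out through
	 * the dynptr instead
	 */
	if (bpf_dynptr_read(&thbuf, sizeof(thbuf), &ptr, off, 0))
		return XDP_DROP;
	th = &thbuf;
}
/* parse th as usual */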

> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
I wonder if the 'Before' behavior can be kept as is.

The bpf prog that runs in both xdp and bpf should be
the one always expecting the data-slice will be invalidated and
it has to call the bpf_dynptr_data() again after writing.
Yes, it is unnecessary for xdp, but the bpf prog needs to do the
same anyway if the verifier were the one enforcing it for
both skb and xdp dynptr types.

If the bpf prog is written for xdp alone, then it has
no need to re-get the bpf_dynptr_data() after writing.
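
For a prog meant to be portable across skb and xdp, the pattern would
always be (sketch):

	p = bpf_dynptr_data(&ptr, off, len);
	/* ... use p ... */
	bpf_dynptr_write(&ptr, off, buf, len, 0);
	/* skb: p was just invalidated, re-get it before reusing it */
	p = bpf_dynptr_data(&ptr, off, len);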

> >
> 
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-26  6:37             ` Martin KaFai Lau
@ 2022-08-26  6:50               ` Martin KaFai Lau
  2022-08-26 19:09               ` Kumar Kartikeya Dwivedi
  1 sibling, 0 replies; 43+ messages in thread
From: Martin KaFai Lau @ 2022-08-26  6:50 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Kumar Kartikeya Dwivedi, Toke Høiland-Jørgensen, bpf,
	andrii, daniel, ast, kuba, netdev, brouer, lorenzo

On Thu, Aug 25, 2022 at 11:37:08PM -0700, Martin KaFai Lau wrote:
> On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > I'm more and more liking the idea of limiting xdp to match the
> > > constraints of skb given that both you, Kumar, and Jakub have
> > > mentioned that portability between xdp and skb would be useful for
> > > users :)
> > >
> > > What are your thoughts on this API:
> > >
> > > 1) bpf_dynptr_data()
> > >
> > > Before:
> > >   for skb-type progs:
> > >       - data slices in fragments is not supported
> > >
> > >   for xdp-type progs:
> > >       - data slices in fragments is supported as long as it is in a
> > > contiguous frag (eg not across frags)
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >       - data slices in fragments is not supported
> I don't think it is necessary (or helpful) to restrict the xdp slice from
> getting a fragment.  In any case, the xdp prog has to deal with the case
> where bpf_dynptr_data(xdp_dynptr, offset, len) is across two fragments.
> Although unlikely, it still needs to handle it (dynptr_data returns NULL)
> properly by using bpf_dynptr_read().  In the same way, the skb case
> also needs to handle dynptr_data returning NULL.
> 
> Something like Kumar's sample in [0] should work for both
> xdp and skb dynptr but replace the helpers with
> bpf_dynptr_{data,read,write}().
> 
> [0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c
> 
> > >
> > >
> > > 2)  bpf_dynptr_write()
> > >
> > > Before:
> > >   for skb-type progs:
> > >      - all data slices are invalidated after a write
> > >
> > >   for xdp-type progs:
> > >      - nothing
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >      - all data slices are invalidated after a write
> I wonder if the 'Before' behavior can be kept as is.
> 
> The bpf prog that runs in both xdp and bpf should be
typo: both xdp and *skb

> the one always expecting the data-slice will be invalidated and
> it has to call the bpf_dynptr_data() again after writing.
> Yes, it is unnecessary for xdp, but the bpf prog needs to do the
> same anyway if the verifier is the one enforcing it for
> both the skb and xdp dynptr types.
> 
> If the bpf prog is written for xdp alone, then it has
> no need to re-get the bpf_dynptr_data() after writing.
> 
> > >
> > 
> > There is also the other option: failing to write until you pull skb,
> > which looks a lot better to me if we are adding this helper anyway...

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-25 23:18           ` Kumar Kartikeya Dwivedi
@ 2022-08-26 18:23             ` Joanne Koong
  2022-08-26 18:31               ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-26 18:23 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

On Thu, Aug 25, 2022 at 4:19 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Thu, 25 Aug 2022 at 22:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > > > [...]
> > > > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > > > >
> > > > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > > > >
> > > > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > > > as modified range is not considered for it. Same kind of issues when
> > > > > doing comparison with data_end from ctx (though maybe you won't be
> > > > > able to do incorrect data access at runtime using that).
> > > > >
> > > > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > > > and that disabled comparisons for them. reg->id is used for var_off
> > > > > range propagation so it cannot be reused.
> > > > >
> > > > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > > > in verifier log), or add some way to distinguish slice vs non-slice
> > > > > pkt pointers like I did in my patch. You might also want to add some
> > > > > tests for this corner case (there are some later in [0] if you want to
> > > > > reuse them).
> > > > >
> > > > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > > > not seem worth it. The pkt_uid distinction is more useful (and
> > > > > actually would be needed) in Toke's xdp queueing series, where in a
> > > > > dequeue program you have multiple xdp_mds and want scoped slice
> > > > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > > > slices of some other xdp_md). Here we can just get away with normal
> > > > > PTR_TO_MEM.
> > > > >
> > > > > ... Or just let me know if you handle this correctly already, or if
> > > > > this won't be an actual problem :).
> > > >
> > > > Ooh interesting, I hadn't previously taken a look at
> > > > try_match_pkt_pointers(), thanks for mentioning it :)
> > > >
> > > > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > > > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > > > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > > > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > > > we reject x behavior on the packet data reg types, this will
> > > > automatically apply to the data slices), and because it'll keep the
> > > > logic more efficient/simpler for the case when the pkt pointer has to
> > > > be cleared after any helper that changes pkt data is called (aka the
> > > > case where the data slice gets invalidated).
> > > >
> > > > What are your thoughts?
> > > >
> > >
> > > Thinking more deeply about it, probably not, we need more work here. I
> > > remember _now_ why I chose the pkt_uid approach (and this tells us my
> > > commit log lacks all the details about the motivation :( ).
> > >
> > > Consider how equivalency checking for packet pointers works in
> > > regsafe. It is checking type, then if old range > cur range, then
> > > offs, etc.
> > >
> > > The problem is, while we now don't prune on access to ptr_to_pkt vs
> > > ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> > > return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > > is greater than cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> > > separate frags, so this assumption would be incorrect. I would be able
> > > to trick the verifier into accessing data beyond the length of a
> > > different frag, by first making sure one line of exploration is
> > > verified, and then changing the register in another branch reaching
> > > the same branch target. Helpers can take packet pointers so the access
> > > can become a pruning point. It would think the rest of the stuff is
> > > safe, while they are not equivalent at all. It is ok if they are bit
> > > by bit equivalent (same type, range, off, etc.).
> >
> > Thanks for the explanation. To clarify, if old range of ptr_to_pkt >
> > cur range of ptr_to_pkt, what gets pruned? Is it access to cur range
> > of ptr_to_pkt since if old range > cur range, then if old range is
> > acceptable cur range must definitely be acceptable?
>
> No, my description was bad :).
> We return false when old_range > cur_range, i.e. the path is
> considered safe and not explored again when old_range <= cur_range
> (pruning), otherwise we continue verifying.
> Consider if it was doing pkt[cur_range + 1] access in the path we are
> about to explore again (already verified for old_range). That is <=
> old_range, but > cur_range, so it would be problematic if we pruned
> search for old_range > cur_range.

Does "old_range" here refer to the range that was already previously
verified as safe by the verifier? And "cur_range" is the new range
that we are trying to figure out is safe or not?

When you say "we return false when old_range > cur_range", what
function are we returning false from?

>
> So maybe it won't be a problem here, and just the current range checks
> for pkt pointer slices is fine even if they belong to different frags.
> I didn't craft any test case when writing my previous reply.
> Especially since you will disable comparisons, one cannot relearn
> range again using var_off + comparison, which closes another loophole.
>
> It just seems simpler to me to be a bit more conservative, since it is
> only an optimization. There might be some corner case lurking we can't
> think of right now. But I leave the judgement up to you if you can
> reason about it. In either case it would be good to include some
> comments in the commit log about all this.
>
> Meanwhile, looking at the current code, I'm more inclined to suggest
> PTR_TO_MEM (and handle invalidation specially), but again, I will
> leave it up to you to decide.
>
> When we do += var_off to a pkt reg, its range is reset to zero,
> compared to PTR_TO_MEM, where off + var_off (smin/umax) is used to
> check against the actual size for an access, which is a bit more
> flexible. The reason to reset range is that it will be relearned using
> comparisons and transferred to copies (reg->id is assigned for each +=

Can you direct me to the function where this relearning happens? thanks!

> var_off), which doesn't apply to slice pointers (essentially the only
> reason to keep them is being able to pick them for invalidation), we
> try to disable the rest of the pkt pointer magic in the verifier,
> anyway.
>
> pkt_reg->umax_value influences the prog->aux->max_pkt_offset (and iiuc
> it can reach that point with range == 0 after += var_off, and
> zero_size_allowed == true), only seems to be used by netronome's ebpf
> offload for now, but still a bit confusing if slice pkt pointers cause
> this to change.

My major takeaway from this discussion is that there's a lot of extra
subtleties when reg is PTR_TO_PACKET :) I'm going to delve deeper into
the source code, but from a glance, I think you're right that just
assigning PTR_TO_MEM for the data slice will probably make things a
lot more straightforward. thanks for the discussion!

>
> >
> > >
> > > If you start returning false whenever you see this type tag set, it
> > > will become too conservative (it considers reg copies of the same
> > > dynptr_data lookup as distinct). So you need some kind of id assigned
> > > during dynptr_data lookup to distinguish them.
> >
> > What about if the dynptr_pkt type tag is set, then we compare the
> > ranges as well? If the ranges are the same, then we return true, else
> > false. Does that work?
>
> Seems like it, and true part is already covered by memcmp at the start
> of the function, I think.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-26 18:23             ` Joanne Koong
@ 2022-08-26 18:31               ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-26 18:31 UTC (permalink / raw)
  To: Joanne Koong
  Cc: bpf, andrii, daniel, ast, kafai, kuba, netdev, toke, brouer, lorenzo

On Fri, 26 Aug 2022 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Aug 25, 2022 at 4:19 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Thu, 25 Aug 2022 at 22:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > > > > <memxor@gmail.com> wrote:
> > > > > > > [...]
> > > > > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > > > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > > > > >
> > > > > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > > > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > > > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > > > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > > > > >
> > > > > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > > > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > > > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > > > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > > > > as modified range is not considered for it. Same kind of issues when
> > > > > > doing comparison with data_end from ctx (though maybe you won't be
> > > > > > able to do incorrect data access at runtime using that).
> > > > > >
> > > > > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > > > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > > > > and that disabled comparisons for them. reg->id is used for var_off
> > > > > > range propagation so it cannot be reused.
> > > > > >
> > > > > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > > > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > > > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > > > > in verifier log), or add some way to distinguish slice vs non-slice
> > > > > > pkt pointers like I did in my patch. You might also want to add some
> > > > > > tests for this corner case (there are some later in [0] if you want to
> > > > > > reuse them).
> > > > > >
> > > > > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > > > > not seem worth it. The pkt_uid distinction is more useful (and
> > > > > > actually would be needed) in Toke's xdp queueing series, where in a
> > > > > > dequeue program you have multiple xdp_mds and want scoped slice
> > > > > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > > > > slices of some other xdp_md). Here we can just get away with normal
> > > > > > PTR_TO_MEM.
> > > > > >
> > > > > > ... Or just let me know if you handle this correctly already, or if
> > > > > > this won't be an actual problem :).
> > > > >
> > > > > Ooh interesting, I hadn't previously taken a look at
> > > > > try_match_pkt_pointers(), thanks for mentioning it :)
> > > > >
> > > > > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > > > > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > > > > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > > > > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > > > > we reject x behavior on the packet data reg types, this will
> > > > > automatically apply to the data slices), and because it'll keep the
> > > > > logic more efficient/simpler for the case when the pkt pointer has to
> > > > > be cleared after any helper that changes pkt data is called (aka the
> > > > > case where the data slice gets invalidated).
> > > > >
> > > > > What are your thoughts?
> > > > >
> > > >
> > > > Thinking more deeply about it, probably not, we need more work here. I
> > > > remember _now_ why I chose the pkt_uid approach (and this tells us my
> > > > commit log lacks all the details about the motivation :( ).
> > > >
> > > > Consider how equivalency checking for packet pointers works in
> > > > regsafe. It is checking type, then if old range > cur range, then
> > > > offs, etc.
> > > >
> > > > The problem is, while we now don't prune on access to ptr_to_pkt vs
> > > > ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> > > > return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > > > is greater than cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> > > > separate frags, so this assumption would be incorrect. I would be able
> > > > to trick the verifier into accessing data beyond the length of a
> > > > different frag, by first making sure one line of exploration is
> > > > verified, and then changing the register in another branch reaching
> > > > the same branch target. Helpers can take packet pointers so the access
> > > > can become a pruning point. It would think the rest of the stuff is
> > > > safe, while they are not equivalent at all. It is ok if they are bit
> > > > by bit equivalent (same type, range, off, etc.).
> > >
> > > Thanks for the explanation. To clarify, if old range of ptr_to_pkt >
> > > cur range of ptr_to_pkt, what gets pruned? Is it access to cur range
> > > of ptr_to_pkt since if old range > cur range, then if old range is
> > > acceptable cur range must definitely be acceptable?
> >
> > No, my description was bad :).
> > We return false when old_range > cur_range, i.e. the path is
> > considered safe and not explored again when old_range <= cur_range
> > (pruning), otherwise we continue verifying.
> > Consider if it was doing pkt[cur_range + 1] access in the path we are
> > about to explore again (already verified for old_range). That is <=
> > old_range, but > cur_range, so it would be problematic if we pruned
> > search for old_range > cur_range.
>
> Does "old_range" here refer to the range that was already previously
> verified as safe by the verifier? And "cur_range" is the new range
> that we are trying to figure out is safe or not?

Yes, it's the range of the same reg in the already-verified safe state
at the point we are about to explore again (when considering whether
to 'prune' the search because the current state would satisfy the
safety properties as well).

>
> When you say "we return false when old_range > cur_range", what
> function are we returning false from?
>

We return false from the regsafe function, i.e. it isn't safe if the
old range was greater than the current range, otherwise we match other
stuff before we return true (like off being equal, and var_off of old
being within cur's).
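
For reference, the packet pointer part of regsafe() in
kernel/bpf/verifier.c goes roughly like this (paraphrased from memory,
not an exact quote):

	case PTR_TO_PACKET:
		if (rcur->type != rold->type)
			return false;
		/* cur must have at least as much range as old did, so
		 * that every access verified against old stays in bounds
		 */
		if (rold->range > rcur->range)
			return false;
		/* off / var_off checks follow before returning true */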

> >
> > So maybe it won't be a problem here, and just the current range checks
> > for pkt pointer slices is fine even if they belong to different frags.
> > I didn't craft any test case when writing my previous reply.
> > Especially since you will disable comparisons, one cannot relearn
> > range again using var_off + comparison, which closes another loophole.
> >
> > It just seems simpler to me to be a bit more conservative, since it is
> > only an optimization. There might be some corner case lurking we can't
> > think of right now. But I leave the judgement up to you if you can
> > reason about it. In either case it would be good to include some
> > comments in the commit log about all this.
> >
> > Meanwhile, looking at the current code, I'm more inclined to suggest
> > PTR_TO_MEM (and handle invalidation specially), but again, I will
> > leave it up to you to decide.
> >
> > When we do += var_off to a pkt reg, its range is reset to zero,
> > compared to PTR_TO_MEM, where off + var_off (smin/umax) is used to
> > check against the actual size for an access, which is a bit more
> > flexible. The reason to reset range is that it will be relearned using
> > comparisons and transferred to copies (reg->id is assigned for each +=
>
> Can you direct me to the function where this relearning happens? thanks!
>

See the code in adjust_ptr_min_max_vals around the comment:
/* something was added to pkt_ptr, set range to zero */
where it memsets the range to 0. Then you can see how
find_good_pkt_pointers also takes reg->umax_value + off into account
before setting reg->range after your next comparison (because at
runtime the variable offset has already been added to the pointer).
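
From the prog side, the relearning pattern is the usual one (an xdp
sketch; the size 8 is made up, ctx is the xdp_md):

	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;

	data += var_off;		/* += var_off: range reset to 0 */
	if (data + 8 > data_end)	/* find_good_pkt_pointers relearns */
		return XDP_DROP;	/* range (8) for data and its copies */
	/* the first 8 bytes at data are now accessible */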

> > var_off), which doesn't apply to slice pointers (essentially the only
> > reason to keep them is being able to pick them for invalidation), we
> > try to disable the rest of the pkt pointer magic in the verifier,
> > anyway.
> >
> > pkt_reg->umax_value influences the prog->aux->max_pkt_offset (and iiuc
> > it can reach that point with range == 0 after += var_off, and
> > zero_size_allowed == true), only seems to be used by netronome's ebpf
> > offload for now, but still a bit confusing if slice pkt pointers cause
> > this to change.
>
> My major takeaway from this discussion is that there's a lot of extra
> subtleties when reg is PTR_TO_PACKET :) I'm going to delve deeper into
> the source code, but from a glance, I think you're right that just
> assigning PTR_TO_MEM for the data slice will probably make things a
> lot more straightforward. thanks for the discussion!

Exactly. I think this is an internal verifier detail so we can
definitely go back and change this later, but IMO we'd avoid a lot of
headache if we just choose PTR_TO_MEM, since PTR_TO_PACKET has a lot
of special semantics.

>
> >
> > >
> > > >
> > > > If you start returning false whenever you see this type tag set, it
> > > > will become too conservative (it considers reg copies of the same
> > > > dynptr_data lookup as distinct). So you need some kind of id assigned
> > > > during dynptr_data lookup to distinguish them.
> > >
> > > What about if the dynptr_pkt type tag is set, then we compare the
> > > ranges as well? If the ranges are the same, then we return true, else
> > > false. Does that work?
> >
> > Seems like it, and true part is already covered by memcmp at the start
> > of the function, I think.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-26  0:18         ` Kumar Kartikeya Dwivedi
@ 2022-08-26 18:44           ` Joanne Koong
  2022-08-26 18:51             ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-26 18:44 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Andrii Nakryiko, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > [...]
> > >
> > > Related question, it seems we know statically if dynptr is read only
> > > or not, so why even do all this hidden parameter passing and instead
> > > just reject writes directly? You only need to be able to set
> > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > seems simpler than checking it at runtime. Verifier already handles
> > > MEM_RDONLY generically, you only need to add the guard for
> > > check_packet_access (and check_helper_mem_access for meta->raw_mode
> > > under the pkt case), and rejecting dynptr_write seems like an if condition.
> >
> > There will be other helper functions that do writes (eg memcpy to
> > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > dynptrs, ...) so it's more scalable if we reject these at runtime
> > rather than enforce these at the verifier level. I also think it's
> > cleaner to keep the verifier logic as simple as possible and do the
> > checking in the helper.
>
> I won't be pushing this further, since you know what you plan to add
> in the future better, but I still disagree.
>
> I'm guessing there might be dynptrs where this read only property is
> set dynamically at runtime, which is why you want to go this route?
> I.e. you might not know statically whether dynptr is read only or not?
>
> My main confusion is the inconsistency here.
>
> Right now the patch implicitly relies on may_access_direct_pkt_data to
> protect slices returned from dynptr_data, instead of setting
> MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> needed. So indirectly, you are relying on knowing statically whether
> the dynptr is read only or not. But then you also set this bit at
> runtime.
>
> So you reject some cases at load time, and the rest of them only at
> runtime. Direct writes to a dynptr slice fail load, while writes through
> a helper do not (they only fail at runtime).
>
> Also, dynptr_data needs to know whether dynptr is read only
> statically, to protect writes to its returned pointer, unless you
> decide to introduce another helper for the dynamic rdonly bit case
> (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> works for some rdonly dynptrs (known to be rdonly statically, like
> this skb one), but not for others.
>
> I also don't agree about the complexity or scalability part; all the
> infra and precedent are already there. We already have similar checks
> for meta->raw_mode, where we reject writes to read-only pointers in
> check_helper_mem_access.

My point about scalability is that if we reject bpf_dynptr_write() at
load time, then we must reject any future dynptr helper that does any
writing at load time as well, to be consistent.

I don't feel strongly about whether we reject at load time or run
time. Rejecting at load time instead of runtime doesn't seem that
useful to me, but there's a good chance I'm wrong here since Martin
stated that he prefers rejecting at load time as well.

As for the added complexity part, what I mean is that we'll need to
keep track of some more stuff to support this, such as whether the
dynptr is read only and which helper functions need to check whether
the dynptr is read only or not.

On the whole, I think given that both Martin and you have specified
your preferences for this, we should reject at load time instead of
runtime. I'll make this change for v5 :)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-26 18:44           ` Joanne Koong
@ 2022-08-26 18:51             ` Kumar Kartikeya Dwivedi
  2022-08-26 19:49               ` Joanne Koong
  0 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-26 18:51 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Andrii Nakryiko, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Fri, 26 Aug 2022 at 20:44, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > [...]
> > > >
> > > > Related question, it seems we know statically if dynptr is read only
> > > > or not, so why even do all this hidden parameter passing and instead
> > > > just reject writes directly? You only need to be able to set
> > > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > > seems simpler than checking it at runtime. Verifier already handles
> > > > MEM_RDONLY generically, you only need to add the guard for
> > > > check_packet_access (and check_helper_mem_access for meta->raw_mode
> > > > under the pkt case), and rejecting dynptr_write seems like an if condition.
> > >
> > > There will be other helper functions that do writes (eg memcpy to
> > > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > > dynptrs, ...) so it's more scalable if we reject these at runtime
> > > rather than enforce these at the verifier level. I also think it's
> > > cleaner to keep the verifier logic as simple as possible and do the
> > > checking in the helper.
> >
> > I won't be pushing this further, since you know what you plan to add
> > in the future better, but I still disagree.
> >
> > I'm guessing there might be dynptrs where this read only property is
> > set dynamically at runtime, which is why you want to go this route?
> > I.e. you might not know statically whether dynptr is read only or not?
> >
> > My main confusion is the inconsistency here.
> >
> > Right now the patch implicitly relies on may_access_direct_pkt_data to
> > protect slices returned from dynptr_data, instead of setting
> > MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> > needed. So indirectly, you are relying on knowing statically whether
> > the dynptr is read only or not. But then you also set this bit at
> > runtime.
> >
> > So you reject some cases at load time, and the rest of them only at
> > runtime. Direct writes to a dynptr slice fail load, while writes through
> > a helper do not (they only fail at runtime).
> >
> > Also, dynptr_data needs to know whether dynptr is read only
> > statically, to protect writes to its returned pointer, unless you
> > decide to introduce another helper for the dynamic rdonly bit case
> > (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> > works for some rdonly dynptrs (known to be rdonly statically, like
> > this skb one), but not for others.
> >
> > I also don't agree about the complexity or scalability part; all the
> > infra and precedent are already there. We already have similar checks
> > for meta->raw_mode, where we reject writes to read-only pointers in
> > check_helper_mem_access.
>
> My point about scalability is that if we reject bpf_dynptr_write() at
> load time, then we must reject any future dynptr helper that does any
> writing at load time as well, to be consistent.
>
> I don't feel strongly about whether we reject at load time or run
> time. Rejecting at load time instead of runtime doesn't seem that
> useful to me, but there's a good chance I'm wrong here since Martin
> stated that he prefers rejecting at load time as well.
>
> As for the added complexity part, what I mean is that we'll need to
> keep track of some more stuff to support this, such as whether the
> dynptr is read only and which helper functions need to check whether
> the dynptr is read only or not.

What I'm trying to understand is how dynptr_data is supposed to work
if this dynptr read-only bit is only known at runtime. Or will it
always be known statically, so that dynptr_data can set the returned
pointer as read-only? Because then it doesn't seem required or useful
to track the read-only bit at runtime.

It is fine if _everything_ checks it at runtime, but that doesn't seem
possible, hence the question. We would need a new slice helper that
only returns read-only slices, because dynptr_data can return rw
slices currently and it is already UAPI so changing that is not
possible anymore.
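
(For concreteness, such a helper could be the dynptr_data_rdonly
mentioned earlier; the signature below is purely hypothetical:

	/* returns a read-only slice of len bytes, or NULL */
	const void *bpf_dynptr_data_rdonly(struct bpf_dynptr *ptr,
					   u32 offset, u32 len);
)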

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-26  6:37             ` Martin KaFai Lau
  2022-08-26  6:50               ` Martin KaFai Lau
@ 2022-08-26 19:09               ` Kumar Kartikeya Dwivedi
  2022-08-26 20:47                 ` Joanne Koong
  1 sibling, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-26 19:09 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joanne Koong, Toke Høiland-Jørgensen, bpf, andrii,
	daniel, ast, kuba, netdev, brouer, lorenzo

On Fri, 26 Aug 2022 at 08:37, Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > I'm more and more liking the idea of limiting xdp to match the
> > > constraints of skb given that both you, Kumar, and Jakub have
> > > mentioned that portability between xdp and skb would be useful for
> > > users :)
> > >
> > > What are your thoughts on this API:
> > >
> > > 1) bpf_dynptr_data()
> > >
> > > Before:
> > >   for skb-type progs:
> > >       - data slices in fragments is not supported
> > >
> > >   for xdp-type progs:
> > >       - data slices in fragments is supported as long as it is in a
> > > contiguous frag (eg not across frags)
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >       - data slices in fragments is not supported
> I don't think it is necessary (or helpful) to restrict the xdp slice from
> getting a fragment.  In any case, the xdp prog has to deal with the case
> where bpf_dynptr_data(xdp_dynptr, offset, len) is across two fragments.
> Although unlikely, it still needs to handle it (dynptr_data returns NULL)
> properly by using bpf_dynptr_read().  In the same way, the skb case
> also needs to handle dynptr_data returning NULL.
>
> Something like Kumar's sample in [0] should work for both
> xdp and skb dynptr but replace the helpers with
> bpf_dynptr_{data,read,write}().
>
> [0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c
>
> > >
> > >
> > > 2)  bpf_dynptr_write()
> > >
> > > Before:
> > >   for skb-type progs:
> > >      - all data slices are invalidated after a write
> > >
> > >   for xdp-type progs:
> > >      - nothing
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >      - all data slices are invalidated after a write
> I wonder if the 'Before' behavior can be kept as is.
>
> The bpf prog that runs in both xdp and bpf should be
> the one always expecting the data-slice will be invalidated and
> it has to call the bpf_dynptr_data() again after writing.
> Yes, it is unnecessary for xdp, but the bpf prog needs to do the
> same anyway if the verifier is the one enforcing it for
> both the skb and xdp dynptr types.
>
> If the bpf prog is written for xdp alone, then it has
> no need to re-get the bpf_dynptr_data() after writing.
>

Yeah, compared to the alternative, I guess it's better how it is right
now. It doesn't seem possible to meaningfully hide the differences
without penalizing either case. It also wouldn't be good to make it
less useful for XDP, since the whole point of this and the previous
effort was exposing bpf_xdp_pointer to the user.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-26 18:51             ` Kumar Kartikeya Dwivedi
@ 2022-08-26 19:49               ` Joanne Koong
  2022-08-26 20:54                 ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Joanne Koong @ 2022-08-26 19:49 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Andrii Nakryiko, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Fri, Aug 26, 2022 at 11:52 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 26 Aug 2022 at 20:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > [...]
> > > > >
> > > > > Related question, it seems we know statically if dynptr is read only
> > > > > or not, so why even do all this hidden parameter passing and instead
> > > > > just reject writes directly? You only need to be able to set
> > > > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > > > seems simpler than checking it at runtime. Verifier already handles
> > > > > MEM_RDONLY generically, you only need to add the guard for
> > > > > check_packet_access (and check_helper_mem_access for meta->raw_mode
> > > > > under the pkt case), and rejecting dynptr_write seems like an if condition.
> > > >
> > > > There will be other helper functions that do writes (eg memcpy to
> > > > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > > > dynptrs, ...) so it's more scalable if we reject these at runtime
> > > > rather than enforce these at the verifier level. I also think it's
> > > > cleaner to keep the verifier logic as simple as possible and do the
> > > > checking in the helper.
> > >
> > > I won't be pushing this further, since you know what you plan to add
> > > in the future better, but I still disagree.
> > >
> > > I'm guessing there might be dynptrs where this read only property is
> > > set dynamically at runtime, which is why you want to go this route?
> > > I.e. you might not know statically whether dynptr is read only or not?
> > >
> > > My main confusion is the inconsistency here.
> > >
> > > Right now the patch implicitly relies on may_access_direct_pkt_data to
> > > protect slices returned from dynptr_data, instead of setting
> > > MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> > > needed. So indirectly, you are relying on knowing statically whether
> > > the dynptr is read only or not. But then you also set this bit at
> > > runtime.
> > >
> > > So you reject some cases at load time, and the rest of them only at
> > > runtime. Direct writes to a dynptr slice fail load, while writes through
> > > a helper do not (they only fail at runtime).
> > >
> > > Also, dynptr_data needs to know whether dynptr is read only
> > > statically, to protect writes to its returned pointer, unless you
> > > decide to introduce another helper for the dynamic rdonly bit case
> > > (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> > > works for some rdonly dynptrs (known to be rdonly statically, like
> > > this skb one), but not for others.
> > >
> > > I also don't agree about the complexity or scalability part; all the
> > > infra and precedent are already there. We already have similar checks
> > > for meta->raw_mode, where we reject writes to read-only pointers in
> > > check_helper_mem_access.
> >
> > My point about scalability is that if we reject bpf_dynptr_write() at
> > load time, then we must reject any future dynptr helper that does any
> > writing at load time as well, to be consistent.
> >
> > I don't feel strongly about whether we reject at load time or run
> > time. Rejecting at load time instead of runtime doesn't seem that
> > useful to me, but there's a good chance I'm wrong here since Martin
> > stated that he prefers rejecting at load time as well.
> >
> > As for the added complexity part, what I mean is that we'll need to
> > keep track of some more stuff to support this, such as whether the
> > dynptr is read only and which helper functions need to check whether
> > the dynptr is read only or not.
>
> What I'm trying to understand is how dynptr_data is supposed to work
> if this dynptr read only bit is only known at runtime. Or will it be
> always known statically so that it can set returned pointer as read
> only? Because then it doesn't seem it is required or useful to track
> the readonly bit at runtime.

I think it'll always be known statically whether the dynptr is
read-only or not. If we make all writable dynptr helper functions
reject read-only dynptrs at load time instead of run time, then yes we
can remove the read-only bit in the bpf_dynptr_kern struct.

There's also the question of whether this constraint (eg all writes to
read-only dynptrs are rejected at load time) is too rigid - for example,
what if in the future we want to add a helper function where, if a
certain condition is met, we write some number of bytes, else we read
some number of bytes? That would then not be possible to add, since
we'll only know at runtime whether the condition is met.

I personally lean towards rejecting helper function writes at runtime,
but if you think there's a non-trivial benefit to rejecting at load
time instead, I'm fine going with that.

>
> It is fine if _everything_ checks it at runtime, but that doesn't seem
> possible, hence the question. We would need a new slice helper that
> only returns read-only slices, because dynptr_data can return rw
> slices currently and it is already UAPI so changing that is not
> possible anymore.

I don't agree that if bpf_dynptr_write() is checked at runtime, then
bpf_dynptr_data must also be checked at runtime to be consistent. I
think it's fine if writes through helper functions are rejected at
runtime, and writes through direct access are rejected at load time.
That doesn't seem inconsistent to me.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs
  2022-08-26 19:09               ` Kumar Kartikeya Dwivedi
@ 2022-08-26 20:47                 ` Joanne Koong
  0 siblings, 0 replies; 43+ messages in thread
From: Joanne Koong @ 2022-08-26 20:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Martin KaFai Lau, Toke Høiland-Jørgensen, bpf, andrii,
	daniel, ast, kuba, netdev, brouer, lorenzo

On Fri, Aug 26, 2022 at 12:09 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 26 Aug 2022 at 08:37, Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> > > On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > I'm more and more liking the idea of limiting xdp to match the
> > > > constraints of skb given that both you, Kumar, and Jakub have
> > > > mentioned that portability between xdp and skb would be useful for
> > > > users :)
> > > >
> > > > What are your thoughts on this API:
> > > >
> > > > 1) bpf_dynptr_data()
> > > >
> > > > Before:
> > > >   for skb-type progs:
> > > >       - data slices in fragments is not supported
> > > >
> > > >   for xdp-type progs:
> > > >       - data slices in fragments is supported as long as it is in a
> > > > contiguous frag (eg not across frags)
> > > >
> > > > Now:
> > > >   for skb + xdp type progs:
> > > >       - data slices in fragments is not supported
> > I don't think it is necessary (or helpful) to restrict the xdp slice from
> > getting a fragment.  In any case, the xdp prog has to deal with the case
> > where bpf_dynptr_data(xdp_dynptr, offset, len) is across two fragments.
> > Although unlikely, it still needs to handle it (dynptr_data returns NULL)
> > properly by using bpf_dynptr_read().  In the same way, the skb case
> > also needs to handle dynptr_data returning NULL.
> >
> > Something like Kumar's sample in [0] should work for both
> > xdp and skb dynptr but replace the helpers with
> > bpf_dynptr_{data,read,write}().
> >
> > [0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c
> >
> > > >
> > > >
> > > > 2)  bpf_dynptr_write()
> > > >
> > > > Before:
> > > >   for skb-type progs:
> > > >      - all data slices are invalidated after a write
> > > >
> > > >   for xdp-type progs:
> > > >      - nothing
> > > >
> > > > Now:
> > > >   for skb + xdp type progs:
> > > >      - all data slices are invalidated after a write
> > I wonder if the 'Before' behavior can be kept as is.
> >
> > The bpf prog that runs in both xdp and bpf should be
> > the one always expecting the data-slice will be invalidated and
> > it has to call the bpf_dynptr_data() again after writing.
> > Yes, it is unnecessary for xdp, but the bpf prog needs to do the
> > same anyway if the verifier is the one enforcing it for
> > both the skb and xdp dynptr types.
> >
> > If the bpf prog is written for xdp alone, then it has
> > no need to re-get the bpf_dynptr_data() after writing.
> >
>
> Yeah, compared to the alternative, I guess it's better how it is right
> now. It doesn't seem possible to meaningfully hide the differences
> without penalizing either case. It also wouldn't be good to make it
> less useful for XDP, since the whole point of this and the previous
> effort was exposing bpf_xdp_pointer to the user.

I'll keep it as is for v5.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-26 19:49               ` Joanne Koong
@ 2022-08-26 20:54                 ` Kumar Kartikeya Dwivedi
  2022-08-27  5:36                   ` Andrii Nakryiko
  0 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-26 20:54 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Andrii Nakryiko, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Fri, 26 Aug 2022 at 21:49, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 11:52 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 26 Aug 2022 at 20:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > [...]
> > > > > >
> > > > > > Related question, it seems we know statically if dynptr is read only
> > > > > > or not, so why even do all this hidden parameter passing and instead
> > > > > > just reject writes directly? You only need to be able to set
> > > > > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > > > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > > > > seems simpler than checking it at runtime. Verifier already handles
> > > > > > MEM_RDONLY generically, you only need to add the guard for
> > > > > > check_packet_access (and check_helper_mem_access for meta->raw_mode
> > > > > > under the pkt case), and rejecting dynptr_write seems like an if condition.
> > > > >
> > > > > There will be other helper functions that do writes (eg memcpy to
> > > > > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > > > > dynptrs, ...) so it's more scalable if we reject these at runtime
> > > > > rather than enforce these at the verifier level. I also think it's
> > > > > cleaner to keep the verifier logic as simple as possible and do the
> > > > > checking in the helper.
> > > >
> > > > I won't be pushing this further, since you know what you plan to add
> > > > in the future better, but I still disagree.
> > > >
> > > > I'm guessing there might be dynptrs where this read only property is
> > > > set dynamically at runtime, which is why you want to go this route?
> > > > I.e. you might not know statically whether dynptr is read only or not?
> > > >
> > > > My main confusion is the inconsistency here.
> > > >
> > > > Right now the patch implicitly relies on may_access_direct_pkt_data to
> > > > protect slices returned from dynptr_data, instead of setting
> > > > MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> > > > needed. So indirectly, you are relying on knowing statically whether
> > > > the dynptr is read only or not. But then you also set this bit at
> > > > runtime.
> > > >
> > > > So you reject some cases at load time, and the rest of them only at
> > > > runtime. Direct writes to a dynptr slice fail load, while writes through
> > > > a helper do not (they only fail at runtime).
> > > >
> > > > Also, dynptr_data needs to know whether dynptr is read only
> > > > statically, to protect writes to its returned pointer, unless you
> > > > decide to introduce another helper for the dynamic rdonly bit case
> > > > (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> > > > works for some rdonly dynptrs (known to be rdonly statically, like
> > > > this skb one), but not for others.
> > > >
> > > > I also don't agree about the complexity or scalability part; all the
> > > > infra and precedent are already there. We already have similar checks
> > > > for meta->raw_mode, where we reject writes to read-only pointers in
> > > > check_helper_mem_access.
> > >
> > > My point about scalability is that if we reject bpf_dynptr_write() at
> > > load time, then we must reject any future dynptr helper that does any
> > > writing at load time as well, to be consistent.
> > >
> > > I don't feel strongly about whether we reject at load time or run
> > > time. Rejecting at load time instead of runtime doesn't seem that
> > > useful to me, but there's a good chance I'm wrong here since Martin
> > > stated that he prefers rejecting at load time as well.
> > >
> > > As for the added complexity part, what I mean is that we'll need to
> > > keep track of some more stuff to support this, such as whether the
> > > dynptr is read only and which helper functions need to check whether
> > > the dynptr is read only or not.
> >
> > What I'm trying to understand is how dynptr_data is supposed to work
> > if this dynptr read only bit is only known at runtime. Or will it be
> > always known statically so that it can set returned pointer as read
> > only? Because then it doesn't seem it is required or useful to track
> > the readonly bit at runtime.
>
> I think it'll always be known statically whether the dynptr is
> read-only or not. If we make all writable dynptr helper functions
> reject read-only dynptrs at load time instead of run time, then yes we
> can remove the read-only bit in the bpf_dynptr_kern struct.
>
> There's also the question of whether this constraint (eg all read-only
> writes are rejected at load time) is too rigid - for example, what if
> in the future we want to add a helper function where if a certain
> condition is met, then we write some number of bytes, else we read
> some number of bytes? This would be not possible to add then, since
> we'll only know at runtime whether the condition is met.
>
> I personally lean towards rejecting helper function writes at runtime,
> but if you think it's a non-trivial benefit to reject at load time
> instead, I'm fine going with that.
>

My personal opinion is this:

When I am working with a statically known read-only dynptr, it is like
declaring a variable const. Every function expecting it to be
non-const should fail compilation, and trying to mutate the variable
through writes should also fail compilation. For BPF, compilation is
analogous to program load.

It might be that said variable is not const; then those operations may
fail at runtime due to some other reason. Being dynamically read-only
is then a runtime failure condition. Both are distinct cases in my
mind, and it is fine to fail at runtime when we don't know statically.
In general, you save the user a lot of time, in my opinion (esp.
people new to things), if you reject known incorrectness as early as
possible.

E.g. load a dynptr from a map, where the field accepts storing both
read-only and non-read-only ones. Then it is expected that writes may
fail at runtime. That also allows you to switch read-only-ness at
runtime back to rw. But if the field only expects a rdonly dynptr, the
verifier knows that the type is const statically, so it triggers
failures for writes at load time instead.

Taking this a step further, you may even store a rw dynptr to a map
field expecting a rdonly dynptr. That's like returning a const pointer
from a function for rw memory, where you want to limit the user's
access, even better if you do it statically. Then functions trying to
write to a dynptr loaded from said map field will fail at load itself,
while others having access to the rw dynptr can do it just fine.

When the verifier does not know, it does not know. There will be such
cases when you make const-ness a runtime property.
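
In C terms the analogy is just this (nothing BPF-specific here):

	void mutate(char *p);

	const char ro_buf[4] = "abc";	/* statically const */
	char rw_buf[4];

	mutate(ro_buf);	/* rejected at compile time (~ program load) */
	mutate(rw_buf);	/* compiles fine; may still fail at runtime
			 * for other reasons (~ dynamic read-only-ness)
			 */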

> >
> > It is fine if _everything_ checks it at runtime, but that doesn't seem
> > possible, hence the question. We would need a new slice helper that
> > only returns read-only slices, because dynptr_data can return rw
> > slices currently and it is already UAPI so changing that is not
> > possible anymore.
>
> I don't agree that if bpf_dynptr_write() is checked at runtime, then
> bpf_dynptr_data must also be checked at runtime to be consistent. I
> think it's fine if writes through helper functions are rejected at
> runtime, and writes through direct access are rejected at load time.
> That doesn't seem inconsistent to me.

My point was more that dynptr_data cannot propagate runtime
read-only-ness to its returned pointer. The verifier has to know
statically, at which point I don't see why we can't just reject other
cases at load anyway.

When we have dynptrs whose const-ness is a runtime property, it is ok
to support that by failing at runtime (but then you'll have a hard
time deciding how you want dynptr_data to work; most likely you'll
need another helper which returns only a rdonly slice, and when that
fails, we call dynptr_data for a rw slice).

But as I said before, I don't know how dynptr is going to evolve in
the future, so you'll have a better idea, and I'll leave it up to you
decide how you want to design its API. Enough words exchanged about
this :).

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-26 20:54                 ` Kumar Kartikeya Dwivedi
@ 2022-08-27  5:36                   ` Andrii Nakryiko
  2022-08-27  7:11                     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-27  5:36 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Fri, Aug 26, 2022 at 1:54 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Fri, 26 Aug 2022 at 21:49, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 11:52 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Fri, 26 Aug 2022 at 20:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > >
> > > > > On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > [...]
> > > > > > >
> > > > > > > Related question, it seems we know statically if dynptr is read only
> > > > > > > or not, so why even do all this hidden parameter passing and instead
> > > > > > > just reject writes directly? You only need to be able to set
> > > > > > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > > > > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > > > > > seems simpler than checking it at runtime. Verifier already handles
> > > > > > > MEM_RDONLY generically, you only need to add the guard for
> > > > > > > check_packet_access (and check_helper_mem_access for meta->raw_mode
> > > > > > > under the pkt case), and rejecting dynptr_write seems like an if condition.
> > > > > >
> > > > > > There will be other helper functions that do writes (eg memcpy to
> > > > > > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > > > > > dynptrs, ...) so it's more scalable if we reject these at runtime
> > > > > > rather than enforce these at the verifier level. I also think it's
> > > > > > cleaner to keep the verifier logic as simple as possible and do the
> > > > > > checking in the helper.
> > > > >
> > > > > I won't be pushing this further, since you know what you plan to add
> > > > > in the future better, but I still disagree.
> > > > >
> > > > > I'm guessing there might be dynptrs where this read only property is
> > > > > set dynamically at runtime, which is why you want to go this route?
> > > > > I.e. you might not know statically whether dynptr is read only or not?
> > > > >
> > > > > My main confusion is the inconsistency here.
> > > > >
> > > > > Right now the patch implicitly relies on may_access_direct_pkt_data to
> > > > > protect slices returned from dynptr_data, instead of setting
> > > > > MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> > > > > needed. So indirectly, you are relying on knowing statically whether
> > > > > the dynptr is read only or not. But then you also set this bit at
> > > > > runtime.
> > > > >
> > > > > So you reject some cases at load time, and the rest of them only at
> > > > > runtime. Direct writes to dynptr slice fails load, writes through
> > > > > helper does not (only fails at runtime).
> > > > >
> > > > > Also, dynptr_data needs to know whether dynptr is read only
> > > > > statically, to protect writes to its returned pointer, unless you
> > > > > decide to introduce another helper for the dynamic rdonly bit case
> > > > > (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> > > > > works for some rdonly dynptrs (known to be rdonly statically, like
> > > > > this skb one), but not for others.
> > > > >
> > > > > I also don't agree about the complexity or scalability part, all the
> > > > > infra and precedence is already there. We already have similar checks
> > > > > for meta->raw_mode where we reject writes to read only pointers in
> > > > > check_helper_mem_access.
> > > >
> > > > My point about scalability is that if we reject bpf_dynptr_write() at
> > > > load time, then we must reject any future dynptr helper that does any
> > > > writing at load time as well, to be consistent.
> > > >
> > > > I don't feel strongly about whether we reject at load time or run
> > > > time. Rejecting at load time instead of runtime doesn't seem that
> > > > useful to me, but there's a good chance I'm wrong here since Martin
> > > > stated that he prefers rejecting at load time as well.
> > > >
> > > > As for the added complexity part, what I mean is that we'll need to
> > > > keep track of some more stuff to support this, such as whether the
> > > > dynptr is read only and which helper functions need to check whether
> > > > the dynptr is read only or not.
> > >
> > > What I'm trying to understand is how dynptr_data is supposed to work
> > > if this dynptr read only bit is only known at runtime. Or will it be
> > > always known statically so that it can set returned pointer as read
> > > only? Because then it doesn't seem it is required or useful to track
> > > the readonly bit at runtime.
> >
> > I think it'll always be known statically whether the dynptr is
> > read-only or not. If we make all writable dynptr helper functions
> > reject read-only dynptrs at load time instead of run time, then yes we
> > can remove the read-only bit in the bpf_dynptr_kern struct.
> >
> > There's also the question of whether this constraint (eg all read-only
> > writes are rejected at load time) is too rigid - for example, what if
> > in the future we want to add a helper function where if a certain
> > condition is met, then we write some number of bytes, else we read
> > some number of bytes? This would be not possible to add then, since
> > we'll only know at runtime whether the condition is met.
> >
> > I personally lean towards rejecting helper function writes at runtime,
> > but if you think it's a non-trivial benefit to reject at load time
> > instead, I'm fine going with that.
> >
>
> My personal opinion is this:
>
> When I am working with a statically known read only dynptr, it is like
> declaring a variable const. Every function expecting it to be
> non-const should fail compilation, and trying to mutate the variables
> through writes should also fail compilation. For BPF compilation is
> analogous to program load.
>
> It might be that said variable is not const, then those operations may
> fail at runtime due to some other reason. Being dynamically read-only
> is then a runtime failure condition, which will cause failure at
> runtime. Both are distinct cases in my mind, and it is fine to fail at
> runtime when we don't know. In general, you save a lot of time of the
> user in my opinion (esp. people new to things) if you reject known
> incorrectness as early as possible.
>
> E.g. load a dynptr from a map, where the field accepts storing both
> read only and non-read only ones. Then it is expected that writes may
> fail at runtime. That also allows you to switch read-only ness at
> runtime back to rw. But if the field only expects rdonly dynptr,
> verifier knows that the type is const statically, so it triggers
> failures for writes at load time instead.
>
> Taking this a step further, you may even store rw dynptr to a map
> field expecting rdonly dynptr. That's like returning a const pointer
> from a function for a rw memory, where you want to limit access of the
> user, even better if you do it statically. Then functions trying to
> write to dynptr loaded from said map field will fail load itself,
> while others having access to rw dynptr can do it just fine.
>
> When the verifier does not know, it does not know. There will be such
> cases when you make const-ness a runtime property.
>
> > >
> > > It is fine if _everything_ checks it at runtime, but that doesn't seem
> > > possible, hence the question. We would need a new slice helper that
> > > only returns read-only slices, because dynptr_data can return rw
> > > slices currently and it is already UAPI so changing that is not
> > > possible anymore.
> >
> > I don't agree that if bpf_dynptr_write() is checked at runtime, then
> > bpf_dynptr_data must also be checked at runtime to be consistent. I
> > think it's fine if writes through helper functions are rejected at
> > runtime, and writes through direct access are rejected at load time.
> > That doesn't seem inconsistent to me.
>
> My point was more that dynptr_data cannot propagate runtime
> read-only-ness to its returned pointer. The verifier has to know
> statically, at which point I don't see why we can't just reject other
> cases at load anyway.

I think the right answer here is to not make bpf_dynptr_data() return
direct pointer of changing read-only-ness. Maybe the right answer here
is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
read-only?

By saying that read-only-ness of dynptr should be statically known and
rejecting some dynptr functions at load time places us at the mercy of
verifier's complete knowledge of application logic, which is exactly
against the spirit of dynptr.

It's only slightly tangential, but I still dread my experience proving
to BPF verifier that some value is strictly greater than zero for BPF
helper that expected ARG_CONST_SIZE (not ARG_CONST_SIZE_OR_ZERO).
There were also cases where an absolutely correct program had to be
mangled just to prove to BPF verifier that it indeed can return just 0
or 1, etc. This is not to bash BPF verifier, but just to point out
that sometimes unnecessary strictness adds nothing but unnecessary
pain to the user. So, let's not reject anything at load, we can check all
that at runtime and return NULL.

But bpf_dynptr_data_rdonly() seems useful for cases where we know we
are not going to write.
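
To make the two failure modes concrete, here is a rough sketch (not
lifted from the patches; bpf_dynptr_from_skb() is the helper this
series adds, and cgroup_skb is one of the prog types with read-only
packet access):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("cgroup_skb/egress")
    int ro_dynptr_example(struct __sk_buff *skb)
    {
            struct bpf_dynptr ptr;
            __u8 buf[4] = {};
            __u8 *data;

            bpf_dynptr_from_skb(skb, 0, &ptr);

            data = bpf_dynptr_data(&ptr, 0, sizeof(buf));
            if (data && data[0] == 0x42)  /* reading the slice is fine */
                    return 1;
            /* data[0] = 0; would fail at *load* time: this prog type
             * has read-only packet access, so the slice isn't writable
             */

            /* a write through a helper passes the verifier but fails
             * at *runtime*, since the dynptr's rdonly bit is set
             */
            bpf_dynptr_write(&ptr, 0, buf, sizeof(buf), 0);
            return 1;
    }

    char _license[] SEC("license") = "GPL";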

>
> When we have dynptrs which have const-ness as a runtime property, it
> is ok to support that by failing at runtime (but then you'll have a
> hard time deciding how you want dynptr_data to work; most likely
> you'll need another helper which returns only a rdonly slice, and when
> it fails, we call dynptr_data for a rw slice).
>
> But as I said before, I don't know how dynptr is going to evolve in
> the future, so you'll have a better idea, and I'll leave it up to you
> to decide how you want to design its API. Enough words exchanged about
> this :).

directionally, dynptr is about offloading decisions to runtime, so I
think avoiding unnecessary restrictions at verification time is the
right trade off

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-27  5:36                   ` Andrii Nakryiko
@ 2022-08-27  7:11                     ` Kumar Kartikeya Dwivedi
  2022-08-27 17:21                       ` Andrii Nakryiko
  0 siblings, 1 reply; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-27  7:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Sat, 27 Aug 2022 at 07:37, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 1:54 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Fri, 26 Aug 2022 at 21:49, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Fri, Aug 26, 2022 at 11:52 AM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Fri, 26 Aug 2022 at 20:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
> > > > > <memxor@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > [...]
> > > > > > > >
> > > > > > > > Related question, it seems we know statically if dynptr is read only
> > > > > > > > or not, so why even do all this hidden parameter passing and instead
> > > > > > > > just reject writes directly? You only need to be able to set
> > > > > > > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > > > > > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > > > > > > seems simpler than checking it at runtime. Verifier already handles
> > > > > > > > MEM_RDONLY generically, you only need to add the guard for
> > > > > > > > check_packet_acces (and check_helper_mem_access for meta->raw_mode
> > > > > > > > under pkt case), and rejecting dynptr_write seems like a if condition.
> > > > > > >
> > > > > > > There will be other helper functions that do writes (eg memcpy to
> > > > > > > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > > > > > > dynptrs, ...) so it's more scalable if we reject these at runtime
> > > > > > > rather than enforce these at the verifier level. I also think it's
> > > > > > > cleaner to keep the verifier logic as simple as possible and do the
> > > > > > > checking in the helper.
> > > > > >
> > > > > > I won't be pushing this further, since you know what you plan to add
> > > > > > in the future better, but I still disagree.
> > > > > >
> > > > > > I'm guessing there might be dynptrs where this read only property is
> > > > > > set dynamically at runtime, which is why you want to go this route?
> > > > > > I.e. you might not know statically whether dynptr is read only or not?
> > > > > >
> > > > > > My main confusion is the inconsistency here.
> > > > > >
> > > > > > Right now the patch implicitly relies on may_access_direct_pkt_data to
> > > > > > protect slices returned from dynptr_data, instead of setting
> > > > > > MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> > > > > > needed. So indirectly, you are relying on knowing statically whether
> > > > > > the dynptr is read only or not. But then you also set this bit at
> > > > > > runtime.
> > > > > >
> > > > > > So you reject some cases at load time, and the rest of them only at
> > > > > > runtime. Direct writes to dynptr slice fails load, writes through
> > > > > > helper does not (only fails at runtime).
> > > > > >
> > > > > > Also, dynptr_data needs to know whether dynptr is read only
> > > > > > statically, to protect writes to its returned pointer, unless you
> > > > > > decide to introduce another helper for the dynamic rdonly bit case
> > > > > > (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> > > > > > works for some rdonly dynptrs (known to be rdonly statically, like
> > > > > > this skb one), but not for others.
> > > > > >
> > > > > > I also don't agree about the complexity or scalability part, all the
> > > > > > infra and precedence is already there. We already have similar checks
> > > > > > for meta->raw_mode where we reject writes to read only pointers in
> > > > > > check_helper_mem_access.
> > > > >
> > > > > My point about scalability is that if we reject bpf_dynptr_write() at
> > > > > load time, then we must reject any future dynptr helper that does any
> > > > > writing at load time as well, to be consistent.
> > > > >
> > > > > I don't feel strongly about whether we reject at load time or run
> > > > > time. Rejecting at load time instead of runtime doesn't seem that
> > > > > useful to me, but there's a good chance I'm wrong here since Martin
> > > > > stated that he prefers rejecting at load time as well.
> > > > >
> > > > > As for the added complexity part, what I mean is that we'll need to
> > > > > keep track of some more stuff to support this, such as whether the
> > > > > dynptr is read only and which helper functions need to check whether
> > > > > the dynptr is read only or not.
> > > >
> > > > What I'm trying to understand is how dynptr_data is supposed to work
> > > > if this dynptr read only bit is only known at runtime. Or will it be
> > > > always known statically so that it can set returned pointer as read
> > > > only? Because then it doesn't seem it is required or useful to track
> > > > the readonly bit at runtime.
> > >
> > > I think it'll always be known statically whether the dynptr is
> > > read-only or not. If we make all writable dynptr helper functions
> > > reject read-only dynptrs at load time instead of run time, then yes we
> > > can remove the read-only bit in the bpf_dynptr_kern struct.
> > >
> > > There's also the question of whether this constraint (eg all read-only
> > > writes are rejected at load time) is too rigid - for example, what if
> > > in the future we want to add a helper function where if a certain
> > > condition is met, then we write some number of bytes, else we read
> > > some number of bytes? This would be not possible to add then, since
> > > we'll only know at runtime whether the condition is met.
> > >
> > > I personally lean towards rejecting helper function writes at runtime,
> > > but if you think it's a non-trivial benefit to reject at load time
> > > instead, I'm fine going with that.
> > >
> >
> > My personal opinion is this:
> >
> > When I am working with a statically known read only dynptr, it is like
> > declaring a variable const. Every function expecting it to be
> > non-const should fail compilation, and trying to mutate the variables
> > through writes should also fail compilation. For BPF compilation is
> > analogous to program load.
> >
> > It might be that said variable is not const, then those operations may
> > fail at runtime due to some other reason. Being dynamically read-only
> > is then a runtime failure condition, which will cause failure at
> > runtime. Both are distinct cases in my mind, and it is fine to fail at
> > runtime when we don't know. In general, you save a lot of time of the
> > user in my opinion (esp. people new to things) if you reject known
> > incorrectness as early as possible.
> >
> > E.g. load a dynptr from a map, where the field accepts storing both
> > read only and non-read only ones. Then it is expected that writes may
> > fail at runtime. That also allows you to switch read-only ness at
> > runtime back to rw. But if the field only expects rdonly dynptr,
> > verifier knows that the type is const statically, so it triggers
> > failures for writes at load time instead.
> >
> > Taking this a step further, you may even store rw dynptr to a map
> > field expecting rdonly dynptr. That's like returning a const pointer
> > from a function for a rw memory, where you want to limit access of the
> > user, even better if you do it statically. Then functions trying to
> > write to dynptr loaded from said map field will fail load itself,
> > while others having access to rw dynptr can do it just fine.
> >
> > When the verifier does not know, it does not know. There will be such
> > cases when you make const-ness a runtime property.
> >
> > > >
> > > > It is fine if _everything_ checks it at runtime, but that doesn't seem
> > > > possible, hence the question. We would need a new slice helper that
> > > > only returns read-only slices, because dynptr_data can return rw
> > > > slices currently and it is already UAPI so changing that is not
> > > > possible anymore.
> > >
> > > I don't agree that if bpf_dynptr_write() is checked at runtime, then
> > > bpf_dynptr_data must also be checked at runtime to be consistent. I
> > > think it's fine if writes through helper functions are rejected at
> > > runtime, and writes through direct access are rejected at load time.
> > > That doesn't seem inconsistent to me.
> >
> > My point was more that dynptr_data cannot propagate runtime
> > read-only-ness to its returned pointer. The verifier has to know
> > statically, at which point I don't see why we can't just reject other
> > cases at load anyway.
>
> I think the right answer here is to not make bpf_dynptr_data() return
> direct pointer of changing read-only-ness. Maybe the right answer here
> is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
> non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
> read-only?

Shouldn't it be the other way around? bpf_dynptr_data_rdonly() should
work for both ro and rw dynptrs, and bpf_dynptr_data() only for rw
dynptr?

And yes, you're kind of painting yourself in a corner if you allow
dynptr_data to work with statically ro dynptrs now, ascertaining the
ro bit for the returned slice, and then later you plan to add dynptrs
whose ro vs rw is not known to the verifier statically. Then it works
well for statically known ones (returning both ro and rw slices), but
has to return NULL at runtime for statically unknown read only ones,
and always rw slice in verifier state for them.

>
> By saying that read-only-ness of dynptr should be statically known and
> rejecting some dynptr functions at load time places us at the mercy of
> verifier's complete knowledge of application logic, which is exactly
> against the spirit of dynptr.
>

Right, that might be too restrictive if we require them to be
statically read only.

But it's not about forcing it to be statically ro, it is more about
rejecting load when we know the program is incorrect (e.g. the types
are incorrect when passed to helpers), otherwise we throw the error at
runtime anyway, which seems to be the convention afaicu. But maybe I
missed the memo and we gradually want to move away from such strict
static checks.

I view the situation here similar to if we were rejecting direct
writes to PTR_TO_MEM | MEM_RDONLY at load time, but offloading as
runtime check in the helper writing to it as rw memory arg. It's as if
we pretend it's part of the 'type' of the register when doing direct
writes, but then ignore it while matching it against the said helper's
argument type.

> It's only slightly tangential, but I still dread my experience proving
> to BPF verifier that some value is strictly greater than zero for BPF
> helper that expected ARG_CONST_SIZE (not ARG_CONST_SIZE_OR_ZERO).
> There were also cases where an absolutely correct program had to be
> mangled just to prove to BPF verifier that it indeed can return just 0
> or 1, etc. This is not to bash BPF verifier, but just to point out
> that sometimes unnecessary strictness adds nothing but unnecessary
> pain to the user. So, let's not reject anything at load, we can check all
> that at runtime and return NULL.
>
> But bpf_dynptr_data_rdonly() seems useful for cases where we know we
> are not going to write.
>
> >
> > When we have dynptrs which have const-ness as a runtime property, it
> > is ok to support that by failing at runtime (but then you'll have a
> > hard time deciding how you want dynptr_data to work; most likely
> > you'll need another helper which returns only a rdonly slice, and when
> > it fails, we call dynptr_data for a rw slice).
> >
> > But as I said before, I don't know how dynptr is going to evolve in
> > the future, so you'll have a better idea, and I'll leave it up to you
> > to decide how you want to design its API. Enough words exchanged about
> > this :).
>
> directionally, dynptr is about offloading decisions to runtime, so I
> think avoiding unnecessary restrictions at verification time is the
> right trade off

Fair enough, I guess I've made my point. Let's go with what you both
feel would be best.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-27  7:11                     ` Kumar Kartikeya Dwivedi
@ 2022-08-27 17:21                       ` Andrii Nakryiko
  2022-08-27 18:32                         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-27 17:21 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Sat, Aug 27, 2022 at 12:12 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 27 Aug 2022 at 07:37, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 1:54 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Fri, 26 Aug 2022 at 21:49, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 26, 2022 at 11:52 AM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > >
> > > > > On Fri, 26 Aug 2022 at 20:44, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Aug 25, 2022 at 5:19 PM Kumar Kartikeya Dwivedi
> > > > > > <memxor@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, 25 Aug 2022 at 23:02, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > Related question, it seems we know statically if dynptr is read only
> > > > > > > > > or not, so why even do all this hidden parameter passing and instead
> > > > > > > > > just reject writes directly? You only need to be able to set
> > > > > > > > > MEM_RDONLY on dynptr_data returned PTR_TO_PACKETs, and reject
> > > > > > > > > dynptr_write when dynptr type is xdp/skb (and ctx is only one). That
> > > > > > > > > seems simpler than checking it at runtime. Verifier already handles
> > > > > > > > > MEM_RDONLY generically, you only need to add the guard for
> > > > > > > > > check_packet_acces (and check_helper_mem_access for meta->raw_mode
> > > > > > > > > under pkt case), and rejecting dynptr_write seems like a if condition.
> > > > > > > >
> > > > > > > > There will be other helper functions that do writes (eg memcpy to
> > > > > > > > dynptrs, strncpy to dynptrs, probe read user to dynptrs, hashing
> > > > > > > > dynptrs, ...) so it's more scalable if we reject these at runtime
> > > > > > > > rather than enforce these at the verifier level. I also think it's
> > > > > > > > cleaner to keep the verifier logic as simple as possible and do the
> > > > > > > > checking in the helper.
> > > > > > >
> > > > > > > I won't be pushing this further, since you know what you plan to add
> > > > > > > in the future better, but I still disagree.
> > > > > > >
> > > > > > > I'm guessing there might be dynptrs where this read only property is
> > > > > > > set dynamically at runtime, which is why you want to go this route?
> > > > > > > I.e. you might not know statically whether dynptr is read only or not?
> > > > > > >
> > > > > > > My main confusion is the inconsistency here.
> > > > > > >
> > > > > > > Right now the patch implicitly relies on may_access_direct_pkt_data to
> > > > > > > protect slices returned from dynptr_data, instead of setting
> > > > > > > MEM_RDONLY on the returned PTR_TO_PACKET. Which is fine, it's not
> > > > > > > needed. So indirectly, you are relying on knowing statically whether
> > > > > > > the dynptr is read only or not. But then you also set this bit at
> > > > > > > runtime.
> > > > > > >
> > > > > > > So you reject some cases at load time, and the rest of them only at
> > > > > > > runtime. Direct writes to dynptr slice fails load, writes through
> > > > > > > helper does not (only fails at runtime).
> > > > > > >
> > > > > > > Also, dynptr_data needs to know whether dynptr is read only
> > > > > > > statically, to protect writes to its returned pointer, unless you
> > > > > > > decide to introduce another helper for the dynamic rdonly bit case
> > > > > > > (like dynptr_data_rdonly). Then you have a mismatch, where dynptr_data
> > > > > > > works for some rdonly dynptrs (known to be rdonly statically, like
> > > > > > > this skb one), but not for others.
> > > > > > >
> > > > > > > I also don't agree about the complexity or scalability part, all the
> > > > > > > infra and precedence is already there. We already have similar checks
> > > > > > > for meta->raw_mode where we reject writes to read only pointers in
> > > > > > > check_helper_mem_access.
> > > > > >
> > > > > > My point about scalability is that if we reject bpf_dynptr_write() at
> > > > > > load time, then we must reject any future dynptr helper that does any
> > > > > > writing at load time as well, to be consistent.
> > > > > >
> > > > > > I don't feel strongly about whether we reject at load time or run
> > > > > > time. Rejecting at load time instead of runtime doesn't seem that
> > > > > > useful to me, but there's a good chance I'm wrong here since Martin
> > > > > > stated that he prefers rejecting at load time as well.
> > > > > >
> > > > > > As for the added complexity part, what I mean is that we'll need to
> > > > > > keep track of some more stuff to support this, such as whether the
> > > > > > dynptr is read only and which helper functions need to check whether
> > > > > > the dynptr is read only or not.
> > > > >
> > > > > What I'm trying to understand is how dynptr_data is supposed to work
> > > > > if this dynptr read only bit is only known at runtime. Or will it be
> > > > > always known statically so that it can set returned pointer as read
> > > > > only? Because then it doesn't seem it is required or useful to track
> > > > > the readonly bit at runtime.
> > > >
> > > > I think it'll always be known statically whether the dynptr is
> > > > read-only or not. If we make all writable dynptr helper functions
> > > > reject read-only dynptrs at load time instead of run time, then yes we
> > > > can remove the read-only bit in the bpf_dynptr_kern struct.
> > > >
> > > > There's also the question of whether this constraint (eg all read-only
> > > > writes are rejected at load time) is too rigid - for example, what if
> > > > in the future we want to add a helper function where if a certain
> > > > condition is met, then we write some number of bytes, else we read
> > > > some number of bytes? This would be not possible to add then, since
> > > > we'll only know at runtime whether the condition is met.
> > > >
> > > > I personally lean towards rejecting helper function writes at runtime,
> > > > but if you think it's a non-trivial benefit to reject at load time
> > > > instead, I'm fine going with that.
> > > >
> > >
> > > My personal opinion is this:
> > >
> > > When I am working with a statically known read only dynptr, it is like
> > > declaring a variable const. Every function expecting it to be
> > > non-const should fail compilation, and trying to mutate the variables
> > > through writes should also fail compilation. For BPF compilation is
> > > analogous to program load.
> > >
> > > It might be that said variable is not const, then those operations may
> > > fail at runtime due to some other reason. Being dynamically read-only
> > > is then a runtime failure condition, which will cause failure at
> > > runtime. Both are distinct cases in my mind, and it is fine to fail at
> > > runtime when we don't know. In general, you save a lot of time of the
> > > user in my opinion (esp. people new to things) if you reject known
> > > incorrectness as early as possible.
> > >
> > > E.g. load a dynptr from a map, where the field accepts storing both
> > > read only and non-read only ones. Then it is expected that writes may
> > > fail at runtime. That also allows you to switch read-only ness at
> > > runtime back to rw. But if the field only expects rdonly dynptr,
> > > verifier knows that the type is const statically, so it triggers
> > > failures for writes at load time instead.
> > >
> > > Taking this a step further, you may even store rw dynptr to a map
> > > field expecting rdonly dynptr. That's like returning a const pointer
> > > from a function for a rw memory, where you want to limit access of the
> > > user, even better if you do it statically. Then functions trying to
> > > write to dynptr loaded from said map field will fail load itself,
> > > while others having access to rw dynptr can do it just fine.
> > >
> > > When the verifier does not know, it does not know. There will be such
> > > cases when you make const-ness a runtime property.
> > >
> > > > >
> > > > > It is fine if _everything_ checks it at runtime, but that doesn't seem
> > > > > possible, hence the question. We would need a new slice helper that
> > > > > only returns read-only slices, because dynptr_data can return rw
> > > > > slices currently and it is already UAPI so changing that is not
> > > > > possible anymore.
> > > >
> > > > I don't agree that if bpf_dynptr_write() is checked at runtime, then
> > > > bpf_dynptr_data must also be checked at runtime to be consistent. I
> > > > think it's fine if writes through helper functions are rejected at
> > > > runtime, and writes through direct access are rejected at load time.
> > > > That doesn't seem inconsistent to me.
> > >
> > > My point was more that dynptr_data cannot propagate runtime
> > > read-only-ness to its returned pointer. The verifier has to know
> > > statically, at which point I don't see why we can't just reject other
> > > cases at load anyway.
> >
> > I think the right answer here is to not make bpf_dynptr_data() return
> > direct pointer of changing read-only-ness. Maybe the right answer here
> > is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
> > non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
> > read-only?
>
> Shouldn't it be the other way around? bpf_dynptr_data_rdonly() should
> work for both ro and rw dynptrs, and bpf_dynptr_data() only for rw
> dynptr?

Right, that's what I proposed:

  "bpf_dynptr_data_rdonly(), that will return NULL for non-read-only dynptr"

so if you pass read-write dynptr, it will return NULL (because it's
unsafe to take writable direct pointer).

bpf_dynptr_data_rdonly() should still work fine with both rdonly and
read-write dynptr.
bpf_dynptr_data() only works (in the sense of returning non-NULL) for
read-write dynptrs.
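
In (hypothetical) code, with the semantics just described --
bpf_dynptr_data_rdonly() doesn't exist, this is only the idea being
discussed:

    /* sketch: ptr may be a read-only or a read-write dynptr */
    static int use_slices(struct bpf_dynptr *ptr)
    {
            const __u8 *ro;
            __u8 *rw;

            /* works for rdonly and rw dynptrs alike; the returned
             * slice is PTR_TO_MEM | MEM_RDONLY, so direct writes
             * through it are rejected at load time
             */
            ro = bpf_dynptr_data_rdonly(ptr, 0, 16);

            /* non-NULL only if the dynptr is writable */
            rw = bpf_dynptr_data(ptr, 0, 16);
            if (rw)
                    rw[0] = 0xff;

            return ro ? 0 : -1;
    }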


>
> And yes, you're kind of painting yourself in a corner if you allow
> dynptr_data to work with statically ro dynptrs now, ascertaining the
> ro bit for the returned slice, and then later you plan to add dynptrs
> whose ro vs rw is not known to the verifier statically. Then it works
> well for statically known ones (returning both ro and rw slices), but
> has to return NULL at runtime for statically unknown read only ones,
> and always rw slice in verifier state for them.

Right, will be both inconsistent and puzzling.

>
> >
> > By saying that read-only-ness of dynptr should be statically known and
> > rejecting some dynptr functions at load time places us at the mercy of
> > verifier's complete knowledge of application logic, which is exactly
> > against the spirit of dynptr.
> >
>
> Right, that might be too restrictive if we require them to be
> statically read only.
>
> But it's not about forcing it to be statically ro, it is more about
> rejecting load when we know the program is incorrect (e.g. the types
> are incorrect when passed to helpers), otherwise we throw the error at
> runtime anyway, which seems to be the convention afaicu. But maybe I
> missed the memo and we gradually want to move away from such strict
> static checks.
>
> I view the situation here similar to if we were rejecting direct
> writes to PTR_TO_MEM | MEM_RDONLY at load time, but offloading as
> runtime check in the helper writing to it as rw memory arg. It's as if
> we pretend it's part of the 'type' of the register when doing direct
> writes, but then ignore it while matching it against the said helper's
> argument type.

I disagree, it's not the same. bpf_dynptr_data/bpf_dynptr_data_rdonly
turns completely dynamic dynptr into static slice of memory. Only
after that point it makes sense for BPF verifier to reject something.
Until then it's not incorrect. BPF program will always have to deal
with some dynptr operations potentially failing. dynptr can always be
NULL internally, you can still call bpf_dynptr_xxx() operations on it,
they will just do nothing and return error. That doesn't make BPF
program incorrect.

As I said, dynptr is about shifting static verification to runtime
checks for stuff that BPF verifier can't or shouldn't verify
statically. Like dynptr's dynamic size, for example. We've used
special data/data_end pointer types to try to solve this for skb, but
it is quite painful in practice and relies on the compiler generating
a "good" sequence of instructions understandable to the BPF verifier.

dynptr takes a different approach: it shifts validation to runtime and
"interfaces" with static verification through bpf_dynptr_data() and
similar APIs.
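
For comparison, the same bounds-safe header read both ways, inside an
SEC("tc") program (a sketch; bpf_dynptr_from_skb() is the helper this
series adds):

    /* data/data_end style: a manual check the verifier must be able
     * to follow through whatever instructions the compiler emits
     */
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct ethhdr *eth = data;

    if ((void *)(eth + 1) > data_end)
            return TC_ACT_OK;
    /* eth is now safe to dereference */

    /* dynptr style: the bounds check happens at runtime inside the
     * helper, no data/data_end arithmetic for the verifier to follow
     */
    struct bpf_dynptr ptr;
    struct ethhdr *eth2;

    bpf_dynptr_from_skb(skb, 0, &ptr);
    eth2 = bpf_dynptr_data(&ptr, 0, sizeof(*eth2));
    if (!eth2)
            return TC_ACT_OK;
    /* eth2 is now safe to dereference */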

>
> > It's only slightly tangential, but I still dread my experience proving
> > to BPF verifier that some value is strictly greater than zero for BPF
> > helper that expected ARG_CONST_SIZE (not ARG_CONST_SIZE_OR_ZERO).
> > There were also cases where an absolutely correct program had to be
> > mangled just to prove to BPF verifier that it indeed can return just 0
> > or 1, etc. This is not to bash BPF verifier, but just to point out
> > that sometimes unnecessary strictness adds nothing but unnecessary
> > pain to the user. So, let's not reject anything at load, we can check all
> > that at runtime and return NULL.
> >
> > But bpf_dynptr_data_rdonly() seems useful for cases where we know we
> > are not going to write.
> >
> > >
> > > When we have dynptrs which have const-ness as a runtime property, it
> > > is ok to support that by failing at runtime (but then you'll have a
> > > hard time deciding how you want dynptr_data to work; most likely
> > > you'll need another helper which returns only a rdonly slice, and when
> > > it fails, we call dynptr_data for a rw slice).
> > >
> > > But as I said before, I don't know how dynptr is going to evolve in
> > > the future, so you'll have a better idea, and I'll leave it up to you
> > > to decide how you want to design its API. Enough words exchanged about
> > > this :).
> >
> > directionally, dynptr is about offloading decisions to runtime, so I
> > think avoiding unnecessary restrictions at verification time is the
> > right trade off
>
> Fair enough, I guess I've made my point. Let's go with what you both
> feel would be best.

Sounds good. I tried to point out the "philosophy" behind dynptr. It
doesn't preclude making the BPF verifier smarter and having it track
more things. But it's a deliberate design choice of dynptr to shift
more checks to runtime, because in a lot of cases that makes more
sense than "fighting the BPF verifier". This "fighting the verifier"
phrase is like a meme now; I've heard it from multiple unrelated
engineers. The BPF verifier should be helpful, we shouldn't have
engineers "fighting" it, so overly strict verifier checks should be
avoided if they don't compromise correctness, IMO.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-27 17:21                       ` Andrii Nakryiko
@ 2022-08-27 18:32                         ` Kumar Kartikeya Dwivedi
  2022-08-27 19:16                           ` Kumar Kartikeya Dwivedi
  2022-08-27 23:03                           ` Andrii Nakryiko
  0 siblings, 2 replies; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-27 18:32 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Sat, 27 Aug 2022 at 19:22, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Sat, Aug 27, 2022 at 12:12 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> > [...]
> > >
> > > I think the right answer here is to not make bpf_dynptr_data() return
> > > direct pointer of changing read-only-ness. Maybe the right answer here
> > > is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
> > > non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
> > > read-only?
> >
> > Shouldn't it be the other way around? bpf_dynptr_data_rdonly() should
> > work for both ro and rw dynptrs, and bpf_dynptr_data() only for rw
> > dynptr?
>
> Right, that's what I proposed:
>
>   "bpf_dynptr_data_rdonly(), that will return NULL for non-read-only dynptr"
>
> so if you pass read-write dynptr, it will return NULL (because it's
> unsafe to take writable direct pointer).
>
> bpf_dynptr_data_rdonly() should still work fine with both rdonly and
> read-write dynptr.
> bpf_dynptr_data() only works (in the sense of returning non-NULL) for
> read-write dynptrs.
>
>
> >
> > And yes, you're kind of painting yourself in a corner if you allow
> > dynptr_data to work with statically ro dynptrs now, ascertaining the
> > ro bit for the returned slice, and then later you plan to add dynptrs
> > whose ro vs rw is not known to the verifier statically. Then it works
> > well for statically known ones (returning both ro and rw slices), but
> > has to return NULL at runtime for statically unknown read only ones,
> > and always rw slice in verifier state for them.
>
> Right, will be both inconsistent and puzzling.
>
> >
> > >
> > > By saying that read-only-ness of dynptr should be statically known and
> > > rejecting some dynptr functions at load time places us at the mercy of
> > > verifier's complete knowledge of application logic, which is exactly
> > > against the spirit of dynptr.
> > >
> >
> > Right, that might be too restrictive if we require them to be
> > statically read only.
> >
> > But it's not about forcing it to be statically ro, it is more about
> > rejecting load when we know the program is incorrect (e.g. the types
> > are incorrect when passed to helpers), otherwise we throw the error at
> > runtime anyway, which seems to be the convention afaicu. But maybe I
> > missed the memo and we gradually want to move away from such strict
> > static checks.
> >
> > I view the situation here similar to if we were rejecting direct
> > writes to PTR_TO_MEM | MEM_RDONLY at load time, but offloading as
> > runtime check in the helper writing to it as rw memory arg. It's as if
> > we pretend it's part of the 'type' of the register when doing direct
> > writes, but then ignore it while matching it against the said helper's
> > argument type.
>
> I disagree, it's not the same. bpf_dynptr_data/bpf_dynptr_data_rdonly
> turns completely dynamic dynptr into static slice of memory. Only
> after that point it makes sense for BPF verifier to reject something.
> Until then it's not incorrect. BPF program will always have to deal
> with some dynptr operations potentially failing. dynptr can always be
> NULL internally, you can still call bpf_dynptr_xxx() operations on it,
> they will just do nothing and return error. That doesn't make BPF
> program incorrect.

Let me just explain one last time why I'm unable to swallow this
'completely dynamic dynptr' explanation, because it is not treated as
completely dynamic by all dynptr helpers.

No pushback, but it would be great if you could further help me wrap
my head around this, so that we're in sync for future discussions.

So you say you may not know the type of dynptr (read-only, rw, local,
ringbuf, etc.). Hence you want to treat every dynptr as 'dynamic
dynptr' you know nothing about even when you do know some information
about it statically (e.g. if it's on stack). You don't want to reject
things early at load even if you have all the info to do so. You want
operations on it to fail at runtime instead.

If you cannot track ro vs rw in the future statically, you won't be
able to track local or ringbuf or skb either (since both are part of
the type of the dynptr). If you can, you can just as well encode
const-ness as part of the type where you declare it (e.g. in a map
field where the value is assigned dynamically, where you say
dynptr_ringbuf etc.). Type comprises local vs skb vs ringbuf | ro vs
rw for me. But maybe I could be wrong.

So following this line of reasoning, will you be relaxing the argument
type of helpers like bpf_ringbuf_submit_dynptr? Right now they take
'dynamic dynptr' as arg, but also expect DYNPTR_TYPE_RINGBUF, so you
reject load when it's not a ringbuf dynptr. Will it then fallback to
checking the type at runtime when the type will not be known? But then
it will also permit passing local or skb dynptr in the future which it
rejects right now.
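
(For reference, that static check looks like this from the program's
side -- a sketch using the helper this series adds:)

    struct bpf_dynptr ptr;

    bpf_dynptr_from_skb(skb, 0, &ptr);      /* skb-typed dynptr */

    /* rejected at load time today: the argument must be a
     * DYNPTR_TYPE_RINGBUF dynptr, and the type is tracked statically
     */
    bpf_ringbuf_submit_dynptr(&ptr, 0);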

I'm just hoping you are able to see why it's looking a bit
inconsistent to me. If DYNPTR_TYPE_RINGBUF has to be checked, it felt
to me like DYNPTR_RDONLY is as much part of that kind of type safety
wrt helpers. It would be set on the dynptr when skb passed to
dynptr_from_skb is rdonly in some program types, along with
DYNPTR_TYPE_SKB, and expect to be matched when invoking helpers on
such dynptrs.

It is fine to push checks to runtime, especially when you won't know
the type, because verifier cannot reasonably track the dynptr type
then.

But right now there's still some level of state you maintain in the
verifier about dynptrs (like its type), and it seems to me like some
helpers are using that state to reject things at load time, while some
other helper will ignore it and fall back to runtime checks.

I hope this is a clear enough description to at least justify why I'm
(still) a bit confused.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-27 18:32                         ` Kumar Kartikeya Dwivedi
@ 2022-08-27 19:16                           ` Kumar Kartikeya Dwivedi
  2022-08-27 23:03                           ` Andrii Nakryiko
  1 sibling, 0 replies; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-27 19:16 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Sat, 27 Aug 2022 at 20:32, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Sat, 27 Aug 2022 at 19:22, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sat, Aug 27, 2022 at 12:12 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > > [...]
> > > >
> > > > I think the right answer here is to not make bpf_dynptr_data() return
> > > > direct pointer of changing read-only-ness. Maybe the right answer here
> > > > is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
> > > > non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
> > > > read-only?
> > >
> > > Shouldn't it be the other way around? bpf_dynptr_data_rdonly() should
> > > work for both ro and rw dynptrs, and bpf_dynptr_data() only for rw
> > > dynptr?
> >
> > Right, that's what I proposed:
> >
> >   "bpf_dynptr_data_rdonly(), that will return NULL for non-read-only dynptr"
> >
> > so if you pass read-write dynptr, it will return NULL (because it's
> > unsafe to take writable direct pointer).
> >
> > bpf_dynptr_data_rdonly() should still work fine with both rdonly and
> > read-write dynptr.
> > bpf_dynptr_data() only works (in the sense of returning non-NULL) for
> > read-write dynptrs.
> >
> >
> > >
> > > And yes, you're kind of painting yourself in a corner if you allow
> > > dynptr_data to work with statically ro dynptrs now, ascertaining the
> > > ro bit for the returned slice, and then later you plan to add dynptrs
> > > whose ro vs rw is not known to the verifier statically. Then it works
> > > well for statically known ones (returning both ro and rw slices), but
> > > has to return NULL at runtime for statically unknown read only ones,
> > > and always rw slice in verifier state for them.
> >
> > Right, will be both inconsistent and puzzling.
> >
> > >
> > > >
> > > > By saying that read-only-ness of dynptr should be statically known and
> > > > rejecting some dynptr functions at load time places us at the mercy of
> > > > verifier's complete knowledge of application logic, which is exactly
> > > > against the spirit of dynptr.
> > > >
> > >
> > > Right, that might be too restrictive if we require them to be
> > > statically read only.
> > >
> > > But it's not about forcing it to be statically ro, it is more about
> > > rejecting load when we know the program is incorrect (e.g. the types
> > > are incorrect when passed to helpers), otherwise we throw the error at
> > > runtime anyway, which seems to be the convention afaicu. But maybe I
> > > missed the memo and we gradually want to move away from such strict
> > > static checks.
> > >
> > > I view the situation here similar to if we were rejecting direct
> > > writes to PTR_TO_MEM | MEM_RDONLY at load time, but offloading as
> > > runtime check in the helper writing to it as rw memory arg. It's as if
> > > we pretend it's part of the 'type' of the register when doing direct
> > > writes, but then ignore it while matching it against the said helper's
> > > argument type.
> >
> > I disagree, it's not the same. bpf_dynptr_data/bpf_dynptr_data_rdonly
> > turns completely dynamic dynptr into static slice of memory. Only
> > after that point it makes sense for BPF verifier to reject something.
> > Until then it's not incorrect. BPF program will always have to deal
> > with some dynptr operations potentially failing. dynptr can always be
> > NULL internally, you can still call bpf_dynptr_xxx() operations on it,
> > they will just do nothing and return error. That doesn't make BPF
> > program incorrect.
>
> Let me just explain one last time why I'm unable to swallow this
> 'completely dynamic dynptr' explanation, because it is not treated as
> completely dynamic by all dynptr helpers.
>
> No pushback, but it would be great if you could further help me wrap
> my head around this, so that we're in sync for future discussions.
>
> So you say you may not know the type of dynptr (read-only, rw, local,
> ringbuf, etc.). Hence you want to treat every dynptr as 'dynamic
> dynptr' you know nothing about even when you do know some information
> about it statically (e.g. if it's on stack). You don't want to reject
> things early at load even if you have all the info to do so. You want
> operations on it to fail at runtime instead.
>
> If you cannot track ro vs rw in the future statically, you won't be
> able to track local or ringbuf or skb either (since both are part of
> the type of the dynptr). If you can, you can just as well encode
> const-ness as part of the type where you declare it (e.g. in a map
> field where the value is assigned dynamically, where you say
> dynptr_ringbuf etc.). Type comprises local vs skb vs ringbuf | ro vs
> rw for me. But maybe I could be wrong.
>
> So following this line of reasoning, will you be relaxing the argument
> type of helpers like bpf_ringbuf_submit_dynptr? Right now they take
> 'dynamic dynptr' as arg, but also expect DYNPTR_TYPE_RINGBUF, so you
> reject load when it's not a ringbuf dynptr. Will it then fallback to
> checking the type at runtime when the type will not be known? But then
> it will also permit passing local or skb dynptr in the future which it
> rejects right now.
>
> I'm just hoping you are able to see why it's looking a bit
> inconsistent to me. If DYNPTR_TYPE_RINGBUF has to be checked, it felt
> to me like DYNPTR_RDONLY is as much part of that kind of type safety
> wrt helpers. It would be set on the dynptr when skb passed to
> dynptr_from_skb is rdonly in some program types, along with
> DYNPTR_TYPE_SKB, and expect to be matched when invoking helpers on
> such dynptrs.
>
> It is fine to push checks to runtime, especially when you won't know
> the type, because verifier cannot reasonably track the dynptr type
> then.
>

To further expand on this point, in my mind when you have actual
'completely dynamic dynptrs', you will most likely track them as
DYNPTR_TYPE_UNKNOWN in the verifier. Then you would capture free bits
to encode the types at runtime, and start checking that in helpers.

All helpers will start taking such DYNPTR_TYPE_UNKNOWN, or you can
have a casting helper like we already have for normal pointers.
No need for extra verifier complexity to teach it to recognize type
when it is unknown. It will then be checked dynamically at runtime for
dynptr operations.
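
A sketch of what that could look like on the kernel side. The bit
layout below is hypothetical, but bpf_dynptr_kern from the first
dynptr series already stashes metadata (the rdonly bit) in the upper
bits of its size field:

    /* hypothetical: bits 0-23 = size, bits 28-30 = type, bit 31
     * stays the rdonly flag
     */
    #define DYNPTR_TYPE_SHIFT       28

    static enum bpf_dynptr_type
    dynptr_get_type(const struct bpf_dynptr_kern *ptr)
    {
            return (ptr->size >> DYNPTR_TYPE_SHIFT) & 0x7;
    }

    /* helpers taking a DYNPTR_TYPE_UNKNOWN arg would call this
     * instead of relying on a load-time verifier check
     */
    static int check_dynptr_type(const struct bpf_dynptr_kern *ptr,
                                 enum bpf_dynptr_type expected)
    {
            return dynptr_get_type(ptr) == expected ? 0 : -EINVAL;
    }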

Being strictly type safe by default for helper arguments is not going
to lead you to 'fighting the verifier'. Very much the opposite: the
verifier is helping you catch errors that would otherwise only show up
at runtime anyway.

> But right now there's still some level of state you maintain in the
> verifier about dynptrs (like its type), and it seems to me like some
> helpers are using that state to reject things at load time, while some
> other helper will ignore it and fall back to runtime checks.
>
> I hope this is a clear enough description to at least justify why I'm
> (still) a bit confused.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-27 18:32                         ` Kumar Kartikeya Dwivedi
  2022-08-27 19:16                           ` Kumar Kartikeya Dwivedi
@ 2022-08-27 23:03                           ` Andrii Nakryiko
  2022-08-27 23:47                             ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 43+ messages in thread
From: Andrii Nakryiko @ 2022-08-27 23:03 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Sat, Aug 27, 2022 at 11:33 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Sat, 27 Aug 2022 at 19:22, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sat, Aug 27, 2022 at 12:12 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > > [...]
> > > >
> > > > I think the right answer here is to not make bpf_dynptr_data() return
> > > > direct pointer of changing read-only-ness. Maybe the right answer here
> > > > is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
> > > > non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
> > > > read-only?
> > >
> > > Shouldn't it be the other way around? bpf_dynptr_data_rdonly() should
> > > work for both ro and rw dynptrs, and bpf_dynptr_data() only for rw
> > > dynptr?
> >
> > Right, that's what I proposed:
> >
> >   "bpf_dynptr_data_rdonly(), that will return NULL for non-read-only dynptr"
> >
> > so if you pass read-write dynptr, it will return NULL (because it's
> > unsafe to take writable direct pointer).
> >
> > bpf_dynptr_data_rdonly() should still work fine with both rdonly and
> > read-write dynptr.
> > bpf_dynptr_data() only works (in the sense of returning non-NULL) for
> > read-write dynptrs.
> >
> >
> > >
> > > And yes, you're kind of painting yourself in a corner if you allow
> > > dynptr_data to work with statically ro dynptrs now, ascertaining the
> > > ro bit for the returned slice, and then later you plan to add dynptrs
> > > whose ro vs rw is not known to the verifier statically. Then it works
> > > well for statically known ones (returning both ro and rw slices), but
> > > has to return NULL at runtime for statically unknown read only ones,
> > > and always rw slice in verifier state for them.
> >
> > Right, will be both inconsistent and puzzling.
> >
> > >
> > > >
> > > > By saying that read-only-ness of dynptr should be statically known and
> > > > rejecting some dynptr functions at load time places us at the mercy of
> > > > verifier's complete knowledge of application logic, which is exactly
> > > > against the spirit of dynptr.
> > > >
> > >
> > > Right, that might be too restrictive if we require them to be
> > > statically read only.
> > >
> > > But it's not about forcing it to be statically ro, it is more about
> > > rejecting load when we know the program is incorrect (e.g. the types
> > > are incorrect when passed to helpers), otherwise we throw the error at
> > > runtime anyway, which seems to be the convention afaicu. But maybe I
> > > missed the memo and we gradually want to move away from such strict
> > > static checks.
> > >
> > > I view the situation here similar to if we were rejecting direct
> > > writes to PTR_TO_MEM | MEM_RDONLY at load time, but offloading as
> > > runtime check in the helper writing to it as rw memory arg. It's as if
> > > we pretend it's part of the 'type' of the register when doing direct
> > > writes, but then ignore it while matching it against the said helper's
> > > argument type.
> >
> > I disagree, it's not the same. bpf_dynptr_data/bpf_dynptr_data_rdonly
> > turns completely dynamic dynptr into static slice of memory. Only
> > after that point it makes sense for BPF verifier to reject something.
> > Until then it's not incorrect. BPF program will always have to deal
> > with some dynptr operations potentially failing. dynptr can always be
> > NULL internally, you can still call bpf_dynptr_xxx() operations on it,
> > they will just do nothing and return error. That doesn't make BPF
> > program incorrect.
>
> Let me just explain one last time why I'm unable to swallow this
> 'completely dynamic dynptr' explanation, because it is not treated as
> completely dynamic by all dynptr helpers.
>
> No pushback, but it would be great if you could further help me wrap
> my head around this, so that we're in sync for future discussions.
>
> So you say you may not know the type of dynptr (read-only, rw, local,
> ringbuf, etc.). Hence you want to treat every dynptr as 'dynamic

No, that's not what I said. I didn't talk about dynptr type (ringbuf
vs skb vs local). The type we do have to track statically, precisely
to limit what kind of helpers can be used on dynptrs of a specific
type.

But note that there are helpers that don't care about dynptr type and,
in theory, should work with any dynptr, because dynptr is an
abstraction of a "logically contiguous range of memory". So in some
cases even types don't matter.

But regardless, I just don't think about the read-only bit as part of
the dynptr type. I don't see why it has to be statically known, or
even why it has to stay read-only or read-write for the entire
duration of the dynptr's lifetime. Why can't we allow flipping a
dynptr from read-write to read-only before passing it to some custom
BPF subprog, potentially provided by some BPF library which you don't
control, just to make sure that it cannot modify the underlying memory
(even if that memory is fundamentally modifiable, like with ringbuf)?
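
E.g. something like this, where bpf_dynptr_set_read_only() and
library_subprog() are purely hypothetical (no such helper exists or is
proposed here):

    struct {
            __uint(type, BPF_MAP_TYPE_RINGBUF);
            __uint(max_entries, 4096);
    } rb SEC(".maps");

    SEC("xdp")
    int produce(struct xdp_md *ctx)
    {
            struct bpf_dynptr ptr;

            bpf_ringbuf_reserve_dynptr(&rb, 64, 0, &ptr);

            /* hypothetical: flip the dynptr to read-only at runtime
             * before handing it to library code we don't control
             */
            bpf_dynptr_set_read_only(&ptr);
            library_subprog(&ptr);  /* its bpf_dynptr_write() calls
                                     * now fail at runtime */

            bpf_ringbuf_submit_dynptr(&ptr, 0);
            return XDP_PASS;
    }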

So the point I'm trying to communicate is that things that don't
**need** to be knowable statically to the BPF verifier **should not**
be knowable, and should be checked at runtime only. Any dynptr helper
that interfaces this runtime nature into something static that the
verifier enforces (which is what bpf_dynptr_data/bpf_dynptr_data_rdonly
are) will always be "fallible", and users will have to expect that it
might return NULL, do nothing, or fail in whatever way the API
defines.


And I believe dynptr read-onliness is not something that the BPF
verifier should be tracking statically. PTR_TO_MEM -- yes, dynptr --
no. That's it. Dynptr type/kind -- yes, track statically, it's way
more important for determining what you can do with a dynptr at all.


> dynptr' you know nothing about even when you do know some information
> about it statically (e.g. if it's on stack). You don't want to reject
> things early at load even if you have all the info to do so. You want
> operations on it to fail at runtime instead.
>
> If you cannot track ro vs rw in the future statically, you won't be
> able to track local or ringbuf or skb either (since both are part of
> the type of the dynptr). If you can, you can just as well encode
> const-ness as part of the type where you declare it (e.g. in a map
> field where the value is assigned dynamically, where you say
> dynptr_ringbuf etc.). Type comprises local vs skb vs ringbuf | ro vs
> rw for me. But maybe I could be wrong.

I don't consider read-only to be a part of the type. It's more like
offset and size to me, rather than type. And even if we could track
constness (e.g., we could teach the BPF verifier to recognize that a
hypothetical bpf_dynptr_set_read_only() turns a dynptr into
dynptr_rdonly), why would we?

>
> So following this line of reasoning, will you be relaxing the argument
> type of helpers like bpf_ringbuf_submit_dynptr? Right now they take
> 'dynamic dynptr' as arg, but also expect DYNPTR_TYPE_RINGBUF, so you
> reject load when it's not a ringbuf dynptr. Will it then fall back to
> checking the type at runtime when the type will not be known? But then
> it will also permit passing local or skb dynptr in the future which it
> rejects right now.

No, which is exactly why we track dynptr type statically. Because some
operations don't make sense and they have additional semantics (like
with ringbuf submit).
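
(For comparison, a sketch of the ringbuf flow that this static type
tracking protects, using the existing ringbuf dynptr helpers with
their signatures as of this series:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 4096);
} rb SEC(".maps");

SEC("tp/syscalls/sys_enter_write")
int log_write(void *ctx)
{
        char msg[] = "hello";
        struct bpf_dynptr ptr;

        if (bpf_ringbuf_reserve_dynptr(&rb, sizeof(msg), 0, &ptr)) {
                /* Reservation failed: ptr is NULL internally, but the
                 * verifier still requires releasing it, and discarding
                 * a NULL dynptr is a safe no-op. */
                bpf_ringbuf_discard_dynptr(&ptr, 0);
                return 0;
        }
        bpf_dynptr_write(&ptr, 0, msg, sizeof(msg), 0);
        /* Only a ringbuf dynptr is accepted here; passing a local or
         * skb dynptr is rejected at load time because submitting has
         * extra semantics tied to ringbuf memory. */
        bpf_ringbuf_submit_dynptr(&ptr, 0);
        return 0;
}

char _license[] SEC("license") = "GPL";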

>
> I'm just hoping you are able to see why it's looking a bit
> inconsistent to me. If DYNPTR_TYPE_RINGBUF has to be checked, it felt
> to me like DYNPTR_RDONLY is as much part of that kind of type safety
> wrt helpers. It would be set on the dynptr when skb passed to
> dynptr_from_skb is rdonly in some program types, along with
> DYNPTR_TYPE_SKB, and expect to be matched when invoking helpers on
> such dynptrs.

I understand your viewpoint. But you are taking the "completely
dynamic" nature of dynptr too far (not tracking dynptr type) in one
case, and making it too restrictive (tracking read-only state) in
another. We can argue about what constitutes "inconsistent", but I'm
just leaning towards slightly different tradeoffs. I don't see a huge
value in preventing bpf_dynptr_write() from being called on a
read-only dynptr **statically**. I've run into various cases where I'd
rather the BPF verifier not pretend to be all-knowing and too helpful,
because it sometimes turns into a game of proving to the BPF verifier
that "I know what I'm doing, and it's correct, so please just let me
do it".
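
(Concretely, under this model a write is just another fallible call;
a sketch assuming the runtime rejection of writes to read-only skbs
proposed in this series:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup_skb/egress")
int touch_pkt(struct __sk_buff *skb)
{
        char byte = 0;
        struct bpf_dynptr ptr;
        long err;

        bpf_dynptr_from_skb(skb, 0, &ptr);

        /* If the skb is read-only for this program type, the dynptr is
         * marked rdonly at runtime and the write fails with an error,
         * instead of the verifier rejecting the program at load time. */
        err = bpf_dynptr_write(&ptr, 0, &byte, sizeof(byte), 0);
        if (err)
                return 1;       /* handle/count the failure and move on */
        return 1;
}

char _license[] SEC("license") = "GPL";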

>
> It is fine to push checks to runtime, especially when you won't know
> the type, because verifier cannot reasonably track the dynptr type
> then.
>
> But right now there's still some level of state you maintain in the
> verifier about dynptrs (like its type), and it seems to me like some
> helpers are using that state to reject things at load time, while some
> other helper will ignore it and fallback to runtime checks.

Exactly. There are helpers that can fail no matter what, and
read-onliness is just another condition they will have to check and
reject.

And some helpers really don't and shouldn't care about the nature of
the dynptr (bpf_dynptr_get_size(), for example).
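
(For that last category the sketch is trivial. Note that as of this
series bpf_dynptr_get_size() exists only as an internal kernel
function, so the helper form here is hypothetical:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical helper form of the internal kernel function. */
extern __u32 bpf_dynptr_get_size(struct bpf_dynptr *ptr);

static __always_inline int big_enough(struct bpf_dynptr *ptr, __u32 min_len)
{
        /* Works for any dynptr kind, read-only or read-write: the size
         * is just runtime state carried by the dynptr itself. */
        return bpf_dynptr_get_size(ptr) >= min_len;
}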

>
> I hope this is a clear enough description to at least justify why I'm
> (still) a bit confused.

I don't think you are confused; you just prefer a different tradeoff.
Hopefully I gave a few more arguments in favor of less static
treatment of dynptr where it doesn't hurt correctness.

P.S. I'm going away on vacation, so unlikely to be able to continue
this discussion in a timely manner going forward.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs
  2022-08-27 23:03                           ` Andrii Nakryiko
@ 2022-08-27 23:47                             ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 43+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-27 23:47 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, andrii, daniel, ast, kafai, kuba, netdev

On Sun, 28 Aug 2022 at 01:03, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Sat, Aug 27, 2022 at 11:33 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Sat, 27 Aug 2022 at 19:22, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Sat, Aug 27, 2022 at 12:12 AM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > > [...]
> > > > >
> > > > > I think the right answer here is to not make bpf_dynptr_data() return
> > > > > direct pointer of changing read-only-ness. Maybe the right answer here
> > > > > is another helper, bpf_dynptr_data_rdonly(), that will return NULL for
> > > > > non-read-only dynptr and PTR_TO_MEM | MEM_RDONLY if dynptr is indeed
> > > > > read-only?
> > > >
> > > > Shouldn't it be the other way around? bpf_dynptr_data_rdonly() should
> > > > work for both ro and rw dynptrs, and bpf_dynptr_data() only for rw
> > > > dynptr?
> > >
> > > Right, that's what I proposed:
> > >
> > >   "bpf_dynptr_data_rdonly(), that will return NULL for non-read-only dynptr"
> > >
> > > so if you pass read-write dynptr, it will return NULL (because it's
> > > unsafe to take writable direct pointer).
> > >
> > > bpf_dynptr_data_rdonly() should still work fine with both rdonly and
> > > read-write dynptr.
> > > bpf_dynptr_data() only works (in the sense returns non-NULL) for
> > > read-write dynptr only.
> > >
> > >
> > > >
> > > > And yes, you're kind of painting yourself in a corner if you allow
> > > > dynptr_data to work with statically ro dynptrs now, ascertaining the
> > > > ro bit for the returned slice, and then later you plan to add dynptrs
> > > > whose ro vs rw is not known to the verifier statically. Then it works
> > > > well for statically known ones (returning both ro and rw slices), but
> > > > has to return NULL at runtime for statically unknown read only ones,
> > > > and always rw slice in verifier state for them.
> > >
> > > Right, will be both inconsistent and puzzling.
> > >
> > > >
> > > > >
> > > > > By saying that read-only-ness of dynptr should be statically known and
> > > > > rejecting some dynptr functions at load time places us at the mercy of
> > > > > verifier's complete knowledge of application logic, which is exactly
> > > > > against the spirit of dynptr.
> > > > >
> > > >
> > > > Right, that might be too restrictive if we require them to be
> > > > statically read only.
> > > >
> > > > But it's not about forcing it to be statically ro, it is more about
> > > > rejecting load when we know the program is incorrect (e.g. the types
> > > > are incorrect when passed to helpers), otherwise we throw the error at
> > > > runtime anyway, which seems to be the convention afaicu. But maybe I
> > > > missed the memo and we gradually want to move away from such strict
> > > > static checks.
> > > >
> > > > I view the situation here similar to if we were rejecting direct
> > > > writes to PTR_TO_MEM | MEM_RDONLY at load time, but offloading as
> > > > runtime check in the helper writing to it as rw memory arg. It's as if
> > > > we pretend it's part of the 'type' of the register when doing direct
> > > > writes, but then ignore it while matching it against the said helper's
> > > > argument type.
> > >
> > > I disagree, it's not the same. bpf_dynptr_data/bpf_dynptr_data_rdonly
> > > turns completely dynamic dynptr into static slice of memory. Only
> > > after that point it makes sense for BPF verifier to reject something.
> > > Until then it's not incorrect. BPF program will always have to deal
> > > with some dynptr operations potentially failing. dynptr can always be
> > > NULL internally, you can still call bpf_dynptr_xxx() operations on it,
> > > they will just do nothing and return error. That doesn't make BPF
> > > program incorrect.
> >
> > Let me just explain one last time why I'm unable to swallow this
> > 'completely dynamic dynptr' explanation, because it is not treated as
> > completely dynamic by all dynptr helpers.
> >
> > No pushback, but it would be great if you could further help me wrap
> > my head around this, so that we're in sync for future discussions.
> >
> > So you say you may not know the type of dynptr (read-only, rw, local,
> > ringbuf, etc.). Hence you want to treat every dynptr as 'dynamic
>
> No, that's not what I said. I didn't talk about dynptr type (ringbuf
> vs skb vs local). The type we have to track statically, precisely to
> limit what kinds of helpers can be used on dynptrs of a specific type.
>
> But note that there are helpers that don't care about dynptr type
> and, in theory, should work with any dynptr, because a dynptr is an
> abstraction of a "logically contiguous range of memory". So in some
> cases even types don't matter.
>
> But regardless, I just don't think about the read-only bit as part of
> the dynptr type. I don't see why it has to be statically known, or
> even why it has to stay read-only or read-write for the entire
> duration of a dynptr's lifetime. Why can't we allow flipping a dynptr
> from read-write to read-only before passing it to some custom BPF
> subprog, potentially provided by some BPF library which you don't
> control, just to make sure that it cannot really modify the underlying
> memory (even if that memory is fundamentally modifiable, like with
> ringbuf)?
>
> So the point I'm trying to communicate is that things that don't
> **need** to be knowable statically to the BPF verifier **should not**
> be knowable, and should be checked at runtime only. Any dynptr helper
> that surfaces this runtime nature into a static thing that the
> verifier enforces (which is what bpf_dynptr_data/bpf_dynptr_data_rdonly
> are) will always be "fallible", and users will have to expect that
> such helpers might return NULL, do nothing, or signal failure in some
> other way.
>
>
> And I believe dynptr read-onliness is not something that the BPF
> verifier should be tracking statically. PTR_TO_MEM -- yes, dynptr --
> no. That's it. Dynptr type/kind -- yes, track statically; it's way
> more important for determining what you can do with a dynptr at all.
>

Understood, it's clearer now how you're thinking about this :),
especially that you don't consider the read-only bit to be a property
of the dynptr type. I think that was the main point I was not
following.

And yes, I guess it's about different tradeoffs. Taking your BPF
library example, my idea would be that we will be verifying the global
or static subprog in the library against the BTF of its function
prototype anyway. If the subprog is global, the verifier has to verify
its safety by setting up argument types.

There, I find it more appropriate for global subprogs to declare const
bpf_dynptr * as the argument type if the intent is to not write to the
dynptr. Then the user can safely pass both rw and ro dynptrs to it,
without having to flip them each time at runtime just to ensure
safety. If the subprog is static, the verifier will also be able to
catch incorrect writes for callers.
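
(Roughly what I have in mind; a hypothetical sketch, since dynptr
arguments to global subprogs aren't actually supported today:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Verified standalone against its BTF prototype; the const makes the
 * no-write intent part of the interface for both rw and ro callers. */
__noinline int lib_parse(const struct bpf_dynptr *ptr)
{
        __u8 byte;

        /* Reads are fine for any dynptr... */
        if (bpf_dynptr_read(&byte, sizeof(byte), ptr, 0, 0))
                return -1;

        /* ...but a bpf_dynptr_write(ptr, ...) here would be caught
         * against the const argument for all callers at load time. */
        return byte;
}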

But indeed there may be cases where you want to control this at
runtime; IMPO that should be orthogonal to type-level const-ness.
Those will be rw at the type level but may change const-ness at
runtime, at which point operations start failing at runtime.

Anyway, I guess we are done here. Again, thanks for sticking with this
and taking the time to describe it in detail :).

>
> [...]

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2022-08-27 23:47 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-22 23:56 [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs Joanne Koong
2022-08-22 23:56 ` [PATCH bpf-next v4 1/3] bpf: Add skb dynptrs Joanne Koong
2022-08-23 23:22   ` kernel test robot
2022-08-23 23:53   ` kernel test robot
2022-08-24 18:27   ` Andrii Nakryiko
2022-08-24 23:25     ` Kumar Kartikeya Dwivedi
2022-08-25 21:02       ` Joanne Koong
2022-08-26  0:18         ` Kumar Kartikeya Dwivedi
2022-08-26 18:44           ` Joanne Koong
2022-08-26 18:51             ` Kumar Kartikeya Dwivedi
2022-08-26 19:49               ` Joanne Koong
2022-08-26 20:54                 ` Kumar Kartikeya Dwivedi
2022-08-27  5:36                   ` Andrii Nakryiko
2022-08-27  7:11                     ` Kumar Kartikeya Dwivedi
2022-08-27 17:21                       ` Andrii Nakryiko
2022-08-27 18:32                         ` Kumar Kartikeya Dwivedi
2022-08-27 19:16                           ` Kumar Kartikeya Dwivedi
2022-08-27 23:03                           ` Andrii Nakryiko
2022-08-27 23:47                             ` Kumar Kartikeya Dwivedi
2022-08-22 23:56 ` [PATCH bpf-next v4 2/3] bpf: Add xdp dynptrs Joanne Koong
2022-08-23  2:30   ` Kumar Kartikeya Dwivedi
2022-08-23 22:26     ` Joanne Koong
2022-08-24 10:39       ` Toke Høiland-Jørgensen
2022-08-24 18:10         ` Joanne Koong
2022-08-24 23:04           ` Kumar Kartikeya Dwivedi
2022-08-25 20:14             ` Joanne Koong
2022-08-25 21:53             ` Andrii Nakryiko
2022-08-26  6:37             ` Martin KaFai Lau
2022-08-26  6:50               ` Martin KaFai Lau
2022-08-26 19:09               ` Kumar Kartikeya Dwivedi
2022-08-26 20:47                 ` Joanne Koong
2022-08-24 21:10       ` Kumar Kartikeya Dwivedi
2022-08-25 20:39         ` Joanne Koong
2022-08-25 23:18           ` Kumar Kartikeya Dwivedi
2022-08-26 18:23             ` Joanne Koong
2022-08-26 18:31               ` Kumar Kartikeya Dwivedi
2022-08-24  1:15   ` kernel test robot
2022-08-22 23:56 ` [PATCH bpf-next v4 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
2022-08-24 18:47   ` Andrii Nakryiko
2022-08-23  2:31 ` [PATCH bpf-next v4 0/3] Add skb + xdp dynptrs Kumar Kartikeya Dwivedi
2022-08-23 18:52   ` Joanne Koong
2022-08-24 18:01     ` Andrii Nakryiko
2022-08-24 23:18       ` Kumar Kartikeya Dwivedi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).