bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v1 0/3] Add skb + xdp dynptrs
@ 2022-07-26 18:47 Joanne Koong
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Joanne Koong @ 2022-07-26 18:47 UTC
  To: bpf; +Cc: andrii, daniel, ast, Joanne Koong

This patchset is the 2nd in the dynptr series. The 1st can be found here [0].

This patchset adds skb and xdp type dynptrs, which have two main benefits for
packet parsing:
    * allowing operations on sizes that are not statically known at
      compile-time (eg variable-sized accesses)
    * more ergonomic and less brittle iteration through data (eg no manual
      'if' checks against data_end are needed; see the sketch below)
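
For illustration, a minimal sketch of dynptr-based parsing (not part of
this series' diffs; it assumes the bpf_dynptr_from_skb helper added in
patch 1, and the prog/function names are made up):

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  SEC("tc")
  int parse_eth(struct __sk_buff *skb)
  {
          struct bpf_dynptr ptr;
          struct ethhdr *eth;

          bpf_dynptr_from_skb(skb, 0, &ptr);

          /* the helper does the bounds check; no manual skb->data /
           * skb->data_end comparisons are needed
           */
          eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
          if (!eth)
                  return TC_ACT_SHOT;

          return eth->h_proto == bpf_htons(ETH_P_IP) ? TC_ACT_OK
                                                     : TC_ACT_SHOT;
  }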

Comparing the runtime of packet parsing without dynptrs vs. with
dynptrs shows no noticeable difference (see patch 3 for more details).

[0] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/

Joanne Koong (3):
  bpf: Add skb dynptrs
  bpf: Add xdp dynptrs
  selftests/bpf: tests for using dynptrs to parse skb and xdp buffers

 include/linux/bpf.h                           |  14 ++-
 include/linux/filter.h                        |   7 ++
 include/uapi/linux/bpf.h                      |  58 ++++++++-
 kernel/bpf/helpers.c                          |  64 +++++++++-
 kernel/bpf/verifier.c                         |  48 +++++++-
 net/core/filter.c                             |  99 +++++++++++++--
 tools/include/uapi/linux/bpf.h                |  58 ++++++++-
 .../testing/selftests/bpf/prog_tests/dynptr.c |  85 ++++++++++---
 .../selftests/bpf/prog_tests/dynptr_xdp.c     |  49 ++++++++
 .../testing/selftests/bpf/progs/dynptr_fail.c |  76 ++++++++++++
 .../selftests/bpf/progs/dynptr_success.c      |  32 +++++
 .../selftests/bpf/progs/test_dynptr_xdp.c     | 115 ++++++++++++++++++
 .../selftests/bpf/progs/test_l4lb_noinline.c  |  71 +++++------
 tools/testing/selftests/bpf/progs/test_xdp.c  |  95 +++++++--------
 .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
 15 files changed, 741 insertions(+), 131 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_dynptr_xdp.c

-- 
2.30.2



* [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 [PATCH bpf-next v1 0/3] Add skb + xdp dynptrs Joanne Koong
@ 2022-07-26 18:47 ` Joanne Koong
  2022-07-27 17:13   ` sdf
                     ` (5 more replies)
  2022-07-26 18:47 ` [PATCH bpf-next v1 2/3] bpf: Add xdp dynptrs Joanne Koong
  2022-07-26 18:47 ` [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
  2 siblings, 6 replies; 52+ messages in thread
From: Joanne Koong @ 2022-07-26 18:47 UTC
  To: bpf; +Cc: andrii, daniel, ast, Joanne Koong

Add skb dynptrs, which are dynptrs whose underlying pointer points
to a skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg no manual 'if' checks against data_end
are needed).

For bpf prog types that don't support writes on skb data, the dynptr is
read-only (writes and data slices are not permitted). Reads on the
dynptr may access data in the non-linear paged buffers; for writes and
data slices, however, if the data is in a paged buffer, the user must
first call bpf_skb_pull_data to pull the data into the linear portion.

Additionally, any helper call that changes the underlying packet buffer
(eg bpf_skb_pull_data) invalidates any data slices of the associated
dynptr.
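
For illustration (a sketch, not part of this patch's diff), the
expected pattern for writing to data that may be paged is to retry
after pulling; the offset used here is an arbitrary example:

  #include <errno.h>
  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int rewrite_bytes(struct __sk_buff *skb)
  {
          __u8 bytes[4] = { 0xde, 0xad, 0xbe, 0xef };
          struct bpf_dynptr ptr;
          __u32 off = 60;
          int err;

          bpf_dynptr_from_skb(skb, 0, &ptr);

          err = bpf_dynptr_write(&ptr, off, bytes, sizeof(bytes), 0);
          if (err == -EAGAIN) {
                  /* the target bytes are in paged data: pull them into
                   * the linear portion and retry. The dynptr itself is
                   * not invalidated by bpf_skb_pull_data.
                   */
                  if (bpf_skb_pull_data(skb, off + sizeof(bytes)))
                          return TC_ACT_SHOT;
                  err = bpf_dynptr_write(&ptr, off, bytes, sizeof(bytes), 0);
          }

          return err ? TC_ACT_SHOT : TC_ACT_OK;
  }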

Right now, skb dynptrs can only be constructed from skbs that are
the bpf program context - as such, there does not need to be any
reference tracking or release on skb dynptrs.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  8 ++++-
 include/linux/filter.h         |  4 +++
 include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
 kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
 net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
 tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
 7 files changed, 229 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 20c26aed7896..7fbd4324c848 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -407,11 +407,14 @@ enum bpf_type_flag {
 	/* Size is known at compile time. */
 	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to sk_buff */
+	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -2556,12 +2559,15 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_LOCAL,
 	/* Underlying data is a ringbuf record */
 	BPF_DYNPTR_TYPE_RINGBUF,
+	/* Underlying data is a sk_buff */
+	BPF_DYNPTR_TYPE_SKB,
 };
 
 void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
 		     enum bpf_dynptr_type type, u32 offset, u32 size);
 void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
 int bpf_dynptr_check_size(u32 size);
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
 
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5f21dc3c432..649063d9cbfd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1532,4 +1532,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 	return XDP_REDIRECT;
 }
 
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59a217ca2dfd..0730cd198a7f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5241,11 +5241,22 @@ union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  if *offset* + *len* extends into the skb's paged buffers, the user
+ *		       should manually pull the skb with bpf_skb_pull_data and then try again.
+ *
+ *		    *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
+ *			recompute the checksum for the packet after storing the bytes) and
+ *			**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
+ *			**->swhash** and *skb*\ **->l4hash** to 0).
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is invalid, or -EAGAIN if,
+ *		for skb-type dynptrs, the write extends into the skb's paged buffers.
  *
  * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
@@ -5253,10 +5264,19 @@ union bpf_attr {
  *
  *		*len* must be a statically known value. The returned data slice
  *		is invalidated whenever the dynptr is invalidated.
+ *
+ *		For skb-type dynptrs:
+ *		    * if *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with bpf_skb_pull_data
+ *		      and then try again.
+ *
+ *		    * the data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg bpf_skb_pull_data) is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5331,6 +5351,21 @@ union bpf_attr {
  *		**-EACCES** if the SYN cookie is not valid.
  *
  *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
+ *		context. Depending on program type, the dynptr may be read-only,
+ *		in which case trying to obtain a direct data slice to it through
+ *		bpf_dynptr_data will return an error.
+ *
+ *		Calls that change the *skb*'s underlying packet buffer
+ *		(eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
+ *		invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused; it must be 0 for now.
+ *	Return
+ *		0 on success or -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5541,6 +5576,7 @@ union bpf_attr {
 	FN(tcp_raw_gen_syncookie_ipv6),	\
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
+	FN(dynptr_from_skb),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1f961f9982d2..21a806057e9e 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1425,11 +1425,21 @@ static bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)
 	return ptr->size & DYNPTR_RDONLY_BIT;
 }
 
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
+{
+	ptr->size |= DYNPTR_RDONLY_BIT;
+}
+
 static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
 {
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
+static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
+{
+	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
+}
+
 static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_SIZE_MASK;
@@ -1500,6 +1510,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
 BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src,
 	   u32, offset, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
 	if (!src->data || flags)
@@ -1509,6 +1520,11 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 	if (err)
 		return err;
 
+	type = bpf_dynptr_get_type(src);
+
+	if (type == BPF_DYNPTR_TYPE_SKB)
+		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+
 	memcpy(dst, src->data + src->offset + offset, len);
 
 	return 0;
@@ -1528,15 +1544,38 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
 BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
 	   u32, len, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
-	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
+	if (!dst->data || bpf_dynptr_is_rdonly(dst))
 		return -EINVAL;
 
 	err = bpf_dynptr_check_off_len(dst, offset, len);
 	if (err)
 		return err;
 
+	type = bpf_dynptr_get_type(dst);
+
+	if (flags) {
+		if (type == BPF_DYNPTR_TYPE_SKB) {
+			if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
+				return -EINVAL;
+		} else {
+			return -EINVAL;
+		}
+	}
+
+	if (type == BPF_DYNPTR_TYPE_SKB) {
+		struct sk_buff *skb = dst->data;
+
+		/* if the data is paged, the caller needs to pull it first */
+		if (dst->offset + offset + len > skb->len - skb->data_len)
+			return -EAGAIN;
+
+		return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
+					     flags);
+	}
+
 	memcpy(dst->data + dst->offset + offset, src, len);
 
 	return 0;
@@ -1555,6 +1594,7 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
 
 BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
 	if (!ptr->data)
@@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 	if (bpf_dynptr_is_rdonly(ptr))
 		return 0;
 
+	type = bpf_dynptr_get_type(ptr);
+
+	if (type == BPF_DYNPTR_TYPE_SKB) {
+		struct sk_buff *skb = ptr->data;
+
+		/* if the data is paged, the caller needs to pull it first */
+		if (ptr->offset + offset + len > skb->len - skb->data_len)
+			return 0;
+
+		return (unsigned long)(skb->data + ptr->offset + offset);
+	}
+
 	return (unsigned long)(ptr->data + ptr->offset + offset);
 }
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0d523741a543..0838653eeb4e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -263,6 +263,7 @@ struct bpf_call_arg_meta {
 	u32 subprogno;
 	struct bpf_map_value_off_desc *kptr_off_desc;
 	u8 uninit_dynptr_regno;
+	enum bpf_dynptr_type type;
 };
 
 struct btf *btf_vmlinux;
@@ -678,6 +679,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_LOCAL;
 	case DYNPTR_TYPE_RINGBUF:
 		return BPF_DYNPTR_TYPE_RINGBUF;
+	case DYNPTR_TYPE_SKB:
+		return BPF_DYNPTR_TYPE_SKB;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -5820,12 +5823,14 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	return __check_ptr_off_reg(env, reg, regno, fixed_off_ok);
 }
 
-static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
+				       struct bpf_call_arg_meta *meta)
 {
 	struct bpf_func_state *state = func(env, reg);
 	int spi = get_spi(reg->off);
 
-	return state->stack[spi].spilled_ptr.id;
+	meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
+	meta->type = state->stack[spi].spilled_ptr.dynptr.type;
 }
 
 static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
@@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 				case DYNPTR_TYPE_RINGBUF:
 					err_extra = "ringbuf ";
 					break;
+				case DYNPTR_TYPE_SKB:
+					err_extra = "skb ";
+					break;
 				default:
 					break;
 				}
@@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 					verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
 					return -EFAULT;
 				}
-				/* Find the id of the dynptr we're tracking the reference of */
-				meta->ref_obj_id = stack_slot_get_id(env, reg);
+				/* Find the id and the type of the dynptr we're tracking
+				 * the reference of.
+				 */
+				stack_slot_get_dynptr_info(env, reg, meta);
 			}
 		}
 		break;
@@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
 	} else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
 		mark_reg_known_zero(env, regs, BPF_REG_0);
-		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
+		if (func_id == BPF_FUNC_dynptr_data &&
+		    meta.type == BPF_DYNPTR_TYPE_SKB)
+			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
+		else
+			regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
 		regs[BPF_REG_0].mem_size = meta.mem_size;
 	} else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
 		const struct btf_type *t;
@@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto patch_call_imm;
 		}
 
+		if (insn->imm == BPF_FUNC_dynptr_from_skb) {
+			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
+				insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
+			else
+				insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
+			insn_buf[1] = *insn;
+			cnt = 2;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta += cnt - 1;
+			env->prog = new_prog;
+			prog = new_prog;
+			insn = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
+
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
 		 * only.
diff --git a/net/core/filter.c b/net/core/filter.c
index 5669248aff25..312f99deb759 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1681,8 +1681,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
 		skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
 }
 
-BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
-	   const void *, from, u32, len, u64, flags)
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags)
 {
 	void *ptr;
 
@@ -1707,6 +1707,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
 	return 0;
 }
 
+BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
+	   const void *, from, u32, len, u64, flags)
+{
+	return __bpf_skb_store_bytes(skb, offset, from, len, flags);
+}
+
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.func		= bpf_skb_store_bytes,
 	.gpl_only	= false,
@@ -1718,8 +1724,7 @@ static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
-	   void *, to, u32, len)
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
 {
 	void *ptr;
 
@@ -1738,6 +1743,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
 	return -EFAULT;
 }
 
+BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
+	   void *, to, u32, len)
+{
+	return __bpf_skb_load_bytes(skb, offset, to, len);
+}
+
 static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
 	.func		= bpf_skb_load_bytes,
 	.gpl_only	= false,
@@ -1849,6 +1860,32 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+/* is_rdonly is set by the verifier */
+BPF_CALL_4(bpf_dynptr_from_skb, struct sk_buff *, skb, u64, flags,
+	   struct bpf_dynptr_kern *, ptr, u32, is_rdonly)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
+
+	if (is_rdonly)
+		bpf_dynptr_set_rdonly(ptr);
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_skb_proto = {
+	.func		= bpf_dynptr_from_skb,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT,
+};
+
 BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
 {
 	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
@@ -7808,6 +7845,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -7991,6 +8030,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -8186,6 +8227,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skc_lookup_tcp:
 		return &bpf_skc_lookup_tcp_proto;
 #endif
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -8224,6 +8267,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_skb_under_cgroup:
 		return &bpf_skb_under_cgroup_proto;
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 59a217ca2dfd..0730cd198a7f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5241,11 +5241,22 @@ union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  if *offset* + *len* extends into the skb's paged buffers, the user
+ *		       should manually pull the skb with bpf_skb_pull_data and then try again.
+ *
+ *		    *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
+ *			recompute the checksum for the packet after storing the bytes) and
+ *			**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
+ *			**->swhash** and *skb*\ **->l4hash** to 0).
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is invalid, or -EAGAIN if,
+ *		for skb-type dynptrs, the write extends into the skb's paged buffers.
  *
  * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
@@ -5253,10 +5264,19 @@ union bpf_attr {
  *
  *		*len* must be a statically known value. The returned data slice
  *		is invalidated whenever the dynptr is invalidated.
+ *
+ *		For skb-type dynptrs:
+ *		    * if *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with bpf_skb_pull_data
+ *		      and then try again.
+ *
+ *		    * the data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg bpf_skb_pull_data) is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5331,6 +5351,21 @@ union bpf_attr {
  *		**-EACCES** if the SYN cookie is not valid.
  *
  *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
+ *		context. Depending on program type, the dynptr may be read-only,
+ *		in which case trying to obtain a direct data slice to it through
+ *		bpf_dynptr_data will return an error.
+ *
+ *		Calls that change the *skb*'s underlying packet buffer
+ *		(eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
+ *		invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused; it must be 0 for now.
+ *	Return
+ *		0 on success or -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5541,6 +5576,7 @@ union bpf_attr {
 	FN(tcp_raw_gen_syncookie_ipv6),	\
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
+	FN(dynptr_from_skb),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2



* [PATCH bpf-next v1 2/3] bpf: Add xdp dynptrs
  2022-07-26 18:47 [PATCH bpf-next v1 0/3] Add skb + xdp dynptrs Joanne Koong
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
@ 2022-07-26 18:47 ` Joanne Koong
  2022-07-26 18:47 ` [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
  2 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-07-26 18:47 UTC
  To: bpf; +Cc: andrii, daniel, ast, Joanne Koong

Add xdp dynptrs, which are dynptrs whose underlying pointer points
to an xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (eg no manual 'if' checks against data_end
are needed).

Reads and writes on the dynptr can access and span fragments. For data
slices, direct access to data within a fragment is permitted, but
access across fragments is not.

Any helper call that changes the underlying packet buffer (eg
bpf_xdp_adjust_head) invalidates any data slices of the associated
dynptr.
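
For illustration (a sketch, not part of this patch's diff): reads copy
data even when the range spans fragments, while a data slice request
that would cross a fragment boundary returns NULL:

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp")
  int read_frags(struct xdp_md *xdp)
  {
          struct bpf_dynptr ptr;
          __u8 buf[64];

          bpf_dynptr_from_xdp(xdp, 0, &ptr);

          /* copies [0, 64) even if it spans fragments; fails with
           * -E2BIG if the packet is shorter than 64 bytes
           */
          if (bpf_dynptr_read(buf, sizeof(buf), &ptr, 0, 0))
                  return XDP_DROP;

          /* a direct slice is only valid within a single fragment;
           * this would return NULL if the range crossed fragments
           */
          if (!bpf_dynptr_data(&ptr, 0, sizeof(struct ethhdr)))
                  return XDP_DROP;

          return XDP_PASS;
  }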

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  8 +++++-
 include/linux/filter.h         |  3 +++
 include/uapi/linux/bpf.h       | 20 +++++++++++++--
 kernel/bpf/helpers.c           | 10 ++++++++
 kernel/bpf/verifier.c          |  7 +++++-
 net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
 tools/include/uapi/linux/bpf.h | 20 +++++++++++++--
 7 files changed, 102 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7fbd4324c848..77e2c94cce52 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -410,11 +410,15 @@ enum bpf_type_flag {
 	/* DYNPTR points to sk_buff */
 	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to xdp_buff */
+	DYNPTR_TYPE_XDP		= BIT(12 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
+				 | DYNPTR_TYPE_XDP)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -2561,6 +2565,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_RINGBUF,
 	/* Underlying data is a sk_buff */
 	BPF_DYNPTR_TYPE_SKB,
+	/* Underlying data is a xdp_buff */
+	BPF_DYNPTR_TYPE_XDP,
 };
 
 void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 649063d9cbfd..80f030239877 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1535,5 +1535,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
 int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
 			  u32 len, u64 flags);
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0730cd198a7f..559f9ba8b497 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5270,13 +5270,15 @@ union bpf_attr {
 *		      the user should manually pull the skb with bpf_skb_pull_data
 *		      and then try again.
  *
+ *		For skb-type and xdp-type dynptrs:
  *		    * the data slice is automatically invalidated anytime a
  *		      helper call that changes the underlying packet buffer
- *		      (eg bpf_skb_pull_data) is called.
+ *		      (eg bpf_skb_pull_data, bpf_xdp_adjust_head) is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5366,6 +5368,19 @@ union bpf_attr {
 *		*flags* is currently unused; it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg bpf_xdp_adjust_head) do not invalidate the dynptr, but they do
+ *		invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused; it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5577,6 +5592,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 21a806057e9e..3c6e349790f5 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1524,6 +1524,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 
 	if (type == BPF_DYNPTR_TYPE_SKB)
 		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	else if (type == BPF_DYNPTR_TYPE_XDP)
+		return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
 
 	memcpy(dst, src->data + src->offset + offset, len);
 
@@ -1574,6 +1576,8 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
 
 		return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
 					     flags);
+	} else if (type == BPF_DYNPTR_TYPE_XDP) {
+		return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
 	}
 
 	memcpy(dst->data + dst->offset + offset, src, len);
@@ -1617,6 +1621,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 			return 0;
 
 		return (unsigned long)(skb->data + ptr->offset + offset);
+	} else if (type == BPF_DYNPTR_TYPE_XDP) {
+		/* if the requested data is across fragments, then it cannot
+		 * be accessed directly - bpf_xdp_pointer will return NULL
+		 */
+		return (unsigned long)bpf_xdp_pointer(ptr->data,
+						      ptr->offset + offset, len);
 	}
 
 	return (unsigned long)(ptr->data + ptr->offset + offset);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0838653eeb4e..6bb1f68539a8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -681,6 +681,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_RINGBUF;
 	case DYNPTR_TYPE_SKB:
 		return BPF_DYNPTR_TYPE_SKB;
+	case DYNPTR_TYPE_XDP:
+		return BPF_DYNPTR_TYPE_XDP;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -6060,6 +6062,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 				case DYNPTR_TYPE_SKB:
 					err_extra = "skb ";
 					break;
+				case DYNPTR_TYPE_XDP:
+					err_extra = "xdp ";
+					break;
 				default:
 					break;
 				}
@@ -7417,7 +7422,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 	} else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		if (func_id == BPF_FUNC_dynptr_data &&
-		    meta.type == BPF_DYNPTR_TYPE_SKB)
+		    (meta.type == BPF_DYNPTR_TYPE_SKB || meta.type == BPF_DYNPTR_TYPE_XDP))
 			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
 		else
 			regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
diff --git a/net/core/filter.c b/net/core/filter.c
index 312f99deb759..3c8ba88eabb4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3825,7 +3825,29 @@ static const struct bpf_func_proto sk_skb_change_head_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_1(bpf_xdp_get_buff_len, struct  xdp_buff*, xdp)
+BPF_CALL_3(bpf_dynptr_from_xdp, struct xdp_buff*, xdp, u64, flags,
+	   struct bpf_dynptr_kern *, ptr)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, xdp, BPF_DYNPTR_TYPE_XDP, 0, xdp_get_buff_len(xdp));
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_xdp_proto = {
+	.func		= bpf_dynptr_from_xdp,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_XDP | MEM_UNINIT,
+};
+
+BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp)
 {
 	return xdp_get_buff_len(xdp);
 }
@@ -3927,7 +3949,7 @@ static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off,
 	}
 }
 
-static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	u32 size = xdp->data_end - xdp->data;
@@ -3958,8 +3980,7 @@ static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 	return offset + len <= size ? addr + offset : NULL;
 }
 
-BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -3975,6 +3996,12 @@ BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_load_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.func		= bpf_xdp_load_bytes,
 	.gpl_only	= false,
@@ -3985,8 +4012,7 @@ static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
-BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -4002,6 +4028,12 @@ BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_store_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_store_bytes_proto = {
 	.func		= bpf_xdp_store_bytes,
 	.gpl_only	= false,
@@ -8091,6 +8123,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_xdp:
+		return &bpf_dynptr_from_xdp_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0730cd198a7f..559f9ba8b497 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5270,13 +5270,15 @@ union bpf_attr {
 *		      the user should manually pull the skb with bpf_skb_pull_data
 *		      and then try again.
  *
+ *		For skb-type and xdp-type dynptrs:
  *		    * the data slice is automatically invalidated anytime a
  *		      helper call that changes the underlying packet buffer
- *		      (eg bpf_skb_pull_data) is called.
+ *		      (eg bpf_skb_pull_data, bpf_xdp_adjust_head) is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5366,6 +5368,19 @@ union bpf_attr {
 *		*flags* is currently unused; it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg bpf_xdp_adjust_head) do not invalidate the dynptr, but they do
+ *		invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused; it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5577,6 +5592,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2



* [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-07-26 18:47 [PATCH bpf-next v1 0/3] Add skb + xdp dynptrs Joanne Koong
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
  2022-07-26 18:47 ` [PATCH bpf-next v1 2/3] bpf: Add xdp dynptrs Joanne Koong
@ 2022-07-26 18:47 ` Joanne Koong
  2022-07-26 19:44   ` Zvi Effron
                     ` (2 more replies)
  2 siblings, 3 replies; 52+ messages in thread
From: Joanne Koong @ 2022-07-26 18:47 UTC
  To: bpf; +Cc: andrii, daniel, ast, Joanne Koong

Test skb and xdp dynptr functionality in the following ways:

1) progs/test_xdp.c
   * Change existing test to use dynptrs to parse xdp data

     There were no noticeable differences in user + system time between
     the original version vs. using dynptrs. Averaging the time for 10
     runs (run using "time ./test_progs -t xdp_bpf2bpf"):
         original version: 0.0449 sec
         with dynptrs: 0.0429 sec

2) progs/test_l4lb_noinline.c
   * Change existing test to use dynptrs to parse skb data

     There were no noticeable differences in user + system time between
     the original version vs. using dynptrs. Averaging the time for 10
     runs (run using "time ./test_progs -t l4lb_all/l4lb_noinline"):
         original version: 0.0502 sec
         with dynptrs: 0.055 sec

     For the number of processed verifier instructions:
         original version: 6284 insns
         with dynptrs: 2538 insns

3) progs/test_dynptr_xdp.c
   * Add sample code for tcp hdr opt lookup using dynptrs.
     This logic is lifted from a real-world use case of packet parsing in
     katran [0], a layer 4 load balancer

4) progs/dynptr_success.c
   * Add test case "test_skb_readonly" for testing attempts at writes /
     data slices on a prog type with read-only skb ctx.

5) progs/dynptr_fail.c
   * Add test cases "skb_invalid_data_slice" and
     "xdp_invalid_data_slice" for testing that helpers that modify the
     underlying packet buffer automatically invalidate the associated
     data slice.
   * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
     that prog types that do not support bpf_dynptr_from_skb/xdp don't
     have access to the API.

[0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
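
The new and converted tests can be run individually with test_progs,
eg (subtest names assumed from the files above):

  ./test_progs -t dynptr        # runtime + verifier-failure subtests
  ./test_progs -t dynptr_xdp    # tcp hdr opt lookup sample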

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 .../testing/selftests/bpf/prog_tests/dynptr.c |  85 ++++++++++---
 .../selftests/bpf/prog_tests/dynptr_xdp.c     |  49 ++++++++
 .../testing/selftests/bpf/progs/dynptr_fail.c |  76 ++++++++++++
 .../selftests/bpf/progs/dynptr_success.c      |  32 +++++
 .../selftests/bpf/progs/test_dynptr_xdp.c     | 115 ++++++++++++++++++
 .../selftests/bpf/progs/test_l4lb_noinline.c  |  71 +++++------
 tools/testing/selftests/bpf/progs/test_xdp.c  |  95 +++++++--------
 .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
 8 files changed, 416 insertions(+), 108 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_dynptr_xdp.c

diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
index bcf80b9f7c27..c40631f33c7b 100644
--- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
@@ -2,6 +2,7 @@
 /* Copyright (c) 2022 Facebook */
 
 #include <test_progs.h>
+#include <network_helpers.h>
 #include "dynptr_fail.skel.h"
 #include "dynptr_success.skel.h"
 
@@ -11,8 +12,8 @@ static char obj_log_buf[1048576];
 static struct {
 	const char *prog_name;
 	const char *expected_err_msg;
-} dynptr_tests[] = {
-	/* failure cases */
+} verifier_error_tests[] = {
+	/* these cases should trigger a verifier error */
 	{"ringbuf_missing_release1", "Unreleased reference id=1"},
 	{"ringbuf_missing_release2", "Unreleased reference id=2"},
 	{"ringbuf_missing_release_callback", "Unreleased reference id"},
@@ -42,11 +43,25 @@ static struct {
 	{"release_twice_callback", "arg 1 is an unacquired reference"},
 	{"dynptr_from_mem_invalid_api",
 		"Unsupported reg type fp for bpf_dynptr_from_mem data"},
+	{"skb_invalid_data_slice", "invalid mem access 'scalar'"},
+	{"xdp_invalid_data_slice", "invalid mem access 'scalar'"},
+	{"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb"},
+	{"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp"},
+};
+
+enum test_setup_type {
+	SETUP_SYSCALL_SLEEP,
+	SETUP_SKB_PROG,
+};
 
-	/* success cases */
-	{"test_read_write", NULL},
-	{"test_data_slice", NULL},
-	{"test_ringbuf", NULL},
+static struct {
+	const char *prog_name;
+	enum test_setup_type type;
+} runtime_tests[] = {
+	{"test_read_write", SETUP_SYSCALL_SLEEP},
+	{"test_data_slice", SETUP_SYSCALL_SLEEP},
+	{"test_ringbuf", SETUP_SYSCALL_SLEEP},
+	{"test_skb_readonly", SETUP_SKB_PROG},
 };
 
 static void verify_fail(const char *prog_name, const char *expected_err_msg)
@@ -85,7 +100,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
 	dynptr_fail__destroy(skel);
 }
 
-static void verify_success(const char *prog_name)
+static void run_tests(const char *prog_name, enum test_setup_type setup_type)
 {
 	struct dynptr_success *skel;
 	struct bpf_program *prog;
@@ -107,15 +122,42 @@ static void verify_success(const char *prog_name)
 	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
 		goto cleanup;
 
-	link = bpf_program__attach(prog);
-	if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
-		goto cleanup;
+	switch (setup_type) {
+	case SETUP_SYSCALL_SLEEP:
+		link = bpf_program__attach(prog);
+		if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+			goto cleanup;
 
-	usleep(1);
+		usleep(1);
 
-	ASSERT_EQ(skel->bss->err, 0, "err");
+		bpf_link__destroy(link);
+		break;
+	case SETUP_SKB_PROG:
+	{
+		int prog_fd, err;
+		char buf[64];
+
+		prog_fd = bpf_program__fd(prog);
+		if (CHECK_FAIL(prog_fd < 0))
+			goto cleanup;
+
+		LIBBPF_OPTS(bpf_test_run_opts, topts,
+			    .data_in = &pkt_v4,
+			    .data_size_in = sizeof(pkt_v4),
+			    .data_out = buf,
+			    .data_size_out = sizeof(buf),
+			    .repeat = 1,
+		);
 
-	bpf_link__destroy(link);
+		err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+		if (!ASSERT_OK(err, "test_run"))
+			goto cleanup;
+
+		break;
+	}
+	}
+	ASSERT_EQ(skel->bss->err, 0, "err");
 
 cleanup:
 	dynptr_success__destroy(skel);
@@ -125,14 +167,17 @@ void test_dynptr(void)
 {
 	int i;
 
-	for (i = 0; i < ARRAY_SIZE(dynptr_tests); i++) {
-		if (!test__start_subtest(dynptr_tests[i].prog_name))
+	for (i = 0; i < ARRAY_SIZE(verifier_error_tests); i++) {
+		if (!test__start_subtest(verifier_error_tests[i].prog_name))
+			continue;
+
+		verify_fail(verifier_error_tests[i].prog_name,
+			    verifier_error_tests[i].expected_err_msg);
+	}
+	for (i = 0; i < ARRAY_SIZE(runtime_tests); i++) {
+		if (!test__start_subtest(runtime_tests[i].prog_name))
 			continue;
 
-		if (dynptr_tests[i].expected_err_msg)
-			verify_fail(dynptr_tests[i].prog_name,
-				    dynptr_tests[i].expected_err_msg);
-		else
-			verify_success(dynptr_tests[i].prog_name);
+		run_tests(runtime_tests[i].prog_name, runtime_tests[i].type);
 	}
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c b/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
new file mode 100644
index 000000000000..ca775d126b60
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "test_dynptr_xdp.skel.h"
+#include "test_tcp_hdr_options.h"
+
+struct test_pkt {
+	struct ipv6_packet pk6_v6;
+	u8 options[16];
+} __packed;
+
+void test_dynptr_xdp(void)
+{
+	struct test_dynptr_xdp *skel;
+	char buf[128];
+	int err;
+
+	skel = test_dynptr_xdp__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	struct test_pkt pkt = {
+		.pk6_v6.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+		.pk6_v6.iph.nexthdr = IPPROTO_TCP,
+		.pk6_v6.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+		.pk6_v6.tcp.urg_ptr = 123,
+		.pk6_v6.tcp.doff = 9, /* 16 bytes of options */
+
+		.options = {
+			TCPOPT_MSS, 4, 0x05, 0xB4, TCPOPT_NOP, TCPOPT_NOP,
+			skel->rodata->tcp_hdr_opt_kind_tpr, 6, 0, 0, 0, 9, TCPOPT_EOL
+		},
+	};
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		    .data_in = &pkt,
+		    .data_size_in = sizeof(pkt),
+		    .data_out = buf,
+		    .data_size_out = sizeof(buf),
+		    .repeat = 3,
+	);
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.xdp_ingress_v6), &topts);
+	ASSERT_OK(err, "ipv6 test_run");
+	ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
+	ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval");
+
+	test_dynptr_xdp__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index c1814938a5fd..4e3f853b2d02 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -5,6 +5,7 @@
 #include <string.h>
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>
+#include <linux/if_ether.h>
 #include "bpf_misc.h"
 
 char _license[] SEC("license") = "GPL";
@@ -622,3 +623,78 @@ int dynptr_from_mem_invalid_api(void *ctx)
 
 	return 0;
 }
+
+/* The data slice is invalidated whenever a helper changes packet data */
+SEC("?tc")
+int skb_invalid_data_slice(struct __sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+	if (!hdr)
+		return SK_DROP;
+
+	hdr->h_proto = 12;
+
+	if (bpf_skb_pull_data(skb, skb->len))
+		return SK_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return SK_PASS;
+}
+
+/* The data slice is invalidated whenever a helper changes packet data */
+SEC("?xdp")
+int xdp_invalid_data_slice(struct xdp_md *xdp)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr1, *hdr2;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+	hdr1 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr1));
+	if (!hdr1)
+		return XDP_DROP;
+
+	hdr2 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr2));
+	if (!hdr2)
+		return XDP_DROP;
+
+	hdr1->h_proto = 12;
+	hdr2->h_proto = 12;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr1)))
+		return XDP_DROP;
+
+	/* this should fail */
+	hdr2->h_proto = 1;
+
+	return XDP_PASS;
+}
+
+/* Only supported prog type can create skb-type dynptrs */
+SEC("?raw_tp/sys_nanosleep")
+int skb_invalid_ctx(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	/* this should fail */
+	bpf_dynptr_from_skb(ctx, 0, &ptr);
+
+	return 0;
+}
+
+/* Only supported prog type can create xdp-type dynptrs */
+SEC("?raw_tp/sys_nanosleep")
+int xdp_invalid_ctx(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	/* this should fail */
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_success.c b/tools/testing/selftests/bpf/progs/dynptr_success.c
index a3a6103c8569..ffddd6ddc7fb 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_success.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_success.c
@@ -162,3 +162,35 @@ int test_ringbuf(void *ctx)
 	bpf_ringbuf_discard_dynptr(&ptr, 0);
 	return 0;
 }
+
+SEC("cgroup_skb/egress")
+int test_skb_readonly(void *ctx)
+{
+	__u8 write_data[2] = {1, 2};
+	struct bpf_dynptr ptr;
+	void *data;
+	int ret;
+
+	err = 1;
+
+	if (bpf_dynptr_from_skb(ctx, 0, &ptr))
+		return 0;
+	err++;
+
+	data = bpf_dynptr_data(&ptr, 0, 1);
+	if (data)
+		/* it's an error if data is not NULL since cgroup skbs
+		 * are read only
+		 */
+		return 0;
+	err++;
+
+	ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
+	/* since cgroup skbs are read only, writes should fail */
+	if (ret != -EINVAL)
+		return 0;
+
+	err = 0;
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
new file mode 100644
index 000000000000..c879dfb6370a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* This logic is lifted from a real-world use case of packet parsing, used in
+ * the open source library katran, a layer 4 load balancer.
+ *
+ * This test demonstrates how to parse packet contents using dynptrs.
+ *
+ * https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
+ */
+
+#include <string.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/tcp.h>
+#include <stdbool.h>
+#include <linux/ipv6.h>
+#include <linux/if_ether.h>
+#include "test_tcp_hdr_options.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* Arbitrarily picked unused value from IANA TCP Option Kind Numbers */
+const __u32 tcp_hdr_opt_kind_tpr = 0xB7;
+/* Length of the tcp header option */
+const __u32 tcp_hdr_opt_len_tpr = 6;
+/* maximum number of header options to check to look up server_id */
+const __u32 tcp_hdr_opt_max_opt_checks = 15;
+
+__u32 server_id;
+
+static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining,
+			 __u32 *server_id)
+{
+	__u8 *tcp_opt, kind, hdr_len;
+	__u8 *data;
+
+	data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) +
+			       sizeof(*server_id));
+	if (!data)
+		return -1;
+
+	kind = data[0];
+
+	if (kind == TCPOPT_EOL)
+		return -1;
+
+	if (kind == TCPOPT_NOP) {
+		*off += 1;
+		/* continue to the next option */
+		*hdr_bytes_remaining -= 1;
+
+		return 0;
+	}
+
+	if (*hdr_bytes_remaining < 2)
+		return -1;
+
+	hdr_len = data[1];
+	if (hdr_len > *hdr_bytes_remaining)
+		return -1;
+
+	if (kind == tcp_hdr_opt_kind_tpr) {
+		if (hdr_len != tcp_hdr_opt_len_tpr)
+			return -1;
+
+		memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
+		return 1;
+	}
+
+	*off += hdr_len;
+	*hdr_bytes_remaining -= hdr_len;
+
+	return 0;
+}
+
+SEC("xdp")
+int xdp_ingress_v6(struct xdp_md *xdp)
+{
+	__u8 hdr_bytes_remaining;
+	struct tcphdr *tcp_hdr;
+	__u8 tcp_hdr_opt_len;
+	int err = 0;
+	__u32 off;
+
+	struct bpf_dynptr ptr;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+
+	off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
+
+	tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr));
+	if (!tcp_hdr)
+		return XDP_DROP;
+
+	tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
+	if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
+		return XDP_DROP;
+
+	hdr_bytes_remaining = tcp_hdr_opt_len;
+
+	off += sizeof(struct tcphdr);
+
+	/* max number of bytes of options in tcp header is 40 bytes */
+	for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
+		err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id);
+
+		if (err || !hdr_bytes_remaining)
+			break;
+	}
+
+	if (!server_id)
+		return XDP_DROP;
+
+	return XDP_PASS;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
index c8bc0c6947aa..1fef7868ea8b 100644
--- a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
+++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
@@ -230,21 +230,18 @@ static __noinline bool get_packet_dst(struct real_definition **real,
 	return true;
 }
 
-static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
+static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off,
 				   struct packet_description *pckt)
 {
 	struct icmp6hdr *icmp_hdr;
 	struct ipv6hdr *ip6h;
 
-	icmp_hdr = data + off;
-	if (icmp_hdr + 1 > data_end)
+	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*ip6h));
+	if (!icmp_hdr)
 		return TC_ACT_SHOT;
 	if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
 		return TC_ACT_OK;
-	off += sizeof(struct icmp6hdr);
-	ip6h = data + off;
-	if (ip6h + 1 > data_end)
-		return TC_ACT_SHOT;
+	ip6h = (struct ipv6hdr *)(icmp_hdr + 1);
 	pckt->proto = ip6h->nexthdr;
 	pckt->flags |= F_ICMP;
 	memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16);
@@ -252,22 +249,19 @@ static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
 	return TC_ACT_UNSPEC;
 }
 
-static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
+static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off,
 				 struct packet_description *pckt)
 {
 	struct icmphdr *icmp_hdr;
 	struct iphdr *iph;
 
-	icmp_hdr = data + off;
-	if (icmp_hdr + 1 > data_end)
+	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*iph));
+	if (!icmp_hdr)
 		return TC_ACT_SHOT;
 	if (icmp_hdr->type != ICMP_DEST_UNREACH ||
 	    icmp_hdr->code != ICMP_FRAG_NEEDED)
 		return TC_ACT_OK;
-	off += sizeof(struct icmphdr);
-	iph = data + off;
-	if (iph + 1 > data_end)
-		return TC_ACT_SHOT;
+	iph = (struct iphdr *)(icmp_hdr + 1);
 	if (iph->ihl != 5)
 		return TC_ACT_SHOT;
 	pckt->proto = iph->protocol;
@@ -277,13 +271,13 @@ static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
 	return TC_ACT_UNSPEC;
 }
 
-static __noinline bool parse_udp(void *data, __u64 off, void *data_end,
+static __noinline bool parse_udp(struct bpf_dynptr *skb_ptr, __u64 off,
 				 struct packet_description *pckt)
 {
 	struct udphdr *udp;
-	udp = data + off;
 
-	if (udp + 1 > data_end)
+	udp = bpf_dynptr_data(skb_ptr, off, sizeof(*udp));
+	if (!udp)
 		return false;
 
 	if (!(pckt->flags & F_ICMP)) {
@@ -296,13 +290,13 @@ static __noinline bool parse_udp(void *data, __u64 off, void *data_end,
 	return true;
 }
 
-static __noinline bool parse_tcp(void *data, __u64 off, void *data_end,
+static __noinline bool parse_tcp(struct bpf_dynptr *skb_ptr, __u64 off,
 				 struct packet_description *pckt)
 {
 	struct tcphdr *tcp;
 
-	tcp = data + off;
-	if (tcp + 1 > data_end)
+	tcp = bpf_dynptr_data(skb_ptr, off, sizeof(*tcp));
+	if (!tcp)
 		return false;
 
 	if (tcp->syn)
@@ -318,12 +312,11 @@ static __noinline bool parse_tcp(void *data, __u64 off, void *data_end,
 	return true;
 }
 
-static __noinline int process_packet(void *data, __u64 off, void *data_end,
+static __noinline int process_packet(struct bpf_dynptr *skb_ptr,
+				     struct eth_hdr *eth, __u64 off,
 				     bool is_ipv6, struct __sk_buff *skb)
 {
-	void *pkt_start = (void *)(long)skb->data;
 	struct packet_description pckt = {};
-	struct eth_hdr *eth = pkt_start;
 	struct bpf_tunnel_key tkey = {};
 	struct vip_stats *data_stats;
 	struct real_definition *dst;
@@ -344,8 +337,8 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
 
 	tkey.tunnel_ttl = 64;
 	if (is_ipv6) {
-		ip6h = data + off;
-		if (ip6h + 1 > data_end)
+		ip6h = bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h));
+		if (!ip6h)
 			return TC_ACT_SHOT;
 
 		iph_len = sizeof(struct ipv6hdr);
@@ -356,7 +349,7 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
 		if (protocol == IPPROTO_FRAGMENT) {
 			return TC_ACT_SHOT;
 		} else if (protocol == IPPROTO_ICMPV6) {
-			action = parse_icmpv6(data, data_end, off, &pckt);
+			action = parse_icmpv6(skb_ptr, off, &pckt);
 			if (action >= 0)
 				return action;
 			off += IPV6_PLUS_ICMP_HDR;
@@ -365,10 +358,8 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
 			memcpy(pckt.dstv6, ip6h->daddr.s6_addr32, 16);
 		}
 	} else {
-		iph = data + off;
-		if (iph + 1 > data_end)
-			return TC_ACT_SHOT;
-		if (iph->ihl != 5)
+		iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph));
+		if (!iph || iph->ihl != 5)
 			return TC_ACT_SHOT;
 
 		protocol = iph->protocol;
@@ -379,7 +370,7 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
 		if (iph->frag_off & PCKT_FRAGMENTED)
 			return TC_ACT_SHOT;
 		if (protocol == IPPROTO_ICMP) {
-			action = parse_icmp(data, data_end, off, &pckt);
+			action = parse_icmp(skb_ptr, off, &pckt);
 			if (action >= 0)
 				return action;
 			off += IPV4_PLUS_ICMP_HDR;
@@ -391,10 +382,10 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
 	protocol = pckt.proto;
 
 	if (protocol == IPPROTO_TCP) {
-		if (!parse_tcp(data, off, data_end, &pckt))
+		if (!parse_tcp(skb_ptr, off, &pckt))
 			return TC_ACT_SHOT;
 	} else if (protocol == IPPROTO_UDP) {
-		if (!parse_udp(data, off, data_end, &pckt))
+		if (!parse_udp(skb_ptr, off, &pckt))
 			return TC_ACT_SHOT;
 	} else {
 		return TC_ACT_SHOT;
@@ -450,20 +441,22 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
 SEC("tc")
 int balancer_ingress(struct __sk_buff *ctx)
 {
-	void *data_end = (void *)(long)ctx->data_end;
-	void *data = (void *)(long)ctx->data;
-	struct eth_hdr *eth = data;
+	struct bpf_dynptr ptr;
+	struct eth_hdr *eth;
 	__u32 eth_proto;
 	__u32 nh_off;
 
 	nh_off = sizeof(struct eth_hdr);
-	if (data + nh_off > data_end)
+
+	bpf_dynptr_from_skb(ctx, 0, &ptr);
+	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
+	if (!eth)
 		return TC_ACT_SHOT;
 	eth_proto = eth->eth_proto;
 	if (eth_proto == bpf_htons(ETH_P_IP))
-		return process_packet(data, nh_off, data_end, false, ctx);
+		return process_packet(&ptr, eth, nh_off, false, ctx);
 	else if (eth_proto == bpf_htons(ETH_P_IPV6))
-		return process_packet(data, nh_off, data_end, true, ctx);
+		return process_packet(&ptr, eth, nh_off, true, ctx);
 	else
 		return TC_ACT_SHOT;
 }
diff --git a/tools/testing/selftests/bpf/progs/test_xdp.c b/tools/testing/selftests/bpf/progs/test_xdp.c
index d7a9a74b7245..2272b56a8777 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp.c
@@ -20,6 +20,12 @@
 #include <bpf/bpf_endian.h>
 #include "test_iptunnel_common.h"
 
+const size_t tcphdr_sz = sizeof(struct tcphdr);
+const size_t udphdr_sz = sizeof(struct udphdr);
+const size_t ethhdr_sz = sizeof(struct ethhdr);
+const size_t iphdr_sz = sizeof(struct iphdr);
+const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);
+
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 	__uint(max_entries, 256);
@@ -43,8 +49,7 @@ static __always_inline void count_tx(__u32 protocol)
 		*rxcnt_count += 1;
 }
 
-static __always_inline int get_dport(void *trans_data, void *data_end,
-				     __u8 protocol)
+static __always_inline int get_dport(void *trans_data, __u8 protocol)
 {
 	struct tcphdr *th;
 	struct udphdr *uh;
@@ -52,13 +57,9 @@ static __always_inline int get_dport(void *trans_data, void *data_end,
 	switch (protocol) {
 	case IPPROTO_TCP:
 		th = (struct tcphdr *)trans_data;
-		if (th + 1 > data_end)
-			return -1;
 		return th->dest;
 	case IPPROTO_UDP:
 		uh = (struct udphdr *)trans_data;
-		if (uh + 1 > data_end)
-			return -1;
 		return uh->dest;
 	default:
 		return 0;
@@ -75,14 +76,13 @@ static __always_inline void set_ethhdr(struct ethhdr *new_eth,
 	new_eth->h_proto = h_proto;
 }
 
-static __always_inline int handle_ipv4(struct xdp_md *xdp)
+static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
 {
-	void *data_end = (void *)(long)xdp->data_end;
-	void *data = (void *)(long)xdp->data;
+	struct bpf_dynptr new_xdp_ptr;
 	struct iptnl_info *tnl;
 	struct ethhdr *new_eth;
 	struct ethhdr *old_eth;
-	struct iphdr *iph = data + sizeof(struct ethhdr);
+	struct iphdr *iph;
 	__u16 *next_iph;
 	__u16 payload_len;
 	struct vip vip = {};
@@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
 	__u32 csum = 0;
 	int i;
 
-	if (iph + 1 > data_end)
+	iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
+			      iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
+	if (!iph)
 		return XDP_DROP;
 
-	dport = get_dport(iph + 1, data_end, iph->protocol);
+	dport = get_dport(iph + 1, iph->protocol);
 	if (dport == -1)
 		return XDP_DROP;
 
@@ -108,37 +110,33 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
 	if (!tnl || tnl->family != AF_INET)
 		return XDP_PASS;
 
-	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct iphdr)))
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)iphdr_sz))
 		return XDP_DROP;
 
-	data = (void *)(long)xdp->data;
-	data_end = (void *)(long)xdp->data_end;
-
-	new_eth = data;
-	iph = data + sizeof(*new_eth);
-	old_eth = data + sizeof(*iph);
-
-	if (new_eth + 1 > data_end ||
-	    old_eth + 1 > data_end ||
-	    iph + 1 > data_end)
+	bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
+	new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + iphdr_sz + ethhdr_sz);
+	if (!new_eth)
 		return XDP_DROP;
 
+	iph = (struct iphdr *)(new_eth + 1);
+	old_eth = (struct ethhdr *)(iph + 1);
+
 	set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IP));
 
 	iph->version = 4;
-	iph->ihl = sizeof(*iph) >> 2;
+	iph->ihl = iphdr_sz >> 2;
 	iph->frag_off =	0;
 	iph->protocol = IPPROTO_IPIP;
 	iph->check = 0;
 	iph->tos = 0;
-	iph->tot_len = bpf_htons(payload_len + sizeof(*iph));
+	iph->tot_len = bpf_htons(payload_len + iphdr_sz);
 	iph->daddr = tnl->daddr.v4;
 	iph->saddr = tnl->saddr.v4;
 	iph->ttl = 8;
 
 	next_iph = (__u16 *)iph;
 #pragma clang loop unroll(full)
-	for (i = 0; i < sizeof(*iph) >> 1; i++)
+	for (i = 0; i < iphdr_sz >> 1; i++)
 		csum += *next_iph++;
 
 	iph->check = ~((csum & 0xffff) + (csum >> 16));
@@ -148,22 +146,23 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
 	return XDP_TX;
 }
 
-static __always_inline int handle_ipv6(struct xdp_md *xdp)
+static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
 {
-	void *data_end = (void *)(long)xdp->data_end;
-	void *data = (void *)(long)xdp->data;
+	struct bpf_dynptr new_xdp_ptr;
 	struct iptnl_info *tnl;
 	struct ethhdr *new_eth;
 	struct ethhdr *old_eth;
-	struct ipv6hdr *ip6h = data + sizeof(struct ethhdr);
+	struct ipv6hdr *ip6h;
 	__u16 payload_len;
 	struct vip vip = {};
 	int dport;
 
-	if (ip6h + 1 > data_end)
+	ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
+			       ipv6hdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
+	if (!ip6h)
 		return XDP_DROP;
 
-	dport = get_dport(ip6h + 1, data_end, ip6h->nexthdr);
+	dport = get_dport(ip6h + 1, ip6h->nexthdr);
 	if (dport == -1)
 		return XDP_DROP;
 
@@ -178,26 +177,23 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
 	if (!tnl || tnl->family != AF_INET6)
 		return XDP_PASS;
 
-	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct ipv6hdr)))
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)ipv6hdr_sz))
 		return XDP_DROP;
 
-	data = (void *)(long)xdp->data;
-	data_end = (void *)(long)xdp->data_end;
-
-	new_eth = data;
-	ip6h = data + sizeof(*new_eth);
-	old_eth = data + sizeof(*ip6h);
-
-	if (new_eth + 1 > data_end || old_eth + 1 > data_end ||
-	    ip6h + 1 > data_end)
+	bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
+	new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + ipv6hdr_sz + ethhdr_sz);
+	if (!new_eth)
 		return XDP_DROP;
 
+	ip6h = (struct ipv6hdr *)(new_eth + 1);
+	old_eth = (struct ethhdr *)(ip6h + 1);
+
 	set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IPV6));
 
 	ip6h->version = 6;
 	ip6h->priority = 0;
 	memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl));
-	ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + sizeof(*ip6h));
+	ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + ipv6hdr_sz);
 	ip6h->nexthdr = IPPROTO_IPV6;
 	ip6h->hop_limit = 8;
 	memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6));
@@ -211,21 +207,22 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
 SEC("xdp")
 int _xdp_tx_iptunnel(struct xdp_md *xdp)
 {
-	void *data_end = (void *)(long)xdp->data_end;
-	void *data = (void *)(long)xdp->data;
-	struct ethhdr *eth = data;
+	struct bpf_dynptr ptr;
+	struct ethhdr *eth;
 	__u16 h_proto;
 
-	if (eth + 1 > data_end)
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+	eth = bpf_dynptr_data(&ptr, 0, ethhdr_sz);
+	if (!eth)
 		return XDP_DROP;
 
 	h_proto = eth->h_proto;
 
 	if (h_proto == bpf_htons(ETH_P_IP))
-		return handle_ipv4(xdp);
+		return handle_ipv4(xdp, &ptr);
 	else if (h_proto == bpf_htons(ETH_P_IPV6))
 
-		return handle_ipv6(xdp);
+		return handle_ipv6(xdp, &ptr);
 	else
 		return XDP_DROP;
 }
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
index 6118e3ab61fc..56c9f8a3ad3d 100644
--- a/tools/testing/selftests/bpf/test_tcp_hdr_options.h
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -50,6 +50,7 @@ struct linum_err {
 
 #define TCPOPT_EOL		0
 #define TCPOPT_NOP		1
+#define TCPOPT_MSS		2
 #define TCPOPT_WINDOW		3
 #define TCPOPT_EXP		254
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 52+ messages in thread
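
The selftest conversions in this patch all follow the same shape; condensed into a minimal standalone sketch (the dynptr helpers here are the ones proposed by this series, so this only builds against the patched UAPI/helper definitions):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char _license[] SEC("license") = "GPL";

SEC("tc")
int parse_eth(struct __sk_buff *ctx)
{
	struct bpf_dynptr ptr;
	struct ethhdr *eth;

	/* wrap the skb's packet data in a dynptr */
	bpf_dynptr_from_skb(ctx, 0, &ptr);

	/* request a bounded slice; a NULL return replaces the manual
	 * data + off > data_end checks of the direct-access style
	 */
	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
	if (!eth)
		return TC_ACT_SHOT;

	return eth->h_proto == bpf_htons(ETH_P_IP) ? TC_ACT_OK : TC_ACT_SHOT;
}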

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-07-26 18:47 ` [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
@ 2022-07-26 19:44   ` Zvi Effron
  2022-07-26 20:06     ` Joanne Koong
  2022-08-01 17:58   ` Andrii Nakryiko
  2022-08-01 19:12   ` Alexei Starovoitov
  2 siblings, 1 reply; 52+ messages in thread
From: Zvi Effron @ 2022-07-26 19:44 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Test skb and xdp dynptr functionality in the following ways:
>
> 1) progs/test_xdp.c
> * Change existing test to use dynptrs to parse xdp data
>
> There were no noticeable differences in user + system time between
> the original version and the version using dynptrs. Averaging the time for 10
> runs (run using "time ./test_progs -t xdp_bpf2bpf"):
> original version: 0.0449 sec
> with dynptrs: 0.0429 sec
>
> 2) progs/test_l4lb_noinline.c
> * Change existing test to use dynptrs to parse skb data
>
> There were no noticeable differences in user + system time between
> the original version and the version using dynptrs. Averaging the time for 10
> runs (run using "time ./test_progs -t l4lb_all/l4lb_noinline"):
> original version: 0.0502 sec
> with dynptrs: 0.055 sec
>
> For number of processed verifier instructions:
> original version: 6284 insns
> with dynptrs: 2538 insns
>
> 3) progs/test_dynptr_xdp.c
> * Add sample code that uses dynptrs to parse tcp hdr options for a
> server_id lookup.
> This logic is lifted from a real-world use case of packet parsing in
> katran [0], a layer 4 load balancer
>
> 4) progs/dynptr_success.c
> * Add test case "test_skb_readonly" for testing attempts at writes /
> data slices on a prog type with read-only skb ctx.
>
> 5) progs/dynptr_fail.c
> * Add test cases "skb_invalid_data_slice" and
> "xdp_invalid_data_slice" for testing that helpers that modify the
> underlying packet buffer automatically invalidate the associated
> data slice.
> * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
> that prog types that do not support bpf_dynptr_from_skb/xdp don't
> have access to the API.
>
> [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> [...]
> +char _license[] SEC("license") = "GPL";
> +
> +/* Arbitrarily picked unused value from IANA TCP Option Kind Numbers */
> +const __u32 tcp_hdr_opt_kind_tpr = 0xB7;

Should this instead be either 0xFD or 0xFE, as those are the two Kind numbers
allocated for experiments? Using a reserved value seems suboptimal, and
potentially risks updating one of the entries in [0] to have a double asterisk.

[0]: https://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1
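
For context, RFC 6994 structures the shared experimental kinds (253/254, i.e. 0xFD/0xFE) by placing a 16-bit experiment ID right after the kind and length bytes; a rough sketch of that layout (field names are illustrative, not from katran or this patch):

struct tcp_exp_opt {
	__u8   kind;    /* 253 or 254: the RFC 4727 experimental kinds */
	__u8   len;     /* total option length, incl. kind/len/exid */
	__be16 exid;    /* RFC 6994 experiment ID, disambiguates users */
	__u8   data[];  /* experiment payload, e.g. a server id */
} __attribute__((packed));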

> [...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-07-26 19:44   ` Zvi Effron
@ 2022-07-26 20:06     ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-07-26 20:06 UTC (permalink / raw)
  To: Zvi Effron; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Tue, Jul 26, 2022 at 12:44 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
> On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > [...]
> > +/* Arbitrarily picked unused value from IANA TCP Option Kind Numbers */
> > +const __u32 tcp_hdr_opt_kind_tpr = 0xB7;
>
> Should this instead be either 0xFD or 0xFE, as those are the two Kind numbers
> allocated for experiments? Using a reserved value seems suboptimal, and
> potentially risks updating one of the entries in [0] to have a double asterisk.

I used 0xB7 because that's what the katran library [1] uses, but after
reading through that IANA link, using 0xFD or 0xFE as the experimental
value makes sense to me. I'll change this for v2.

[1] https://github.com/facebookincubator/katran/blob/44ac98876280b8a76e6c90bf857b7b0afe1870f1/katran/lib/bpf/balancer_consts.h#L176-L177
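
If v2 picks this up, the change is presumably just the constant; a sketch of what that might look like (not the actual v2 patch):

/* kind from the RFC 4727 experimental range instead of the
 * unassigned 0xB7; 0xFE chosen arbitrarily for this sketch
 */
const __u32 tcp_hdr_opt_kind_tpr = 0xFE;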

>
> [0]: https://www.iana.org/assignments/tcp-parameters/tcp-parameters.xhtml#tcp-parameters-1
>
> > +/* Length of the tcp header option */
> > +const __u32 tcp_hdr_opt_len_tpr = 6;
> > +/* maximum number of header options to check to lookup server_id */
> > +const __u32 tcp_hdr_opt_max_opt_checks = 15;
> > +
> > +__u32 server_id;
> > +
> > +static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining,
> > + __u32 *server_id)
> > +{
> > + __u8 *tcp_opt, kind, hdr_len;
> > + __u8 *data;
> > +
> > + data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) +
> > + sizeof(*server_id));
> > + if (!data)
> > + return -1;
> > +
> > + kind = data[0];
> > +
> > + if (kind == TCPOPT_EOL)
> > + return -1;
> > +
> > + if (kind == TCPOPT_NOP) {
> > + *off += 1;
> > + /* continue to the next option */
> > + *hdr_bytes_remaining -= 1;
> > +
> > + return 0;
> > + }
> > +
> > + if (*hdr_bytes_remaining < 2)
> > + return -1;
> > +
> > + hdr_len = data[1];
> > + if (hdr_len > *hdr_bytes_remaining)
> > + return -1;
> > +
> > + if (kind == tcp_hdr_opt_kind_tpr) {
> > + if (hdr_len != tcp_hdr_opt_len_tpr)
> > + return -1;
> > +
> > + memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
> > + return 1;
> > + }
> > +
> > + *off += hdr_len;
> > + *hdr_bytes_remaining -= hdr_len;
> > +
> > + return 0;
> > +}
> > +
> > +SEC("xdp")
> > +int xdp_ingress_v6(struct xdp_md *xdp)
> > +{
> > + __u8 hdr_bytes_remaining;
> > + struct tcphdr *tcp_hdr;
> > + __u8 tcp_hdr_opt_len;
> > + int err = 0;
> > + __u32 off;
> > +
> > + struct bpf_dynptr ptr;
> > +
> > + bpf_dynptr_from_xdp(xdp, 0, &ptr);
> > +
> > + off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
> > +
> > + tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr));
> > + if (!tcp_hdr)
> > + return XDP_DROP;
> > +
> > + tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
> > + if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
> > + return XDP_DROP;
> > +
> > + hdr_bytes_remaining = tcp_hdr_opt_len;
> > +
> > + off += sizeof(struct tcphdr);
> > +
> > + /* max number of bytes of options in tcp header is 40 bytes */
> > + for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
> > + err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id);
> > +
> > + if (err || !hdr_bytes_remaining)
> > + break;
> > + }
> > +
> > + if (!server_id)
> > + return XDP_DROP;
> > +
> > + return XDP_PASS;
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> > index c8bc0c6947aa..1fef7868ea8b 100644
> > --- a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> > +++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> > @@ -230,21 +230,18 @@ static __noinline bool get_packet_dst(struct real_definition **real,
> > return true;
> > }
> >
> > -static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
> > +static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off,
> > struct packet_description *pckt)
> > {
> > struct icmp6hdr *icmp_hdr;
> > struct ipv6hdr *ip6h;
> >
> > - icmp_hdr = data + off;
> > - if (icmp_hdr + 1 > data_end)
> > + icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*ip6h));
> > + if (!icmp_hdr)
> > return TC_ACT_SHOT;
> > if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
> > return TC_ACT_OK;
> > - off += sizeof(struct icmp6hdr);
> > - ip6h = data + off;
> > - if (ip6h + 1 > data_end)
> > - return TC_ACT_SHOT;
> > + ip6h = (struct ipv6hdr *)(icmp_hdr + 1);
> > pckt->proto = ip6h->nexthdr;
> > pckt->flags |= F_ICMP;
> > memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16);
> > @@ -252,22 +249,19 @@ static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
> > return TC_ACT_UNSPEC;
> > }
> >
> > -static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
> > +static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off,
> > struct packet_description *pckt)
> > {
> > struct icmphdr *icmp_hdr;
> > struct iphdr *iph;
> >
> > - icmp_hdr = data + off;
> > - if (icmp_hdr + 1 > data_end)
> > + icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*iph));
> > + if (!icmp_hdr)
> > return TC_ACT_SHOT;
> > if (icmp_hdr->type != ICMP_DEST_UNREACH ||
> > icmp_hdr->code != ICMP_FRAG_NEEDED)
> > return TC_ACT_OK;
> > - off += sizeof(struct icmphdr);
> > - iph = data + off;
> > - if (iph + 1 > data_end)
> > - return TC_ACT_SHOT;
> > + iph = (struct iphdr *)(icmp_hdr + 1);
> > if (iph->ihl != 5)
> > return TC_ACT_SHOT;
> > pckt->proto = iph->protocol;
> > @@ -277,13 +271,13 @@ static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
> > return TC_ACT_UNSPEC;
> > }
> >
> > -static __noinline bool parse_udp(void *data, __u64 off, void *data_end,
> > +static __noinline bool parse_udp(struct bpf_dynptr *skb_ptr, __u64 off,
> > struct packet_description *pckt)
> > {
> > struct udphdr *udp;
> > - udp = data + off;
> >
> > - if (udp + 1 > data_end)
> > + udp = bpf_dynptr_data(skb_ptr, off, sizeof(*udp));
> > + if (!udp)
> > return false;
> >
> > if (!(pckt->flags & F_ICMP)) {
> > @@ -296,13 +290,13 @@ static __noinline bool parse_udp(void *data, __u64 off, void *data_end,
> > return true;
> > }
> >
> > -static __noinline bool parse_tcp(void *data, __u64 off, void *data_end,
> > +static __noinline bool parse_tcp(struct bpf_dynptr *skb_ptr, __u64 off,
> > struct packet_description *pckt)
> > {
> > struct tcphdr *tcp;
> >
> > - tcp = data + off;
> > - if (tcp + 1 > data_end)
> > + tcp = bpf_dynptr_data(skb_ptr, off, sizeof(*tcp));
> > + if (!tcp)
> > return false;
> >
> > if (tcp->syn)
> > @@ -318,12 +312,11 @@ static __noinline bool parse_tcp(void *data, __u64 off, void *data_end,
> > return true;
> > }
> >
> > -static __noinline int process_packet(void *data, __u64 off, void *data_end,
> > +static __noinline int process_packet(struct bpf_dynptr *skb_ptr,
> > + struct eth_hdr *eth, __u64 off,
> > bool is_ipv6, struct __sk_buff *skb)
> > {
> > - void *pkt_start = (void *)(long)skb->data;
> > struct packet_description pckt = {};
> > - struct eth_hdr *eth = pkt_start;
> > struct bpf_tunnel_key tkey = {};
> > struct vip_stats *data_stats;
> > struct real_definition *dst;
> > @@ -344,8 +337,8 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
> >
> > tkey.tunnel_ttl = 64;
> > if (is_ipv6) {
> > - ip6h = data + off;
> > - if (ip6h + 1 > data_end)
> > + ip6h = bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h));
> > + if (!ip6h)
> > return TC_ACT_SHOT;
> >
> > iph_len = sizeof(struct ipv6hdr);
> > @@ -356,7 +349,7 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
> > if (protocol == IPPROTO_FRAGMENT) {
> > return TC_ACT_SHOT;
> > } else if (protocol == IPPROTO_ICMPV6) {
> > - action = parse_icmpv6(data, data_end, off, &pckt);
> > + action = parse_icmpv6(skb_ptr, off, &pckt);
> > if (action >= 0)
> > return action;
> > off += IPV6_PLUS_ICMP_HDR;
> > @@ -365,10 +358,8 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
> > memcpy(pckt.dstv6, ip6h->daddr.s6_addr32, 16);
> > }
> > } else {
> > - iph = data + off;
> > - if (iph + 1 > data_end)
> > - return TC_ACT_SHOT;
> > - if (iph->ihl != 5)
> > + iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph));
> > + if (!iph || iph->ihl != 5)
> > return TC_ACT_SHOT;
> >
> > protocol = iph->protocol;
> > @@ -379,7 +370,7 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
> > if (iph->frag_off & PCKT_FRAGMENTED)
> > return TC_ACT_SHOT;
> > if (protocol == IPPROTO_ICMP) {
> > - action = parse_icmp(data, data_end, off, &pckt);
> > + action = parse_icmp(skb_ptr, off, &pckt);
> > if (action >= 0)
> > return action;
> > off += IPV4_PLUS_ICMP_HDR;
> > @@ -391,10 +382,10 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
> > protocol = pckt.proto;
> >
> > if (protocol == IPPROTO_TCP) {
> > - if (!parse_tcp(data, off, data_end, &pckt))
> > + if (!parse_tcp(skb_ptr, off, &pckt))
> > return TC_ACT_SHOT;
> > } else if (protocol == IPPROTO_UDP) {
> > - if (!parse_udp(data, off, data_end, &pckt))
> > + if (!parse_udp(skb_ptr, off, &pckt))
> > return TC_ACT_SHOT;
> > } else {
> > return TC_ACT_SHOT;
> > @@ -450,20 +441,22 @@ static __noinline int process_packet(void *data, __u64 off, void *data_end,
> > SEC("tc")
> > int balancer_ingress(struct __sk_buff *ctx)
> > {
> > - void *data_end = (void *)(long)ctx->data_end;
> > - void *data = (void *)(long)ctx->data;
> > - struct eth_hdr *eth = data;
> > + struct bpf_dynptr ptr;
> > + struct eth_hdr *eth;
> > __u32 eth_proto;
> > __u32 nh_off;
> >
> > nh_off = sizeof(struct eth_hdr);
> > - if (data + nh_off > data_end)
> > +
> > + bpf_dynptr_from_skb(ctx, 0, &ptr);
> > + eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
> > + if (!eth)
> > return TC_ACT_SHOT;
> > eth_proto = eth->eth_proto;
> > if (eth_proto == bpf_htons(ETH_P_IP))
> > - return process_packet(data, nh_off, data_end, false, ctx);
> > + return process_packet(&ptr, eth, nh_off, false, ctx);
> > else if (eth_proto == bpf_htons(ETH_P_IPV6))
> > - return process_packet(data, nh_off, data_end, true, ctx);
> > + return process_packet(&ptr, eth, nh_off, true, ctx);
> > else
> > return TC_ACT_SHOT;
> > }
> > diff --git a/tools/testing/selftests/bpf/progs/test_xdp.c b/tools/testing/selftests/bpf/progs/test_xdp.c
> > index d7a9a74b7245..2272b56a8777 100644
> > --- a/tools/testing/selftests/bpf/progs/test_xdp.c
> > +++ b/tools/testing/selftests/bpf/progs/test_xdp.c
> > @@ -20,6 +20,12 @@
> > #include <bpf/bpf_endian.h>
> > #include "test_iptunnel_common.h"
> >
> > +const size_t tcphdr_sz = sizeof(struct tcphdr);
> > +const size_t udphdr_sz = sizeof(struct udphdr);
> > +const size_t ethhdr_sz = sizeof(struct ethhdr);
> > +const size_t iphdr_sz = sizeof(struct iphdr);
> > +const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);
> > +
> > struct {
> > __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > __uint(max_entries, 256);
> > @@ -43,8 +49,7 @@ static __always_inline void count_tx(__u32 protocol)
> > *rxcnt_count += 1;
> > }
> >
> > -static __always_inline int get_dport(void *trans_data, void *data_end,
> > - __u8 protocol)
> > +static __always_inline int get_dport(void *trans_data, __u8 protocol)
> > {
> > struct tcphdr *th;
> > struct udphdr *uh;
> > @@ -52,13 +57,9 @@ static __always_inline int get_dport(void *trans_data, void *data_end,
> > switch (protocol) {
> > case IPPROTO_TCP:
> > th = (struct tcphdr *)trans_data;
> > - if (th + 1 > data_end)
> > - return -1;
> > return th->dest;
> > case IPPROTO_UDP:
> > uh = (struct udphdr *)trans_data;
> > - if (uh + 1 > data_end)
> > - return -1;
> > return uh->dest;
> > default:
> > return 0;
> > @@ -75,14 +76,13 @@ static __always_inline void set_ethhdr(struct ethhdr *new_eth,
> > new_eth->h_proto = h_proto;
> > }
> >
> > -static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
> > {
> > - void *data_end = (void *)(long)xdp->data_end;
> > - void *data = (void *)(long)xdp->data;
> > + struct bpf_dynptr new_xdp_ptr;
> > struct iptnl_info *tnl;
> > struct ethhdr *new_eth;
> > struct ethhdr *old_eth;
> > - struct iphdr *iph = data + sizeof(struct ethhdr);
> > + struct iphdr *iph;
> > __u16 *next_iph;
> > __u16 payload_len;
> > struct vip vip = {};
> > @@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > __u32 csum = 0;
> > int i;
> >
> > - if (iph + 1 > data_end)
> > + iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> > + iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
> > + if (!iph)
> > return XDP_DROP;
> >
> > - dport = get_dport(iph + 1, data_end, iph->protocol);
> > + dport = get_dport(iph + 1, iph->protocol);
> > if (dport == -1)
> > return XDP_DROP;
> >
> > @@ -108,37 +110,33 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > if (!tnl || tnl->family != AF_INET)
> > return XDP_PASS;
> >
> > - if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct iphdr)))
> > + if (bpf_xdp_adjust_head(xdp, 0 - (int)iphdr_sz))
> > return XDP_DROP;
> >
> > - data = (void *)(long)xdp->data;
> > - data_end = (void *)(long)xdp->data_end;
> > -
> > - new_eth = data;
> > - iph = data + sizeof(*new_eth);
> > - old_eth = data + sizeof(*iph);
> > -
> > - if (new_eth + 1 > data_end ||
> > - old_eth + 1 > data_end ||
> > - iph + 1 > data_end)
> > + bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
> > + new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + iphdr_sz + ethhdr_sz);
> > + if (!new_eth)
> > return XDP_DROP;
> >
> > + iph = (struct iphdr *)(new_eth + 1);
> > + old_eth = (struct ethhdr *)(iph + 1);
> > +
> > set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IP));
> >
> > iph->version = 4;
> > - iph->ihl = sizeof(*iph) >> 2;
> > + iph->ihl = iphdr_sz >> 2;
> > iph->frag_off = 0;
> > iph->protocol = IPPROTO_IPIP;
> > iph->check = 0;
> > iph->tos = 0;
> > - iph->tot_len = bpf_htons(payload_len + sizeof(*iph));
> > + iph->tot_len = bpf_htons(payload_len + iphdr_sz);
> > iph->daddr = tnl->daddr.v4;
> > iph->saddr = tnl->saddr.v4;
> > iph->ttl = 8;
> >
> > next_iph = (__u16 *)iph;
> > #pragma clang loop unroll(full)
> > - for (i = 0; i < sizeof(*iph) >> 1; i++)
> > + for (i = 0; i < iphdr_sz >> 1; i++)
> > csum += *next_iph++;
> >
> > iph->check = ~((csum & 0xffff) + (csum >> 16));
> > @@ -148,22 +146,23 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > return XDP_TX;
> > }
> >
> > -static __always_inline int handle_ipv6(struct xdp_md *xdp)
> > +static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
> > {
> > - void *data_end = (void *)(long)xdp->data_end;
> > - void *data = (void *)(long)xdp->data;
> > + struct bpf_dynptr new_xdp_ptr;
> > struct iptnl_info *tnl;
> > struct ethhdr *new_eth;
> > struct ethhdr *old_eth;
> > - struct ipv6hdr *ip6h = data + sizeof(struct ethhdr);
> > + struct ipv6hdr *ip6h;
> > __u16 payload_len;
> > struct vip vip = {};
> > int dport;
> >
> > - if (ip6h + 1 > data_end)
> > + ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> > + ipv6hdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
> > + if (!ip6h)
> > return XDP_DROP;
> >
> > - dport = get_dport(ip6h + 1, data_end, ip6h->nexthdr);
> > + dport = get_dport(ip6h + 1, ip6h->nexthdr);
> > if (dport == -1)
> > return XDP_DROP;
> >
> > @@ -178,26 +177,23 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
> > if (!tnl || tnl->family != AF_INET6)
> > return XDP_PASS;
> >
> > - if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(struct ipv6hdr)))
> > + if (bpf_xdp_adjust_head(xdp, 0 - (int)ipv6hdr_sz))
> > return XDP_DROP;
> >
> > - data = (void *)(long)xdp->data;
> > - data_end = (void *)(long)xdp->data_end;
> > -
> > - new_eth = data;
> > - ip6h = data + sizeof(*new_eth);
> > - old_eth = data + sizeof(*ip6h);
> > -
> > - if (new_eth + 1 > data_end || old_eth + 1 > data_end ||
> > - ip6h + 1 > data_end)
> > + bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
> > + new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + ipv6hdr_sz + ethhdr_sz);
> > + if (!new_eth)
> > return XDP_DROP;
> >
> > + ip6h = (struct ipv6hdr *)(new_eth + 1);
> > + old_eth = (struct ethhdr *)(ip6h + 1);
> > +
> > set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IPV6));
> >
> > ip6h->version = 6;
> > ip6h->priority = 0;
> > memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl));
> > - ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + sizeof(*ip6h));
> > + ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + ipv6hdr_sz);
> > ip6h->nexthdr = IPPROTO_IPV6;
> > ip6h->hop_limit = 8;
> > memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6));
> > @@ -211,21 +207,22 @@ static __always_inline int handle_ipv6(struct xdp_md *xdp)
> > SEC("xdp")
> > int _xdp_tx_iptunnel(struct xdp_md *xdp)
> > {
> > - void *data_end = (void *)(long)xdp->data_end;
> > - void *data = (void *)(long)xdp->data;
> > - struct ethhdr *eth = data;
> > + struct bpf_dynptr ptr;
> > + struct ethhdr *eth;
> > __u16 h_proto;
> >
> > - if (eth + 1 > data_end)
> > + bpf_dynptr_from_xdp(xdp, 0, &ptr);
> > + eth = bpf_dynptr_data(&ptr, 0, ethhdr_sz);
> > + if (!eth)
> > return XDP_DROP;
> >
> > h_proto = eth->h_proto;
> >
> > if (h_proto == bpf_htons(ETH_P_IP))
> > - return handle_ipv4(xdp);
> > + return handle_ipv4(xdp, &ptr);
> > else if (h_proto == bpf_htons(ETH_P_IPV6))
> >
> > - return handle_ipv6(xdp);
> > + return handle_ipv6(xdp, &ptr);
> > else
> > return XDP_DROP;
> > }
> > diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
> > index 6118e3ab61fc..56c9f8a3ad3d 100644
> > --- a/tools/testing/selftests/bpf/test_tcp_hdr_options.h
> > +++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
> > @@ -50,6 +50,7 @@ struct linum_err {
> >
> > #define TCPOPT_EOL 0
> > #define TCPOPT_NOP 1
> > +#define TCPOPT_MSS 2
> > #define TCPOPT_WINDOW 3
> > #define TCPOPT_EXP 254
> >
> > --
> > 2.30.2
> >

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
@ 2022-07-27 17:13   ` sdf
  2022-07-28 16:49     ` Joanne Koong
  2022-07-28 17:45   ` Hao Luo
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 52+ messages in thread
From: sdf @ 2022-07-27 17:13 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

On 07/26, Joanne Koong wrote:
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).

> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only (writes and data slices are not permitted). For reads on the
> dynptr, this includes reading into data in the non-linear paged buffers
> but for writes and data slices, if the data is in a paged buffer, the
> user must first call bpf_skb_pull_data to pull the data into the linear
> portion.
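
So a read into a stack buffer should work the same no matter where the
bytes live - eg a sketch like this (assuming the helpers in this patch)
needs no pull first:

	char buf[16];

	/* copies out of the paged area too, via __bpf_skb_load_bytes() */
	if (bpf_dynptr_read(buf, sizeof(buf), &ptr, off, 0))
		return TC_ACT_SHOT;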

> Additionally, any helper calls that change the underlying packet buffer
> (eg bpf_skb_pull_data) invalidates any data slices of the associated
> dynptr.
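
From the prog's point of view this means a data slice always has to be
re-fetched after a pull - a sketch, assuming the helpers in this patch
(error checking elided):

	struct bpf_dynptr ptr;
	struct ethhdr *eth;

	bpf_dynptr_from_skb(skb, 0, &ptr);
	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
	/* ... */
	bpf_skb_pull_data(skb, 60);
	/* the pull invalidated the old slice; eth must be re-fetched
	 * before it can be dereferenced again
	 */
	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));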

> Right now, skb dynptrs can only be constructed from skbs that are
> the bpf program context - as such, there does not need to be any
> reference tracking or release on skb dynptrs.

> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>   include/linux/bpf.h            |  8 ++++-
>   include/linux/filter.h         |  4 +++
>   include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
>   kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
>   kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
>   net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
>   tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
>   7 files changed, 229 insertions(+), 17 deletions(-)

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 20c26aed7896..7fbd4324c848 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -407,11 +407,14 @@ enum bpf_type_flag {
>   	/* Size is known at compile time. */
>   	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),

> +	/* DYNPTR points to sk_buff */
> +	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
> +
>   	__BPF_TYPE_FLAG_MAX,
>   	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
>   };

> -#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
> -#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)

>   /* Max number of base types. */
>   #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
> @@ -2556,12 +2559,15 @@ enum bpf_dynptr_type {
>   	BPF_DYNPTR_TYPE_LOCAL,
>   	/* Underlying data is a ringbuf record */
>   	BPF_DYNPTR_TYPE_RINGBUF,
> +	/* Underlying data is a sk_buff */
> +	BPF_DYNPTR_TYPE_SKB,
>   };

>   void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
>   		     enum bpf_dynptr_type type, u32 offset, u32 size);
>   void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
>   int bpf_dynptr_check_size(u32 size);
> +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);

>   #ifdef CONFIG_BPF_LSM
>   void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index a5f21dc3c432..649063d9cbfd 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1532,4 +1532,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>   	return XDP_REDIRECT;
>   }

> +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
> +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> +			  u32 len, u64 flags);
> +
>   #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59a217ca2dfd..0730cd198a7f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5241,11 +5241,22 @@ union bpf_attr {
>    *	Description
>    *		Write *len* bytes from *src* into *dst*, starting from *offset*
>    *		into *dst*.
> - *		*flags* is currently unused.
> + *
> + *		*flags* must be 0 except for skb-type dynptrs.
> + *
> + *		For skb-type dynptrs:
> + *		    *  if *offset* + *len* extends into the skb's paged buffers, the user
> + *		       should manually pull the skb with bpf_skb_pull_data and then try again.
> + *
> + *		    *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
> + *			recompute the checksum for the packet after storing the bytes) and
> + *			**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + *			**->swhash** and *skb*\ **->l4hash** to 0).
>    *	Return
>    *		0 on success, -E2BIG if *offset* + *len* exceeds the length
>    *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> - *		is a read-only dynptr or if *flags* is not 0.
> + *		is a read-only dynptr or if *flags* is not correct, -EAGAIN if for
> + *		skb-type dynptrs the write extends into the skb's paged buffers.
>    *
>    * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
>    *	Description
> @@ -5253,10 +5264,19 @@ union bpf_attr {
>    *
>    *		*len* must be a statically known value. The returned data slice
>    *		is invalidated whenever the dynptr is invalidated.
> + *
> + *		For skb-type dynptrs:
> + *		    * if *offset* + *len* extends into the skb's paged buffers,
> + *		      the user should manually pull the skb with bpf_skb_pull_data and then
> + *		      try again.
> + *
> + *		    * the data slice is automatically invalidated anytime a
> + *		      helper call that changes the underlying packet buffer
> + *		      (eg bpf_skb_pull_data) is called.
>    *	Return
>    *		Pointer to the underlying dynptr data, NULL if the dynptr is
>    *		read-only, if the dynptr is invalid, or if the offset and length
> - *		is out of bounds.
> + *		is out of bounds or in a paged buffer for skb-type dynptrs.
>    *
>    * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>    *	Description
> @@ -5331,6 +5351,21 @@ union bpf_attr {
>    *		**-EACCES** if the SYN cookie is not valid.
>    *
>    *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
> + *
> + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
> + *	Description
> + *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
> + *		context. Depending on program type, the dynptr may be read-only,
> + *		in which case trying to obtain a direct data slice to it through
> + *		bpf_dynptr_data will return an error.
> + *
> + *		Calls that change the *skb*'s underlying packet buffer
> + *		(eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
> + *		invalidate any data slices associated with the dynptr.
> + *
> + *		*flags* is currently unused; it must be 0 for now.
> + *	Return
> + *		0 on success or -EINVAL if flags is not 0.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -5541,6 +5576,7 @@ union bpf_attr {
>   	FN(tcp_raw_gen_syncookie_ipv6),	\
>   	FN(tcp_raw_check_syncookie_ipv4),	\
>   	FN(tcp_raw_check_syncookie_ipv6),	\
> +	FN(dynptr_from_skb),		\
>   	/* */

>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 1f961f9982d2..21a806057e9e 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1425,11 +1425,21 @@ static bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)
>   	return ptr->size & DYNPTR_RDONLY_BIT;
>   }

> +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
> +{
> +	ptr->size |= DYNPTR_RDONLY_BIT;
> +}
> +
>   static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
>   {
>   	ptr->size |= type << DYNPTR_TYPE_SHIFT;
>   }

> +static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
> +{
> +	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
> +}
> +
>   static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
>   {
>   	return ptr->size & DYNPTR_SIZE_MASK;
> @@ -1500,6 +1510,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
>   BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src,
>   	   u32, offset, u64, flags)
>   {
> +	enum bpf_dynptr_type type;
>   	int err;

>   	if (!src->data || flags)
> @@ -1509,6 +1520,11 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
>   	if (err)
>   		return err;

> +	type = bpf_dynptr_get_type(src);
> +
> +	if (type == BPF_DYNPTR_TYPE_SKB)
> +		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> +
>   	memcpy(dst, src->data + src->offset + offset, len);

>   	return 0;
> @@ -1528,15 +1544,38 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
>   BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
>   	   u32, len, u64, flags)
>   {
> +	enum bpf_dynptr_type type;
>   	int err;

> -	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> +	if (!dst->data || bpf_dynptr_is_rdonly(dst))
>   		return -EINVAL;

>   	err = bpf_dynptr_check_off_len(dst, offset, len);
>   	if (err)
>   		return err;

> +	type = bpf_dynptr_get_type(dst);
> +
> +	if (flags) {
> +		if (type == BPF_DYNPTR_TYPE_SKB) {
> +			if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
> +				return -EINVAL;
> +		} else {
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (type == BPF_DYNPTR_TYPE_SKB) {
> +		struct sk_buff *skb = dst->data;
> +
> +		/* if the data is paged, the caller needs to pull it first */
> +		if (dst->offset + offset + len > skb->len - skb->data_len)

Use skb_headlen instead of 'skb->len - skb->data_len' ?
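
For reference, skb_headlen() in include/linux/skbuff.h is exactly this
computation, so the substitution should be purely cosmetic:

static inline unsigned int skb_headlen(const struct sk_buff *skb)
{
	return skb->len - skb->data_len;
}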

> +			return -EAGAIN;
> +
> +		return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
> +					     flags);
> +	}
> +
>   	memcpy(dst->data + dst->offset + offset, src, len);

>   	return 0;
> @@ -1555,6 +1594,7 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {

>   BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
>   {
> +	enum bpf_dynptr_type type;
>   	int err;

>   	if (!ptr->data)
> @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>   	if (bpf_dynptr_is_rdonly(ptr))
>   		return 0;

> +	type = bpf_dynptr_get_type(ptr);
> +
> +	if (type == BPF_DYNPTR_TYPE_SKB) {
> +		struct sk_buff *skb = ptr->data;
> +
> +		/* if the data is paged, the caller needs to pull it first */
> +		if (ptr->offset + offset + len > skb->len - skb->data_len)
> +			return 0;

Same here?

> +
> +		return (unsigned long)(skb->data + ptr->offset + offset);
> +	}
> +
>   	return (unsigned long)(ptr->data + ptr->offset + offset);
>   }

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 0d523741a543..0838653eeb4e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -263,6 +263,7 @@ struct bpf_call_arg_meta {
>   	u32 subprogno;
>   	struct bpf_map_value_off_desc *kptr_off_desc;
>   	u8 uninit_dynptr_regno;
> +	enum bpf_dynptr_type type;
>   };

>   struct btf *btf_vmlinux;
> @@ -678,6 +679,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
>   		return BPF_DYNPTR_TYPE_LOCAL;
>   	case DYNPTR_TYPE_RINGBUF:
>   		return BPF_DYNPTR_TYPE_RINGBUF;
> +	case DYNPTR_TYPE_SKB:
> +		return BPF_DYNPTR_TYPE_SKB;
>   	default:
>   		return BPF_DYNPTR_TYPE_INVALID;
>   	}
> @@ -5820,12 +5823,14 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
>   	return __check_ptr_off_reg(env, reg, regno, fixed_off_ok);
>   }

> -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> +				       struct bpf_call_arg_meta *meta)
>   {
>   	struct bpf_func_state *state = func(env, reg);
>   	int spi = get_spi(reg->off);

> -	return state->stack[spi].spilled_ptr.id;
> +	meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> +	meta->type = state->stack[spi].spilled_ptr.dynptr.type;
>   }

>   static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>   				case DYNPTR_TYPE_RINGBUF:
>   					err_extra = "ringbuf ";
>   					break;
> +				case DYNPTR_TYPE_SKB:
> +					err_extra = "skb ";
> +					break;
>   				default:
>   					break;
>   				}
> @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>   					verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
>   					return -EFAULT;
>   				}
> -				/* Find the id of the dynptr we're tracking the reference of */
> -				meta->ref_obj_id = stack_slot_get_id(env, reg);
> +				/* Find the id and the type of the dynptr we're tracking
> +				 * the reference of.
> +				 */
> +				stack_slot_get_dynptr_info(env, reg, meta);
>   			}
>   		}
>   		break;
> @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>   		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
>   	} else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
>   		mark_reg_known_zero(env, regs, BPF_REG_0);
> -		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> +		if (func_id == BPF_FUNC_dynptr_data &&
> +		    meta.type == BPF_DYNPTR_TYPE_SKB)
> +			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> +		else
> +			regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
>   		regs[BPF_REG_0].mem_size = meta.mem_size;
>   	} else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
>   		const struct btf_type *t;
> @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>   			goto patch_call_imm;
>   		}


[..]

> +		if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> +			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> +				insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> +			else
> +				insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> +			insn_buf[1] = *insn;
> +			cnt = 2;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta += cnt - 1;
> +			env->prog = new_prog;
> +			prog = new_prog;
> +			insn = new_prog->insnsi + i + delta;
> +			goto patch_call_imm;
> +		}

Would it be easier to have two separate helpers:
- BPF_FUNC_dynptr_from_skb
- BPF_FUNC_dynptr_from_skb_readonly

And make the verifier rewrite insn->imm to
BPF_FUNC_dynptr_from_skb_readonly when needed?

if (insn->imm == BPF_FUNC_dynptr_from_skb) {
	if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
		insn->imm = BPF_FUNC_dynptr_from_skb_readonly;
}

Or it's also ugly because we'd have to leak that new helper into UAPI?
(I wonder whether that hidden 4th argument is too magical, but probably
fine?)
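
Roughly, the read-only twin could be a thin variant of the existing
helper - a sketch only, the _readonly name and proto here are
hypothetical:

BPF_CALL_3(bpf_dynptr_from_skb_readonly, struct sk_buff *, skb, u64, flags,
	   struct bpf_dynptr_kern *, ptr)
{
	if (flags) {
		bpf_dynptr_set_null(ptr);
		return -EINVAL;
	}

	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
	/* the only difference from the writable variant */
	bpf_dynptr_set_rdonly(ptr);
	return 0;
}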

> +
>   		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
>   		 * and other inlining handlers are currently limited to 64 bit
>   		 * only.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5669248aff25..312f99deb759 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1681,8 +1681,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
>   		skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
>   }

> -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> -	   const void *, from, u32, len, u64, flags)
> +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> +			  u32 len, u64 flags)
>   {
>   	void *ptr;

> @@ -1707,6 +1707,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
>   	return 0;
>   }

> +BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> +	   const void *, from, u32, len, u64, flags)
> +{
> +	return __bpf_skb_store_bytes(skb, offset, from, len, flags);
> +}
> +
>   static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
>   	.func		= bpf_skb_store_bytes,
>   	.gpl_only	= false,
> @@ -1718,8 +1724,7 @@ static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
>   	.arg5_type	= ARG_ANYTHING,
>   };

> -BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
> -	   void *, to, u32, len)
> +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
>   {
>   	void *ptr;

> @@ -1738,6 +1743,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
>   	return -EFAULT;
>   }

> +BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
> +	   void *, to, u32, len)
> +{
> +	return __bpf_skb_load_bytes(skb, offset, to, len);
> +}
> +
>   static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
>   	.func		= bpf_skb_load_bytes,
>   	.gpl_only	= false,
> @@ -1849,6 +1860,32 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
>   	.arg2_type	= ARG_ANYTHING,
>   };

> +/* is_rdonly is set by the verifier */
> +BPF_CALL_4(bpf_dynptr_from_skb, struct sk_buff *, skb, u64, flags,
> +	   struct bpf_dynptr_kern *, ptr, u32, is_rdonly)
> +{
> +	if (flags) {
> +		bpf_dynptr_set_null(ptr);
> +		return -EINVAL;
> +	}
> +
> +	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
> +
> +	if (is_rdonly)
> +		bpf_dynptr_set_rdonly(ptr);
> +
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_dynptr_from_skb_proto = {
> +	.func		= bpf_dynptr_from_skb,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT,
> +};
> +
>   BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
>   {
>   	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> @@ -7808,6 +7845,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   		return &bpf_get_socket_uid_proto;
>   	case BPF_FUNC_perf_event_output:
>   		return &bpf_skb_event_output_proto;
> +	case BPF_FUNC_dynptr_from_skb:
> +		return &bpf_dynptr_from_skb_proto;
>   	default:
>   		return bpf_sk_base_func_proto(func_id);
>   	}
> @@ -7991,6 +8030,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
>   #endif
>   #endif
> +	case BPF_FUNC_dynptr_from_skb:
> +		return &bpf_dynptr_from_skb_proto;
>   	default:
>   		return bpf_sk_base_func_proto(func_id);
>   	}
> @@ -8186,6 +8227,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   	case BPF_FUNC_skc_lookup_tcp:
>   		return &bpf_skc_lookup_tcp_proto;
>   #endif
> +	case BPF_FUNC_dynptr_from_skb:
> +		return &bpf_dynptr_from_skb_proto;
>   	default:
>   		return bpf_sk_base_func_proto(func_id);
>   	}
> @@ -8224,6 +8267,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>   		return &bpf_get_smp_processor_id_proto;
>   	case BPF_FUNC_skb_under_cgroup:
>   		return &bpf_skb_under_cgroup_proto;
> +	case BPF_FUNC_dynptr_from_skb:
> +		return &bpf_dynptr_from_skb_proto;
>   	default:
>   		return bpf_sk_base_func_proto(func_id);
>   	}
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 59a217ca2dfd..0730cd198a7f 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5241,11 +5241,22 @@ union bpf_attr {
>    *	Description
>    *		Write *len* bytes from *src* into *dst*, starting from *offset*
>    *		into *dst*.
> - *		*flags* is currently unused.
> + *
> + *		*flags* must be 0 except for skb-type dynptrs.
> + *
> + *		For skb-type dynptrs:
> + *		    *  if *offset* + *len* extends into the skb's paged buffers, the user
> + *		       should manually pull the skb with bpf_skb_pull_data and then try again.
> + *
> + *		    *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
> + *			recompute the checksum for the packet after storing the bytes) and
> + *			**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + *			**->swhash** and *skb*\ **->l4hash** to 0).
>    *	Return
>    *		0 on success, -E2BIG if *offset* + *len* exceeds the length
>    *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> - *		is a read-only dynptr or if *flags* is not 0.
> + *		is a read-only dynptr or if *flags* is not correct, -EAGAIN if for
> + *		skb-type dynptrs the write extends into the skb's paged buffers.
>    *
>    * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
>    *	Description
> @@ -5253,10 +5264,19 @@ union bpf_attr {
>    *
>    *		*len* must be a statically known value. The returned data slice
>    *		is invalidated whenever the dynptr is invalidated.
> + *
> + *		For skb-type dynptrs:
> + *		    * if *offset* + *len* extends into the skb's paged buffers,
> + *		      the user should manually pull the skb with bpf_skb_pull_data and then
> + *		      try again.
> + *
> + *		    * the data slice is automatically invalidated anytime a
> + *		      helper call that changes the underlying packet buffer
> + *		      (eg bpf_skb_pull_data) is called.
>    *	Return
>    *		Pointer to the underlying dynptr data, NULL if the dynptr is
>    *		read-only, if the dynptr is invalid, or if the offset and length
> - *		is out of bounds.
> + *		is out of bounds or in a paged buffer for skb-type dynptrs.
>    *
>    * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>    *	Description
> @@ -5331,6 +5351,21 @@ union bpf_attr {
>    *		**-EACCES** if the SYN cookie is not valid.
>    *
>    *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
> + *
> + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
> + *	Description
> + *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
> + *		context. Depending on program type, the dynptr may be read-only,
> + *		in which case trying to obtain a direct data slice to it through
> + *		bpf_dynptr_data will return an error.
> + *
> + *		Calls that change the *skb*'s underlying packet buffer
> + *		(eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
> + *		invalidate any data slices associated with the dynptr.
> + *
> + *		*flags* is currently unused; it must be 0 for now.
> + *	Return
> + *		0 on success or -EINVAL if flags is not 0.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -5541,6 +5576,7 @@ union bpf_attr {
>   	FN(tcp_raw_gen_syncookie_ipv6),	\
>   	FN(tcp_raw_check_syncookie_ipv4),	\
>   	FN(tcp_raw_check_syncookie_ipv6),	\
> +	FN(dynptr_from_skb),		\
>   	/* */

>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.30.2


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-27 17:13   ` sdf
@ 2022-07-28 16:49     ` Joanne Koong
  2022-07-28 17:28       ` Stanislav Fomichev
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-07-28 16:49 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Wed, Jul 27, 2022 at 10:14 AM <sdf@google.com> wrote:
>
> On 07/26, Joanne Koong wrote:
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
>
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only (writes and data slices are not permitted). For reads on the
> > dynptr, this includes reading into data in the non-linear paged buffers
> > but for writes and data slices, if the data is in a paged buffer, the
> > user must first call bpf_skb_pull_data to pull the data into the linear
> > portion.
>
> > Additionally, any helper calls that change the underlying packet buffer
> > (eg bpf_skb_pull_data) invalidates any data slices of the associated
> > dynptr.
>
> > Right now, skb dynptrs can only be constructed from skbs that are
> > the bpf program context - as such, there does not need to be any
> > reference tracking or release on skb dynptrs.
>
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >   include/linux/bpf.h            |  8 ++++-
> >   include/linux/filter.h         |  4 +++
> >   include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
> >   kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
> >   kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
> >   net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
> >   tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
> >   7 files changed, 229 insertions(+), 17 deletions(-)
>
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 20c26aed7896..7fbd4324c848 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -407,11 +407,14 @@ enum bpf_type_flag {
> >       /* Size is known at compile time. */
> >       MEM_FIXED_SIZE          = BIT(10 + BPF_BASE_TYPE_BITS),
>
> > +     /* DYNPTR points to sk_buff */
> > +     DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
> > +
> >       __BPF_TYPE_FLAG_MAX,
> >       __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
> >   };
>
> > -#define DYNPTR_TYPE_FLAG_MASK        (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
> > +#define DYNPTR_TYPE_FLAG_MASK        (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF |
> > DYNPTR_TYPE_SKB)
>
> >   /* Max number of base types. */
> >   #define BPF_BASE_TYPE_LIMIT (1UL << BPF_BASE_TYPE_BITS)
> > @@ -2556,12 +2559,15 @@ enum bpf_dynptr_type {
> >       BPF_DYNPTR_TYPE_LOCAL,
> >       /* Underlying data is a ringbuf record */
> >       BPF_DYNPTR_TYPE_RINGBUF,
> > +     /* Underlying data is a sk_buff */
> > +     BPF_DYNPTR_TYPE_SKB,
> >   };
>
> >   void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
> >                    enum bpf_dynptr_type type, u32 offset, u32 size);
> >   void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
> >   int bpf_dynptr_check_size(u32 size);
> > +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
>
> >   #ifdef CONFIG_BPF_LSM
> >   void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index a5f21dc3c432..649063d9cbfd 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -1532,4 +1532,8 @@ static __always_inline int
> > __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
> >       return XDP_REDIRECT;
> >   }
>
> > +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void
> > *to, u32 len);
> > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void
> > *from,
> > +                       u32 len, u64 flags);
> > +
> >   #endif /* __LINUX_FILTER_H__ */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 59a217ca2dfd..0730cd198a7f 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -5241,11 +5241,22 @@ union bpf_attr {
> >    *  Description
> >    *          Write *len* bytes from *src* into *dst*, starting from *offset*
> >    *          into *dst*.
> > - *           *flags* is currently unused.
> > + *
> > + *           *flags* must be 0 except for skb-type dynptrs.
> > + *
> > + *           For skb-type dynptrs:
> > + *               *  if *offset* + *len* extends into the skb's paged buffers, the
> > user
> > + *                  should manually pull the skb with bpf_skb_pull and then try
> > again.
> > + *
> > + *               *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM**
> > (automatically
> > + *                   recompute the checksum for the packet after storing the bytes) and
> > + *                   **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> > + *                   **->swhash** and *skb*\ **->l4hash** to 0).
> >    *  Return
> >    *          0 on success, -E2BIG if *offset* + *len* exceeds the length
> >    *          of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> > - *           is a read-only dynptr or if *flags* is not 0.
> > + *           is a read-only dynptr or if *flags* is not correct, -EAGAIN if for
> > + *           skb-type dynptrs the write extends into the skb's paged buffers.
> >    *
> >    * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
> >    *  Description
> > @@ -5253,10 +5264,19 @@ union bpf_attr {
> >    *
> >    *          *len* must be a statically known value. The returned data slice
> >    *          is invalidated whenever the dynptr is invalidated.
> > + *
> > + *           For skb-type dynptrs:
> > + *               * if *offset* + *len* extends into the skb's paged buffers,
> > + *                 the user should manually pull the skb with bpf_skb_pull and
> > then
> > + *                 try again.
> > + *
> > + *               * the data slice is automatically invalidated anytime a
> > + *                 helper call that changes the underlying packet buffer
> > + *                 (eg bpf_skb_pull) is called.
> >    *  Return
> >    *          Pointer to the underlying dynptr data, NULL if the dynptr is
> >    *          read-only, if the dynptr is invalid, or if the offset and length
> > - *           is out of bounds.
> > + *           is out of bounds or in a paged buffer for skb-type dynptrs.
> >    *
> >    * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr
> > *th, u32 th_len)
> >    *  Description
> > @@ -5331,6 +5351,21 @@ union bpf_attr {
> >    *          **-EACCES** if the SYN cookie is not valid.
> >    *
> >    *          **-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
> > + *
> > + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct
> > bpf_dynptr *ptr)
> > + *   Description
> > + *           Get a dynptr to the data in *skb*. *skb* must be the BPF program
> > + *           context. Depending on program type, the dynptr may be read-only,
> > + *           in which case trying to obtain a direct data slice to it through
> > + *           bpf_dynptr_data will return an error.
> > + *
> > + *           Calls that change the *skb*'s underlying packet buffer
> > + *           (eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
> > + *           invalidate any data slices associated with the dynptr.
> > + *
> > + *           *flags* is currently unused, it must be 0 for now.
> > + *   Return
> > + *           0 on success or -EINVAL if flags is not 0.
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)               \
> >       FN(unspec),                     \
> > @@ -5541,6 +5576,7 @@ union bpf_attr {
> >       FN(tcp_raw_gen_syncookie_ipv6), \
> >       FN(tcp_raw_check_syncookie_ipv4),       \
> >       FN(tcp_raw_check_syncookie_ipv6),       \
> > +     FN(dynptr_from_skb),            \
> >       /* */
>
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which
> > helper
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 1f961f9982d2..21a806057e9e 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1425,11 +1425,21 @@ static bool bpf_dynptr_is_rdonly(struct
> > bpf_dynptr_kern *ptr)
> >       return ptr->size & DYNPTR_RDONLY_BIT;
> >   }
>
> > +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
> > +{
> > +     ptr->size |= DYNPTR_RDONLY_BIT;
> > +}
> > +
> >   static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum
> > bpf_dynptr_type type)
> >   {
> >       ptr->size |= type << DYNPTR_TYPE_SHIFT;
> >   }
>
> > +static enum bpf_dynptr_type bpf_dynptr_get_type(const struct
> > bpf_dynptr_kern *ptr)
> > +{
> > +     return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
> > +}
> > +
> >   static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
> >   {
> >       return ptr->size & DYNPTR_SIZE_MASK;
> > @@ -1500,6 +1510,7 @@ static const struct bpf_func_proto
> > bpf_dynptr_from_mem_proto = {
> >   BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct
> > bpf_dynptr_kern *, src,
> >          u32, offset, u64, flags)
> >   {
> > +     enum bpf_dynptr_type type;
> >       int err;
>
> >       if (!src->data || flags)
> > @@ -1509,6 +1520,11 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len,
> > struct bpf_dynptr_kern *, src
> >       if (err)
> >               return err;
>
> > +     type = bpf_dynptr_get_type(src);
> > +
> > +     if (type == BPF_DYNPTR_TYPE_SKB)
> > +             return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> > +
> >       memcpy(dst, src->data + src->offset + offset, len);
>
> >       return 0;
> > @@ -1528,15 +1544,38 @@ static const struct bpf_func_proto
> > bpf_dynptr_read_proto = {
> >   BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset,
> > void *, src,
> >          u32, len, u64, flags)
> >   {
> > +     enum bpf_dynptr_type type;
> >       int err;
>
> > -     if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> > +     if (!dst->data || bpf_dynptr_is_rdonly(dst))
> >               return -EINVAL;
>
> >       err = bpf_dynptr_check_off_len(dst, offset, len);
> >       if (err)
> >               return err;
>
> > +     type = bpf_dynptr_get_type(dst);
> > +
> > +     if (flags) {
> > +             if (type == BPF_DYNPTR_TYPE_SKB) {
> > +                     if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
> > +                             return -EINVAL;
> > +             } else {
> > +                     return -EINVAL;
> > +             }
> > +     }
> > +
> > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > +             struct sk_buff *skb = dst->data;
> > +
> > +             /* if the data is paged, the caller needs to pull it first */
> > +             if (dst->offset + offset + len > skb->len - skb->data_len)
>
> Use skb_headlen instead of 'skb->len - skb->data_len' ?
Awesome, will replace this (and the one in bpf_dynptr_data) with
skb_headlen() for v2. Thanks!
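
ie the write-path check will just become (assuming a direct
substitution; the bpf_dynptr_data one is analogous):

	/* if the data is paged, the caller needs to pull it first */
	if (dst->offset + offset + len > skb_headlen(skb))
		return -EAGAIN;
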
>
> > +                     return -EAGAIN;
> > +
> > +             return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
> > +                                          flags);
> > +     }
> > +
> >       memcpy(dst->data + dst->offset + offset, src, len);
>
> >       return 0;
> > @@ -1555,6 +1594,7 @@ static const struct bpf_func_proto
> > bpf_dynptr_write_proto = {
>
> >   BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset,
> > u32, len)
> >   {
> > +     enum bpf_dynptr_type type;
> >       int err;
>
> >       if (!ptr->data)
> > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern
> > *, ptr, u32, offset, u32, len
> >       if (bpf_dynptr_is_rdonly(ptr))
> >               return 0;
>
> > +     type = bpf_dynptr_get_type(ptr);
> > +
> > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > +             struct sk_buff *skb = ptr->data;
> > +
> > +             /* if the data is paged, the caller needs to pull it first */
> > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > +                     return 0;
>
> Same here?
>
> > +
> > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > +     }
> > +
> >       return (unsigned long)(ptr->data + ptr->offset + offset);
> >   }
>
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 0d523741a543..0838653eeb4e 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -263,6 +263,7 @@ struct bpf_call_arg_meta {
> >       u32 subprogno;
> >       struct bpf_map_value_off_desc *kptr_off_desc;
> >       u8 uninit_dynptr_regno;
> > +     enum bpf_dynptr_type type;
> >   };
>
> >   struct btf *btf_vmlinux;
> > @@ -678,6 +679,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum
> > bpf_arg_type arg_type)
> >               return BPF_DYNPTR_TYPE_LOCAL;
> >       case DYNPTR_TYPE_RINGBUF:
> >               return BPF_DYNPTR_TYPE_RINGBUF;
> > +     case DYNPTR_TYPE_SKB:
> > +             return BPF_DYNPTR_TYPE_SKB;
> >       default:
> >               return BPF_DYNPTR_TYPE_INVALID;
> >       }
> > @@ -5820,12 +5823,14 @@ int check_func_arg_reg_off(struct
> > bpf_verifier_env *env,
> >       return __check_ptr_off_reg(env, reg, regno, fixed_off_ok);
> >   }
>
> > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct
> > bpf_reg_state *reg)
> > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env,
> > struct bpf_reg_state *reg,
> > +                                    struct bpf_call_arg_meta *meta)
> >   {
> >       struct bpf_func_state *state = func(env, reg);
> >       int spi = get_spi(reg->off);
>
> > -     return state->stack[spi].spilled_ptr.id;
> > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> >   }
>
> >   static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env
> > *env, u32 arg,
> >                               case DYNPTR_TYPE_RINGBUF:
> >                                       err_extra = "ringbuf ";
> >                                       break;
> > +                             case DYNPTR_TYPE_SKB:
> > +                                     err_extra = "skb ";
> > +                                     break;
> >                               default:
> >                                       break;
> >                               }
> > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env
> > *env, u32 arg,
> >                                       verbose(env, "verifier internal error: multiple refcounted args in
> > BPF_FUNC_dynptr_data");
> >                                       return -EFAULT;
> >                               }
> > -                             /* Find the id of the dynptr we're tracking the reference of */
> > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > +                             /* Find the id and the type of the dynptr we're tracking
> > +                              * the reference of.
> > +                              */
> > +                             stack_slot_get_dynptr_info(env, reg, meta);
> >                       }
> >               }
> >               break;
> > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct
> > bpf_verifier_env *env, struct bpf_insn *insn
> >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > +             if (func_id == BPF_FUNC_dynptr_data &&
> > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > +             else
> > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> >               regs[BPF_REG_0].mem_size = meta.mem_size;
> >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> >               const struct btf_type *t;
> > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct
> > bpf_verifier_env *env)
> >                       goto patch_call_imm;
> >               }
>
>
> [..]
>
> > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > +                     else
> > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > +                     insn_buf[1] = *insn;
> > +                     cnt = 2;
> > +
> > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > +                     if (!new_prog)
> > +                             return -ENOMEM;
> > +
> > +                     delta += cnt - 1;
> > +                     env->prog = new_prog;
> > +                     prog = new_prog;
> > +                     insn = new_prog->insnsi + i + delta;
> > +                     goto patch_call_imm;
> > +             }
>
> Would it be easier to have two separate helpers:
> - BPF_FUNC_dynptr_from_skb
> - BPF_FUNC_dynptr_from_skb_readonly
>
> And make the verifier rewrite insn->imm to
> BPF_FUNC_dynptr_from_skb_readonly when needed?
>
> if (insn->imm == BPF_FUNC_dynptr_from_skb) {
>         if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
>                 insn->imm = BPF_FUNC_dynptr_from_skb_readonly;
> }
>
> Or it's also ugly because we'd have to leak that new helper into UAPI?
> (I wonder whether that hidden 4th argument is too magical, but probably
> fine?)
To me, having two separate helpers feels more cluttered, and having to
expose the extra helper in the uapi (though there is probably some way
to avoid this with some sort of ad hoc processing) doesn't seem ideal.
If you feel strongly about this though, I am happy to change this to
use two separate helpers. We already do this sort of manual instruction
patching for the sleepable flags in bpf_task/sk/inode_storage_get and
for the callback args in bpf_timer_set_callback - if we use separate
helpers here, we should do the same there as well to maintain
consistency.
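
For example, this is roughly (paraphrasing from memory, not verbatim)
how the storage_get helpers get their hidden gfp_flags argument patched
in do_misc_fixups():

	if (insn->imm == BPF_FUNC_task_storage_get ||
	    insn->imm == BPF_FUNC_sk_storage_get ||
	    insn->imm == BPF_FUNC_inode_storage_get) {
		if (env->prog->aux->sleepable)
			insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_KERNEL);
		else
			insn_buf[0] = BPF_MOV64_IMM(BPF_REG_5, (__force __s32)GFP_ATOMIC);
		insn_buf[1] = *insn;
		cnt = 2;
		/* ... then bpf_patch_insn_data(), as in the diff above ... */
	}
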
>
> > +
> >               /* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
> >                * and other inlining handlers are currently limited to 64 bit
> >                * only.
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 5669248aff25..312f99deb759 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -1681,8 +1681,8 @@ static inline void bpf_pull_mac_rcsum(struct
> > sk_buff *skb)
> >               skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
> >   }
>
> > -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > -        const void *, from, u32, len, u64, flags)
> > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void
> > *from,
> > +                       u32 len, u64 flags)
> >   {
> >       void *ptr;
>
> > @@ -1707,6 +1707,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *,
> > skb, u32, offset,
> >       return 0;
> >   }
>
> > +BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > +        const void *, from, u32, len, u64, flags)
> > +{
> > +     return __bpf_skb_store_bytes(skb, offset, from, len, flags);
> > +}
> > +
> >   static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
> >       .func           = bpf_skb_store_bytes,
> >       .gpl_only       = false,
> > @@ -1718,8 +1724,7 @@ static const struct bpf_func_proto
> > bpf_skb_store_bytes_proto = {
> >       .arg5_type      = ARG_ANYTHING,
> >   };
>
> > -BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
> > -        void *, to, u32, len)
> > +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void
> > *to, u32 len)
> >   {
> >       void *ptr;
>
> > @@ -1738,6 +1743,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct
> > sk_buff *, skb, u32, offset,
> >       return -EFAULT;
> >   }
>
> > +BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
> > +        void *, to, u32, len)
> > +{
> > +     return __bpf_skb_load_bytes(skb, offset, to, len);
> > +}
> > +
> >   static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
> >       .func           = bpf_skb_load_bytes,
> >       .gpl_only       = false,
> > @@ -1849,6 +1860,32 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
> >       .arg2_type      = ARG_ANYTHING,
> >   };
>
> > +/* is_rdonly is set by the verifier */
> > +BPF_CALL_4(bpf_dynptr_from_skb, struct sk_buff *, skb, u64, flags,
> > +        struct bpf_dynptr_kern *, ptr, u32, is_rdonly)
> > +{
> > +     if (flags) {
> > +             bpf_dynptr_set_null(ptr);
> > +             return -EINVAL;
> > +     }
> > +
> > +     bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
> > +
> > +     if (is_rdonly)
> > +             bpf_dynptr_set_rdonly(ptr);
> > +
> > +     return 0;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_dynptr_from_skb_proto = {
> > +     .func           = bpf_dynptr_from_skb,
> > +     .gpl_only       = false,
> > +     .ret_type       = RET_INTEGER,
> > +     .arg1_type      = ARG_PTR_TO_CTX,
> > +     .arg2_type      = ARG_ANYTHING,
> > +     .arg3_type      = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT,
> > +};
> > +
> >   BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> >   {
> >       return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> > @@ -7808,6 +7845,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >               return &bpf_get_socket_uid_proto;
> >       case BPF_FUNC_perf_event_output:
> >               return &bpf_skb_event_output_proto;
> > +     case BPF_FUNC_dynptr_from_skb:
> > +             return &bpf_dynptr_from_skb_proto;
> >       default:
> >               return bpf_sk_base_func_proto(func_id);
> >       }
> > @@ -7991,6 +8030,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >               return &bpf_tcp_raw_check_syncookie_ipv6_proto;
> >   #endif
> >   #endif
> > +     case BPF_FUNC_dynptr_from_skb:
> > +             return &bpf_dynptr_from_skb_proto;
> >       default:
> >               return bpf_sk_base_func_proto(func_id);
> >       }
> > @@ -8186,6 +8227,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >       case BPF_FUNC_skc_lookup_tcp:
> >               return &bpf_skc_lookup_tcp_proto;
> >   #endif
> > +     case BPF_FUNC_dynptr_from_skb:
> > +             return &bpf_dynptr_from_skb_proto;
> >       default:
> >               return bpf_sk_base_func_proto(func_id);
> >       }
> > @@ -8224,6 +8267,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >               return &bpf_get_smp_processor_id_proto;
> >       case BPF_FUNC_skb_under_cgroup:
> >               return &bpf_skb_under_cgroup_proto;
> > +     case BPF_FUNC_dynptr_from_skb:
> > +             return &bpf_dynptr_from_skb_proto;
> >       default:
> >               return bpf_sk_base_func_proto(func_id);
> >       }
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 59a217ca2dfd..0730cd198a7f 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -5241,11 +5241,22 @@ union bpf_attr {
> >    *  Description
> >    *          Write *len* bytes from *src* into *dst*, starting from *offset*
> >    *          into *dst*.
> > - *           *flags* is currently unused.
> > + *
> > + *           *flags* must be 0 except for skb-type dynptrs.
> > + *
> > + *           For skb-type dynptrs:
> > + *               *  if *offset* + *len* extends into the skb's paged buffers, the user
> > + *                  should manually pull the skb with bpf_skb_pull and then try again.
> > + *
> > + *               *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
> > + *                   recompute the checksum for the packet after storing the bytes) and
> > + *                   **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> > + *                   **->swhash** and *skb*\ **->l4hash** to 0).
> >    *  Return
> >    *          0 on success, -E2BIG if *offset* + *len* exceeds the length
> >    *          of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> > - *           is a read-only dynptr or if *flags* is not 0.
> > + *           is a read-only dynptr or if *flags* is not correct, -EAGAIN if for
> > + *           skb-type dynptrs the write extends into the skb's paged buffers.
> >    *
> >    * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
> >    *  Description
> > @@ -5253,10 +5264,19 @@ union bpf_attr {
> >    *
> >    *          *len* must be a statically known value. The returned data slice
> >    *          is invalidated whenever the dynptr is invalidated.
> > + *
> > + *           For skb-type dynptrs:
> > + *               * if *offset* + *len* extends into the skb's paged buffers,
> > + *                 the user should manually pull the skb with bpf_skb_pull and then
> > + *                 try again.
> > + *
> > + *               * the data slice is automatically invalidated anytime a
> > + *                 helper call that changes the underlying packet buffer
> > + *                 (eg bpf_skb_pull) is called.
> >    *  Return
> >    *          Pointer to the underlying dynptr data, NULL if the dynptr is
> >    *          read-only, if the dynptr is invalid, or if the offset and length
> > - *           is out of bounds.
> > + *           is out of bounds or in a paged buffer for skb-type dynptrs.
> >    *
> >    * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
> >    *  Description
> > @@ -5331,6 +5351,21 @@ union bpf_attr {
> >    *          **-EACCES** if the SYN cookie is not valid.
> >    *
> >    *          **-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
> > + *
> > + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
> > + *   Description
> > + *           Get a dynptr to the data in *skb*. *skb* must be the BPF program
> > + *           context. Depending on program type, the dynptr may be read-only,
> > + *           in which case trying to obtain a direct data slice to it through
> > + *           bpf_dynptr_data will return an error.
> > + *
> > + *           Calls that change the *skb*'s underlying packet buffer
> > + *           (eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
> > + *           invalidate any data slices associated with the dynptr.
> > + *
> > + *           *flags* is currently unused, it must be 0 for now.
> > + *   Return
> > + *           0 on success or -EINVAL if flags is not 0.
> >    */
> >   #define __BPF_FUNC_MAPPER(FN)               \
> >       FN(unspec),                     \
> > @@ -5541,6 +5576,7 @@ union bpf_attr {
> >       FN(tcp_raw_gen_syncookie_ipv6), \
> >       FN(tcp_raw_check_syncookie_ipv4),       \
> >       FN(tcp_raw_check_syncookie_ipv6),       \
> > +     FN(dynptr_from_skb),            \
> >       /* */
>
> >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > --
> > 2.30.2
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-28 16:49     ` Joanne Koong
@ 2022-07-28 17:28       ` Stanislav Fomichev
  0 siblings, 0 replies; 52+ messages in thread
From: Stanislav Fomichev @ 2022-07-28 17:28 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Thu, Jul 28, 2022 at 9:49 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Jul 27, 2022 at 10:14 AM <sdf@google.com> wrote:
> >
> > On 07/26, Joanne Koong wrote:
> > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > benefits. One is that they allow operations on sizes that are not
> > > statically known at compile-time (eg variable-sized accesses).
> > > Another is that parsing the packet data through dynptrs (instead of
> > > through direct access of skb->data and skb->data_end) can be more
> > > ergonomic and less brittle (eg does not need manual if checking for
> > > being within bounds of data_end).
> >
> > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > read-only (writes and data slices are not permitted). For reads on the
> > > dynptr, this includes reading into data in the non-linear paged buffers
> > > but for writes and data slices, if the data is in a paged buffer, the
> > > user must first call bpf_skb_pull_data to pull the data into the linear
> > > portion.
> >
> > > Additionally, any helper calls that change the underlying packet buffer
> > > (eg bpf_skb_pull_data) invalidates any data slices of the associated
> > > dynptr.
> >
> > > Right now, skb dynptrs can only be constructed from skbs that are
> > > the bpf program context - as such, there does not need to be any
> > > reference tracking or release on skb dynptrs.
> >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > >   include/linux/bpf.h            |  8 ++++-
> > >   include/linux/filter.h         |  4 +++
> > >   include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
> > >   kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
> > >   kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
> > >   net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
> > >   tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
> > >   7 files changed, 229 insertions(+), 17 deletions(-)
> >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 20c26aed7896..7fbd4324c848 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -407,11 +407,14 @@ enum bpf_type_flag {
> > >       /* Size is known at compile time. */
> > >       MEM_FIXED_SIZE          = BIT(10 + BPF_BASE_TYPE_BITS),
> >
> > > +     /* DYNPTR points to sk_buff */
> > > +     DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
> > > +
> > >       __BPF_TYPE_FLAG_MAX,
> > >       __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
> > >   };
> >
> > > -#define DYNPTR_TYPE_FLAG_MASK        (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
> > > +#define DYNPTR_TYPE_FLAG_MASK        (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
> >
> > >   /* Max number of base types. */
> > >   #define BPF_BASE_TYPE_LIMIT (1UL << BPF_BASE_TYPE_BITS)
> > > @@ -2556,12 +2559,15 @@ enum bpf_dynptr_type {
> > >       BPF_DYNPTR_TYPE_LOCAL,
> > >       /* Underlying data is a ringbuf record */
> > >       BPF_DYNPTR_TYPE_RINGBUF,
> > > +     /* Underlying data is a sk_buff */
> > > +     BPF_DYNPTR_TYPE_SKB,
> > >   };
> >
> > >   void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
> > >                    enum bpf_dynptr_type type, u32 offset, u32 size);
> > >   void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
> > >   int bpf_dynptr_check_size(u32 size);
> > > +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
> >
> > >   #ifdef CONFIG_BPF_LSM
> > >   void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
> > > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > > index a5f21dc3c432..649063d9cbfd 100644
> > > --- a/include/linux/filter.h
> > > +++ b/include/linux/filter.h
> > > @@ -1532,4 +1532,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
> > >       return XDP_REDIRECT;
> > >   }
> >
> > > +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
> > > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> > > +                       u32 len, u64 flags);
> > > +
> > >   #endif /* __LINUX_FILTER_H__ */
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 59a217ca2dfd..0730cd198a7f 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -5241,11 +5241,22 @@ union bpf_attr {
> > >    *  Description
> > >    *          Write *len* bytes from *src* into *dst*, starting from *offset*
> > >    *          into *dst*.
> > > - *           *flags* is currently unused.
> > > + *
> > > + *           *flags* must be 0 except for skb-type dynptrs.
> > > + *
> > > + *           For skb-type dynptrs:
> > > + *               *  if *offset* + *len* extends into the skb's paged buffers, the user
> > > + *                  should manually pull the skb with bpf_skb_pull and then try again.
> > > + *
> > > + *               *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
> > > + *                   recompute the checksum for the packet after storing the bytes) and
> > > + *                   **BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> > > + *                   **->swhash** and *skb*\ **->l4hash** to 0).
> > >    *  Return
> > >    *          0 on success, -E2BIG if *offset* + *len* exceeds the length
> > >    *          of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> > > - *           is a read-only dynptr or if *flags* is not 0.
> > > + *           is a read-only dynptr or if *flags* is not correct, -EAGAIN if for
> > > + *           skb-type dynptrs the write extends into the skb's paged buffers.
> > >    *
> > >    * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
> > >    *  Description
> > > @@ -5253,10 +5264,19 @@ union bpf_attr {
> > >    *
> > >    *          *len* must be a statically known value. The returned data slice
> > >    *          is invalidated whenever the dynptr is invalidated.
> > > + *
> > > + *           For skb-type dynptrs:
> > > + *               * if *offset* + *len* extends into the skb's paged buffers,
> > > + *                 the user should manually pull the skb with bpf_skb_pull and then
> > > + *                 try again.
> > > + *
> > > + *               * the data slice is automatically invalidated anytime a
> > > + *                 helper call that changes the underlying packet buffer
> > > + *                 (eg bpf_skb_pull) is called.
> > >    *  Return
> > >    *          Pointer to the underlying dynptr data, NULL if the dynptr is
> > >    *          read-only, if the dynptr is invalid, or if the offset and length
> > > - *           is out of bounds.
> > > + *           is out of bounds or in a paged buffer for skb-type dynptrs.
> > >    *
> > >    * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
> > >    *  Description
> > > @@ -5331,6 +5351,21 @@ union bpf_attr {
> > >    *          **-EACCES** if the SYN cookie is not valid.
> > >    *
> > >    *          **-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
> > > + *
> > > + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
> > > + *   Description
> > > + *           Get a dynptr to the data in *skb*. *skb* must be the BPF program
> > > + *           context. Depending on program type, the dynptr may be read-only,
> > > + *           in which case trying to obtain a direct data slice to it through
> > > + *           bpf_dynptr_data will return an error.
> > > + *
> > > + *           Calls that change the *skb*'s underlying packet buffer
> > > + *           (eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
> > > + *           invalidate any data slices associated with the dynptr.
> > > + *
> > > + *           *flags* is currently unused, it must be 0 for now.
> > > + *   Return
> > > + *           0 on success or -EINVAL if flags is not 0.
> > >    */
> > >   #define __BPF_FUNC_MAPPER(FN)               \
> > >       FN(unspec),                     \
> > > @@ -5541,6 +5576,7 @@ union bpf_attr {
> > >       FN(tcp_raw_gen_syncookie_ipv6), \
> > >       FN(tcp_raw_check_syncookie_ipv4),       \
> > >       FN(tcp_raw_check_syncookie_ipv6),       \
> > > +     FN(dynptr_from_skb),            \
> > >       /* */
> >
> > >   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index 1f961f9982d2..21a806057e9e 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> > > @@ -1425,11 +1425,21 @@ static bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)
> > >       return ptr->size & DYNPTR_RDONLY_BIT;
> > >   }
> >
> > > +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
> > > +{
> > > +     ptr->size |= DYNPTR_RDONLY_BIT;
> > > +}
> > > +
> > >   static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
> > >   {
> > >       ptr->size |= type << DYNPTR_TYPE_SHIFT;
> > >   }
> >
> > > +static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
> > > +{
> > > +     return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
> > > +}
> > > +
> > >   static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
> > >   {
> > >       return ptr->size & DYNPTR_SIZE_MASK;
> > > @@ -1500,6 +1510,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
> > >   BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src,
> > >          u32, offset, u64, flags)
> > >   {
> > > +     enum bpf_dynptr_type type;
> > >       int err;
> >
> > >       if (!src->data || flags)
> > > @@ -1509,6 +1520,11 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
> > >       if (err)
> > >               return err;
> >
> > > +     type = bpf_dynptr_get_type(src);
> > > +
> > > +     if (type == BPF_DYNPTR_TYPE_SKB)
> > > +             return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> > > +
> > >       memcpy(dst, src->data + src->offset + offset, len);
> >
> > >       return 0;
> > > @@ -1528,15 +1544,38 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
> > >   BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
> > >          u32, len, u64, flags)
> > >   {
> > > +     enum bpf_dynptr_type type;
> > >       int err;
> >
> > > -     if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> > > +     if (!dst->data || bpf_dynptr_is_rdonly(dst))
> > >               return -EINVAL;
> >
> > >       err = bpf_dynptr_check_off_len(dst, offset, len);
> > >       if (err)
> > >               return err;
> >
> > > +     type = bpf_dynptr_get_type(dst);
> > > +
> > > +     if (flags) {
> > > +             if (type == BPF_DYNPTR_TYPE_SKB) {
> > > +                     if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
> > > +                             return -EINVAL;
> > > +             } else {
> > > +                     return -EINVAL;
> > > +             }
> > > +     }
> > > +
> > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > +             struct sk_buff *skb = dst->data;
> > > +
> > > +             /* if the data is paged, the caller needs to pull it first */
> > > +             if (dst->offset + offset + len > skb->len - skb->data_len)
> >
> > Use skb_headlen instead of 'skb->len - skb->data_len' ?
> Awesome, will replace this (and the one in bpf_dynptr_data) with
> skb_headlen() for v2. thanks!
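
For anyone following along, skb_headlen() in include/linux/skbuff.h
is exactly that computation:

    static inline unsigned int skb_headlen(const struct sk_buff *skb)
    {
            return skb->len - skb->data_len;
    }

so this is purely a readability change.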
> >
> > > +                     return -EAGAIN;
> > > +
> > > +             return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
> > > +                                          flags);
> > > +     }
> > > +
> > >       memcpy(dst->data + dst->offset + offset, src, len);
> >
> > >       return 0;
> > > @@ -1555,6 +1594,7 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
> >
> > >   BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
> > >   {
> > > +     enum bpf_dynptr_type type;
> > >       int err;
> >
> > >       if (!ptr->data)
> > > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> > >       if (bpf_dynptr_is_rdonly(ptr))
> > >               return 0;
> >
> > > +     type = bpf_dynptr_get_type(ptr);
> > > +
> > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > +             struct sk_buff *skb = ptr->data;
> > > +
> > > +             /* if the data is paged, the caller needs to pull it first */
> > > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > > +                     return 0;
> >
> > Same here?
> >
> > > +
> > > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > > +     }
> > > +
> > >       return (unsigned long)(ptr->data + ptr->offset + offset);
> > >   }
> >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 0d523741a543..0838653eeb4e 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -263,6 +263,7 @@ struct bpf_call_arg_meta {
> > >       u32 subprogno;
> > >       struct bpf_map_value_off_desc *kptr_off_desc;
> > >       u8 uninit_dynptr_regno;
> > > +     enum bpf_dynptr_type type;
> > >   };
> >
> > >   struct btf *btf_vmlinux;
> > > @@ -678,6 +679,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
> > >               return BPF_DYNPTR_TYPE_LOCAL;
> > >       case DYNPTR_TYPE_RINGBUF:
> > >               return BPF_DYNPTR_TYPE_RINGBUF;
> > > +     case DYNPTR_TYPE_SKB:
> > > +             return BPF_DYNPTR_TYPE_SKB;
> > >       default:
> > >               return BPF_DYNPTR_TYPE_INVALID;
> > >       }
> > > @@ -5820,12 +5823,14 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
> > >       return __check_ptr_off_reg(env, reg, regno, fixed_off_ok);
> > >   }
> >
> > > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > > +                                    struct bpf_call_arg_meta *meta)
> > >   {
> > >       struct bpf_func_state *state = func(env, reg);
> > >       int spi = get_spi(reg->off);
> >
> > > -     return state->stack[spi].spilled_ptr.id;
> > > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> > >   }
> >
> > >   static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                               case DYNPTR_TYPE_RINGBUF:
> > >                                       err_extra = "ringbuf ";
> > >                                       break;
> > > +                             case DYNPTR_TYPE_SKB:
> > > +                                     err_extra = "skb ";
> > > +                                     break;
> > >                               default:
> > >                                       break;
> > >                               }
> > > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> > >                                       return -EFAULT;
> > >                               }
> > > -                             /* Find the id of the dynptr we're tracking the reference of */
> > > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > > +                             /* Find the id and the type of the dynptr we're tracking
> > > +                              * the reference of.
> > > +                              */
> > > +                             stack_slot_get_dynptr_info(env, reg, meta);
> > >                       }
> > >               }
> > >               break;
> > > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> > >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > +             else
> > > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> > >               const struct btf_type *t;
> > > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >                       goto patch_call_imm;
> > >               }
> >
> >
> > [..]
> >
> > > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > > +                     else
> > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > > +                     insn_buf[1] = *insn;
> > > +                     cnt = 2;
> > > +
> > > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > +                     if (!new_prog)
> > > +                             return -ENOMEM;
> > > +
> > > +                     delta += cnt - 1;
> > > +                     env->prog = new_prog;
> > > +                     prog = new_prog;
> > > +                     insn = new_prog->insnsi + i + delta;
> > > +                     goto patch_call_imm;
> > > +             }
> >
> > Would it be easier to have two separate helpers:
> > - BPF_FUNC_dynptr_from_skb
> > - BPF_FUNC_dynptr_from_skb_readonly
> >
> > And make the verifier rewrite insn->imm to
> > BPF_FUNC_dynptr_from_skb_readonly when needed?
> >
> > if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> >         if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> >                 insn->imm = BPF_FUNC_dynptr_from_skb_readonly;
> > }
> >
> > Or it's also ugly because we'd have to leak that new helper into UAPI?
> > (I wonder whether that hidden 4th argument is too magical, but probably
> > fine?)
> To me, having 2 separate helpers feels more cluttered, and having to
> expose the second one in the uapi (though there is probably some way
> to avoid that with some ad hoc processing) doesn't seem ideal. If you
> feel strongly about this though, I am happy to change this to use two
> separate helpers. We do this sort of manual instruction patching for
> the sleepable flags in bpf_task/sk/inode_storage_get and for the
> callback args in bpf_timer_set_callback as well - if we use separate
> helpers here, we should do the same for those cases to maintain
> consistency.

No, I don't feel strongly, let's keep it, especially if there is prior art :-)
I am just wondering whether there is some better alternative, but
extra helpers also have their downsides.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
  2022-07-27 17:13   ` sdf
@ 2022-07-28 17:45   ` Hao Luo
  2022-07-28 18:36     ` Joanne Koong
  2022-07-28 23:39   ` Martin KaFai Lau
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 52+ messages in thread
From: Hao Luo @ 2022-07-28 17:45 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

Hi, Joanne,

On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).
>
> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only (writes and data slices are not permitted). For reads on the
> dynptr, this includes reading into data in the non-linear paged buffers
> but for writes and data slices, if the data is in a paged buffer, the
> user must first call bpf_skb_pull_data to pull the data into the linear
> portion.
>
> Additionally, any helper calls that change the underlying packet buffer
> (eg bpf_skb_pull_data) invalidates any data slices of the associated
> dynptr.
>
> Right now, skb dynptrs can only be constructed from skbs that are
> the bpf program context - as such, there does not need to be any
> reference tracking or release on skb dynptrs.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            |  8 ++++-
>  include/linux/filter.h         |  4 +++
>  include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
>  kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
>  kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
>  net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
>  7 files changed, 229 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 20c26aed7896..7fbd4324c848 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -407,11 +407,14 @@ enum bpf_type_flag {
>         /* Size is known at compile time. */
>         MEM_FIXED_SIZE          = BIT(10 + BPF_BASE_TYPE_BITS),
>
> +       /* DYNPTR points to sk_buff */
> +       DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
> +
>         __BPF_TYPE_FLAG_MAX,
>         __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
>  };
>
> -#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
> +#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
>

I wonder if we could maximize the use of these flags by combining them
with other base types, not just DYNPTR. For example, does TYPE_LOCAL
indicate memory is on stack? If so, can we apply LOCAL on PTR_TO_MEM?
If we have PTR_TO_MEM + LOCAL, can it be used to replace PTR_TO_STACK
in some scenarios?

WDYT?

>  /* Max number of base types. */
>  #define BPF_BASE_TYPE_LIMIT    (1UL << BPF_BASE_TYPE_BITS)
> @@ -2556,12 +2559,15 @@ enum bpf_dynptr_type {
>         BPF_DYNPTR_TYPE_LOCAL,
>         /* Underlying data is a ringbuf record */
>         BPF_DYNPTR_TYPE_RINGBUF,
> +       /* Underlying data is a sk_buff */
> +       BPF_DYNPTR_TYPE_SKB,
>  };
>
<...>
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-28 17:45   ` Hao Luo
@ 2022-07-28 18:36     ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-07-28 18:36 UTC (permalink / raw)
  To: Hao Luo; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Thu, Jul 28, 2022 at 10:45 AM Hao Luo <haoluo@google.com> wrote:
>
> Hi, Joanne,
>
> On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
> >
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only (writes and data slices are not permitted). For reads on the
> > dynptr, this includes reading into data in the non-linear paged buffers
> > but for writes and data slices, if the data is in a paged buffer, the
> > user must first call bpf_skb_pull_data to pull the data into the linear
> > portion.
> >
> > Additionally, any helper calls that change the underlying packet buffer
> > (eg bpf_skb_pull_data) invalidates any data slices of the associated
> > dynptr.
> >
> > Right now, skb dynptrs can only be constructed from skbs that are
> > the bpf program context - as such, there does not need to be any
> > reference tracking or release on skb dynptrs.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/bpf.h            |  8 ++++-
> >  include/linux/filter.h         |  4 +++
> >  include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
> >  kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
> >  kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
> >  net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
> >  tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
> >  7 files changed, 229 insertions(+), 17 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 20c26aed7896..7fbd4324c848 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -407,11 +407,14 @@ enum bpf_type_flag {
> >         /* Size is known at compile time. */
> >         MEM_FIXED_SIZE          = BIT(10 + BPF_BASE_TYPE_BITS),
> >
> > +       /* DYNPTR points to sk_buff */
> > +       DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
> > +
> >         __BPF_TYPE_FLAG_MAX,
> >         __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
> >  };
> >
> > -#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
> > +#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
> >
>
> I wonder if we could maximize the use of these flags by combining them
> with other base types, not just DYNPTR. For example, does TYPE_LOCAL
> indicate memory is on stack? If so, can we apply LOCAL on PTR_TO_MEM?
> If we have PTR_TO_MEM + LOCAL, can it be used to replace PTR_TO_STACK
> in some scenarios?
>
> WDYT?

Hi Hao. I love the idea, but unfortunately I don't think it applies
neatly in this case. "local" in the context of dynptrs means memory
that is local to the bpf program, which includes more than just
memory on the stack.
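
For example, something like this sketch (assuming an array map named
"array_map" with at least 8-byte values is defined elsewhere) builds a
local dynptr over map value memory using the existing
bpf_dynptr_from_mem() helper:

    struct bpf_dynptr ptr;
    __u32 key = 0;
    char *val = bpf_map_lookup_elem(&array_map, &key);

    if (val)
            /* a "local" dynptr over map value memory, not stack memory */
            bpf_dynptr_from_mem(val, 8, 0, &ptr);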

>
> >  /* Max number of base types. */
> >  #define BPF_BASE_TYPE_LIMIT    (1UL << BPF_BASE_TYPE_BITS)
> > @@ -2556,12 +2559,15 @@ enum bpf_dynptr_type {
> >         BPF_DYNPTR_TYPE_LOCAL,
> >         /* Underlying data is a ringbuf record */
> >         BPF_DYNPTR_TYPE_RINGBUF,
> > +       /* Underlying data is a sk_buff */
> > +       BPF_DYNPTR_TYPE_SKB,
> >  };
> >
> <...>
> >
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > --
> > 2.30.2
> >

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
  2022-07-27 17:13   ` sdf
  2022-07-28 17:45   ` Hao Luo
@ 2022-07-28 23:39   ` Martin KaFai Lau
  2022-07-29 20:26     ` Joanne Koong
  2022-08-01 22:11   ` Andrii Nakryiko
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 52+ messages in thread
From: Martin KaFai Lau @ 2022-07-28 23:39 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>  	if (bpf_dynptr_is_rdonly(ptr))
Is it possible to allow a data slice for a rdonly dynptr-skb,
and have it depend on the may_access_direct_pkt_data() check in the verifier?

>  		return 0;
>  
> +	type = bpf_dynptr_get_type(ptr);
> +
> +	if (type == BPF_DYNPTR_TYPE_SKB) {
> +		struct sk_buff *skb = ptr->data;
> +
> +		/* if the data is paged, the caller needs to pull it first */
> +		if (ptr->offset + offset + len > skb->len - skb->data_len)
> +			return 0;
> +
> +		return (unsigned long)(skb->data + ptr->offset + offset);
> +	}
> +
>  	return (unsigned long)(ptr->data + ptr->offset + offset);
>  }

[ ... ]

> -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> +				       struct bpf_call_arg_meta *meta)
>  {
>  	struct bpf_func_state *state = func(env, reg);
>  	int spi = get_spi(reg->off);
>  
> -	return state->stack[spi].spilled_ptr.id;
> +	meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> +	meta->type = state->stack[spi].spilled_ptr.dynptr.type;
>  }
>  
>  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  				case DYNPTR_TYPE_RINGBUF:
>  					err_extra = "ringbuf ";
>  					break;
> +				case DYNPTR_TYPE_SKB:
> +					err_extra = "skb ";
> +					break;
>  				default:
>  					break;
>  				}
> @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  					verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
>  					return -EFAULT;
>  				}
> -				/* Find the id of the dynptr we're tracking the reference of */
> -				meta->ref_obj_id = stack_slot_get_id(env, reg);
> +				/* Find the id and the type of the dynptr we're tracking
> +				 * the reference of.
> +				 */
> +				stack_slot_get_dynptr_info(env, reg, meta);
>  			}
>  		}
>  		break;
> @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
>  	} else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
>  		mark_reg_known_zero(env, regs, BPF_REG_0);
> -		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> +		if (func_id == BPF_FUNC_dynptr_data &&
> +		    meta.type == BPF_DYNPTR_TYPE_SKB)
> +			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> +		else
> +			regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
>  		regs[BPF_REG_0].mem_size = meta.mem_size;
check_packet_access() uses range.
It took me a while to figure out that range and mem_size are in a union.
Mentioning it here in case someone has a similar question.

>  	} else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
>  		const struct btf_type *t;
> @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>  			goto patch_call_imm;
>  		}
>  
> +		if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> +			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> +				insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> +			else
> +				insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> +			insn_buf[1] = *insn;
> +			cnt = 2;
> +
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta += cnt - 1;
> +			env->prog = new_prog;
> +			prog = new_prog;
> +			insn = new_prog->insnsi + i + delta;
> +			goto patch_call_imm;
> +		}
Have you considered rejecting bpf_dynptr_write()
at prog load time?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-28 23:39   ` Martin KaFai Lau
@ 2022-07-29 20:26     ` Joanne Koong
  2022-07-29 21:39       ` Martin KaFai Lau
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-07-29 20:26 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Thu, Jul 28, 2022 at 4:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> >       if (bpf_dynptr_is_rdonly(ptr))
> Is it possible to allow a data slice for a rdonly dynptr-skb,
> and have it depend on the may_access_direct_pkt_data() check in the verifier?

Ooh great idea. This should be very simple to do, since the data slice
that gets returned is assigned as PTR_TO_PACKET. So any stx operations
on it will by default go through the may_access_direct_pkt_data()
check. I'll add this for v2.
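
Roughly, a sketch of the intended v2 behavior ("proto" is just an
illustrative local variable here):

    struct bpf_dynptr ptr;
    struct ethhdr *eth;
    __be16 proto;

    bpf_dynptr_from_skb(skb, 0, &ptr);
    eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
    if (!eth)
            return 0;
    proto = eth->h_proto;   /* loads through the slice are fine */
    /* eth->h_proto = 0;       a store would be rejected at load time by
     *                         may_access_direct_pkt_data() for read-only
     *                         prog types */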

>
> >               return 0;
> >
> > +     type = bpf_dynptr_get_type(ptr);
> > +
> > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > +             struct sk_buff *skb = ptr->data;
> > +
> > +             /* if the data is paged, the caller needs to pull it first */
> > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > +                     return 0;
> > +
> > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > +     }
> > +
> >       return (unsigned long)(ptr->data + ptr->offset + offset);
> >  }
>
> [ ... ]
>
> > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > +                                    struct bpf_call_arg_meta *meta)
> >  {
> >       struct bpf_func_state *state = func(env, reg);
> >       int spi = get_spi(reg->off);
> >
> > -     return state->stack[spi].spilled_ptr.id;
> > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> >  }
> >
> >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >                               case DYNPTR_TYPE_RINGBUF:
> >                                       err_extra = "ringbuf ";
> >                                       break;
> > +                             case DYNPTR_TYPE_SKB:
> > +                                     err_extra = "skb ";
> > +                                     break;
> >                               default:
> >                                       break;
> >                               }
> > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> >                                       return -EFAULT;
> >                               }
> > -                             /* Find the id of the dynptr we're tracking the reference of */
> > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > +                             /* Find the id and the type of the dynptr we're tracking
> > +                              * the reference of.
> > +                              */
> > +                             stack_slot_get_dynptr_info(env, reg, meta);
> >                       }
> >               }
> >               break;
> > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > +             if (func_id == BPF_FUNC_dynptr_data &&
> > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > +             else
> > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > check_packet_access() uses range.
> > It took me a while to figure out that range and mem_size are in a union.
> > Mentioning it here in case someone has a similar question.
For v2, I'll add this as a comment in the code or I'll include
"regs[BPF_REG_0].range = meta.mem_size" explicitly to make it more
obvious :)
>
> >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> >               const struct btf_type *t;
> > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >                       goto patch_call_imm;
> >               }
> >
> > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > +                     else
> > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > +                     insn_buf[1] = *insn;
> > +                     cnt = 2;
> > +
> > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > +                     if (!new_prog)
> > +                             return -ENOMEM;
> > +
> > +                     delta += cnt - 1;
> > +                     env->prog = new_prog;
> > +                     prog = new_prog;
> > +                     insn = new_prog->insnsi + i + delta;
> > +                     goto patch_call_imm;
> > +             }
> Have you considered rejecting bpf_dynptr_write()
> at prog load time?
It's possible to reject bpf_dynptr_write() at prog load time but would
require adding tracking in the verifier for whether a dynptr is
read-only or not. Do you think it's better to reject it at load time
instead of returning NULL at runtime?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-29 20:26     ` Joanne Koong
@ 2022-07-29 21:39       ` Martin KaFai Lau
  2022-08-01 17:52         ` Joanne Koong
  2022-08-03 20:29         ` Joanne Koong
  0 siblings, 2 replies; 52+ messages in thread
From: Martin KaFai Lau @ 2022-07-29 21:39 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Fri, Jul 29, 2022 at 01:26:31PM -0700, Joanne Koong wrote:
> On Thu, Jul 28, 2022 at 4:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> > > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> > >       if (bpf_dynptr_is_rdonly(ptr))
> > Is it possible to allow a data slice for a rdonly dynptr-skb,
> > and have it depend on the may_access_direct_pkt_data() check in the verifier?
> 
> Ooh great idea. This should be very simple to do, since the data slice
> that gets returned is assigned as PTR_TO_PACKET. So any stx operations
> on it will by default go through the may_access_direct_pkt_data()
> check. I'll add this for v2.
It will be great.  Out of all three helpers (bpf_dynptr_read/write/data),
bpf_dynptr_data will be the most useful one for parsing header data (e.g. tcp-hdr-opt)
that has a runtime-variable length, because bpf_dynptr_data() can take a non-const
'offset' argument.  It also gives consistent usage across all bpf
prog types, whether they are read-only or read-write on the skb.
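
To make it concrete, a sketch of the kind of parsing loop this enables
(TCPOPT_EOL/TCPOPT_NOP as in include/net/tcp.h; tcp_off and th_len are
assumed to have been computed earlier; note that 'off' is not a
compile-time constant):

    for (off = sizeof(struct tcphdr); off < th_len;) {
            u8 *opt = bpf_dynptr_data(&ptr, tcp_off + off, 2);

            if (!opt || opt[0] == TCPOPT_EOL)
                    break;
            if (opt[0] == TCPOPT_NOP) {
                    off++;
                    continue;
            }
            if (opt[1] < 2)         /* malformed option length */
                    break;
            off += opt[1];
    }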

> 
> >
> > >               return 0;
> > >
> > > +     type = bpf_dynptr_get_type(ptr);
> > > +
> > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > +             struct sk_buff *skb = ptr->data;
> > > +
> > > +             /* if the data is paged, the caller needs to pull it first */
> > > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > > +                     return 0;
> > > +
> > > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > > +     }
> > > +
> > >       return (unsigned long)(ptr->data + ptr->offset + offset);
> > >  }
> >
> > [ ... ]
> >
> > > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > > +                                    struct bpf_call_arg_meta *meta)
> > >  {
> > >       struct bpf_func_state *state = func(env, reg);
> > >       int spi = get_spi(reg->off);
> > >
> > > -     return state->stack[spi].spilled_ptr.id;
> > > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> > >  }
> > >
> > >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                               case DYNPTR_TYPE_RINGBUF:
> > >                                       err_extra = "ringbuf ";
> > >                                       break;
> > > +                             case DYNPTR_TYPE_SKB:
> > > +                                     err_extra = "skb ";
> > > +                                     break;
> > >                               default:
> > >                                       break;
> > >                               }
> > > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> > >                                       return -EFAULT;
> > >                               }
> > > -                             /* Find the id of the dynptr we're tracking the reference of */
> > > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > > +                             /* Find the id and the type of the dynptr we're tracking
> > > +                              * the reference of.
> > > +                              */
> > > +                             stack_slot_get_dynptr_info(env, reg, meta);
> > >                       }
> > >               }
> > >               break;
> > > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> > >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > +             else
> > > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > check_packet_access() uses range.
> > It took me a while to figure out that range and mem_size are in a union.
> > Mentioning it here in case someone has a similar question.
> For v2, I'll add this as a comment in the code or I'll include
> "regs[BPF_REG_0].range = meta.mem_size" explicitly to make it more
> obvious :)
'regs[BPF_REG_0].range = meta.mem_size' would be great.  No strong
opinion here.

> >
> > >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> > >               const struct btf_type *t;
> > > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >                       goto patch_call_imm;
> > >               }
> > >
> > > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > > +                     else
> > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > > +                     insn_buf[1] = *insn;
> > > +                     cnt = 2;
> > > +
> > > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > +                     if (!new_prog)
> > > +                             return -ENOMEM;
> > > +
> > > +                     delta += cnt - 1;
> > > +                     env->prog = new_prog;
> > > +                     prog = new_prog;
> > > +                     insn = new_prog->insnsi + i + delta;
> > > +                     goto patch_call_imm;
> > > +             }
> > Have you considered rejecting bpf_dynptr_write()
> > at prog load time?
> It's possible to reject bpf_dynptr_write() at prog load time but would
> require adding tracking in the verifier for whether a dynptr is
> read-only or not. Do you think it's better to reject it at load time
> instead of returning NULL at runtime?
The check_helper_call above already knows 'meta.type == BPF_DYNPTR_TYPE_SKB'.
Together with may_access_direct_pkt_data(), would that be enough?
Then there would be no need to do the patching for BPF_FUNC_dynptr_from_skb here.

Since we are on bpf_dynptr_write, what is the reason
for limiting it to skb_headlen()?  Not implying one
way is better than another; I would just like to understand the
reasoning behind it since it is not clear in the commit message.
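
For context, the pull-then-retry flow that the helper docs in this
patch describe would look roughly like this sketch:

    err = bpf_dynptr_write(&ptr, off, buf, sizeof(buf), 0);
    if (err == -EAGAIN) {
            /* the write landed in the paged area: linearize, then retry */
            bpf_skb_pull_data(skb, off + sizeof(buf));
            err = bpf_dynptr_write(&ptr, off, buf, sizeof(buf), 0);
    }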

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-29 21:39       ` Martin KaFai Lau
@ 2022-08-01 17:52         ` Joanne Koong
  2022-08-01 19:38           ` Martin KaFai Lau
  2022-08-03 20:29         ` Joanne Koong
  1 sibling, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-01 17:52 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Fri, Jul 29, 2022 at 2:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Fri, Jul 29, 2022 at 01:26:31PM -0700, Joanne Koong wrote:
> > On Thu, Jul 28, 2022 at 4:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> > > > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> > > >       if (bpf_dynptr_is_rdonly(ptr))
> > > Is it possible to allow a data slice for a rdonly dynptr-skb,
> > > and have it depend on the may_access_direct_pkt_data() check in the verifier?
> >
> > Ooh great idea. This should be very simple to do, since the data slice
> > that gets returned is assigned as PTR_TO_PACKET. So any stx operations
> > on it will by default go through the may_access_direct_pkt_data()
> > check. I'll add this for v2.
> It will be great.  Out of all three helpers (bpf_dynptr_read/write/data),
> bpf_dynptr_data will be the most useful one for parsing header data (e.g. tcp-hdr-opt)
> that has a runtime-variable length, because bpf_dynptr_data() can take a non-const
> 'offset' argument.  It also gives consistent usage across all bpf
> prog types, whether they are read-only or read-write on the skb.
>
> >
> > >
> > > >               return 0;
> > > >
> > > > +     type = bpf_dynptr_get_type(ptr);
> > > > +
> > > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > > +             struct sk_buff *skb = ptr->data;
> > > > +
> > > > +             /* if the data is paged, the caller needs to pull it first */
> > > > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > > > +                     return 0;
> > > > +
> > > > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > > > +     }
> > > > +
> > > >       return (unsigned long)(ptr->data + ptr->offset + offset);
> > > >  }
> > >
> > > [ ... ]
> > >
> > > > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > > > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > > > +                                    struct bpf_call_arg_meta *meta)
> > > >  {
> > > >       struct bpf_func_state *state = func(env, reg);
> > > >       int spi = get_spi(reg->off);
> > > >
> > > > -     return state->stack[spi].spilled_ptr.id;
> > > > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> > > >  }
> > > >
> > > >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > >                               case DYNPTR_TYPE_RINGBUF:
> > > >                                       err_extra = "ringbuf ";
> > > >                                       break;
> > > > +                             case DYNPTR_TYPE_SKB:
> > > > +                                     err_extra = "skb ";
> > > > +                                     break;
> > > >                               default:
> > > >                                       break;
> > > >                               }
> > > > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> > > >                                       return -EFAULT;
> > > >                               }
> > > > -                             /* Find the id of the dynptr we're tracking the reference of */
> > > > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > > > +                             /* Find the id and the type of the dynptr we're tracking
> > > > +                              * the reference of.
> > > > +                              */
> > > > +                             stack_slot_get_dynptr_info(env, reg, meta);
> > > >                       }
> > > >               }
> > > >               break;
> > > > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> > > >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> > > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > > > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > +             else
> > > > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > check_packet_access() uses range.
> > > It took me a while to figure out that range and mem_size are in a union.
> > > Mentioning it here in case someone has a similar question.
> > For v2, I'll add this as a comment in the code or I'll include
> > "regs[BPF_REG_0].range = meta.mem_size" explicitly to make it more
> > obvious :)
> 'regs[BPF_REG_0].range = meta.mem_size' would be great.  No strong
> opinion here.
>
> > >
> > > >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> > > >               const struct btf_type *t;
> > > > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > >                       goto patch_call_imm;
> > > >               }
> > > >
> > > > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > > > +                     else
> > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > > > +                     insn_buf[1] = *insn;
> > > > +                     cnt = 2;
> > > > +
> > > > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > > +                     if (!new_prog)
> > > > +                             return -ENOMEM;
> > > > +
> > > > +                     delta += cnt - 1;
> > > > +                     env->prog = new_prog;
> > > > +                     prog = new_prog;
> > > > +                     insn = new_prog->insnsi + i + delta;
> > > > +                     goto patch_call_imm;
> > > > +             }
> > > Have you considered rejecting bpf_dynptr_write()
> > > at prog load time?
> > It's possible to reject bpf_dynptr_write() at prog load time but would
> > require adding tracking in the verifier for whether a dynptr is
> > read-only or not. Do you think it's better to reject it at load time
> > instead of returning NULL at runtime?
> The check_helper_call above seems to know 'meta.type == BPF_DYNPTR_TYPE_SKB'.
> Together with may_access_direct_pkt_data(), would it be enough ?
> Then no need to do patching for BPF_FUNC_dynptr_from_skb here.
Yeah! That would detect it just as well. I'll add this to v2 :)
>
> Since we are on bpf_dynptr_write, what is the reason
> on limiting it to the skb_headlen() ?  Not implying one
> way is better than another.  I would like to understand the reason
> behind it since it is not clear in the commit message.
For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
may be writes that pull the skb, so any existing data slices to the
skb must be invalidated. However, in the verifier we can't detect when
the data slice should be invalidated vs. when it shouldn't (eg
detecting when a write goes into the paged area vs when the write is
only in the head). If the prog wants to write into the paged area, I
think the only way it can work is if it pulls the data first with
bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
the commit message in v2
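
Roughly, the flow for writing into what was the paged area would look
like this (untested sketch; 'off' and 'buf' are illustrative):

        struct bpf_dynptr ptr;

        /* pull everything into the linear (head) area first; without
         * this, a bpf_dynptr_write() past skb_headlen() fails
         */
        if (bpf_skb_pull_data(skb, skb->len))
                return TC_ACT_SHOT;

        bpf_dynptr_from_skb(skb, 0, &ptr);

        /* after the pull, the write can land anywhere within skb->len */
        if (bpf_dynptr_write(&ptr, off, buf, sizeof(buf), 0))
                return TC_ACT_SHOT;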

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-07-26 18:47 ` [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
  2022-07-26 19:44   ` Zvi Effron
@ 2022-08-01 17:58   ` Andrii Nakryiko
  2022-08-02 22:56     ` Joanne Koong
  2022-08-01 19:12   ` Alexei Starovoitov
  2 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-01 17:58 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Test skb and xdp dynptr functionality in the following ways:
>
> 1) progs/test_xdp.c
>    * Change existing test to use dynptrs to parse xdp data
>
>      There were no noticeable differences in user + system time between
>      the original version vs. using dynptrs. Averaging the time for 10
>      runs (run using "time ./test_progs -t xdp_bpf2bpf"):
>          original version: 0.0449 sec
>          with dynptrs: 0.0429 sec
>
> 2) progs/test_l4lb_noinline.c
>    * Change existing test to use dynptrs to parse skb data
>
>      There were no noticeable differences in user + system time between
>      the original version vs. using dynptrs. Averaging the time for 10
>      runs (run using "time ./test_progs -t l4lb_all/l4lb_noinline"):
>          original version: 0.0502 sec
>          with dynptrs: 0.055 sec
>
>      For number of processed verifier instructions:
>          original version: 6284 insns
>          with dynptrs: 2538 insns
>
> 3) progs/test_dynptr_xdp.c
>    * Add sample code for parsing tcp hdr opt lookup using dynptrs.
>      This logic is lifted from a real-world use case of packet parsing in
>      katran [0], a layer 4 load balancer
>
> 4) progs/dynptr_success.c
>    * Add test case "test_skb_readonly" for testing attempts at writes /
>      data slices on a prog type with read-only skb ctx.
>
> 5) progs/dynptr_fail.c
>    * Add test cases "skb_invalid_data_slice" and
>      "xdp_invalid_data_slice" for testing that helpers that modify the
>      underlying packet buffer automatically invalidate the associated
>      data slice.
>    * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
>      that prog types that do not support bpf_dynptr_from_skb/xdp don't
>      have access to the API.
>
> [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  .../testing/selftests/bpf/prog_tests/dynptr.c |  85 ++++++++++---
>  .../selftests/bpf/prog_tests/dynptr_xdp.c     |  49 ++++++++
>  .../testing/selftests/bpf/progs/dynptr_fail.c |  76 ++++++++++++
>  .../selftests/bpf/progs/dynptr_success.c      |  32 +++++
>  .../selftests/bpf/progs/test_dynptr_xdp.c     | 115 ++++++++++++++++++
>  .../selftests/bpf/progs/test_l4lb_noinline.c  |  71 +++++------
>  tools/testing/selftests/bpf/progs/test_xdp.c  |  95 +++++++--------
>  .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
>  8 files changed, 416 insertions(+), 108 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> index bcf80b9f7c27..c40631f33c7b 100644
> --- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
> +++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> @@ -2,6 +2,7 @@
>  /* Copyright (c) 2022 Facebook */
>
>  #include <test_progs.h>
> +#include <network_helpers.h>
>  #include "dynptr_fail.skel.h"
>  #include "dynptr_success.skel.h"
>
> @@ -11,8 +12,8 @@ static char obj_log_buf[1048576];
>  static struct {
>         const char *prog_name;
>         const char *expected_err_msg;
> -} dynptr_tests[] = {
> -       /* failure cases */
> +} verifier_error_tests[] = {
> +       /* these cases should trigger a verifier error */
>         {"ringbuf_missing_release1", "Unreleased reference id=1"},
>         {"ringbuf_missing_release2", "Unreleased reference id=2"},
>         {"ringbuf_missing_release_callback", "Unreleased reference id"},
> @@ -42,11 +43,25 @@ static struct {
>         {"release_twice_callback", "arg 1 is an unacquired reference"},
>         {"dynptr_from_mem_invalid_api",
>                 "Unsupported reg type fp for bpf_dynptr_from_mem data"},
> +       {"skb_invalid_data_slice", "invalid mem access 'scalar'"},
> +       {"xdp_invalid_data_slice", "invalid mem access 'scalar'"},
> +       {"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb"},
> +       {"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp"},
> +};
> +
> +enum test_setup_type {
> +       SETUP_SYSCALL_SLEEP,
> +       SETUP_SKB_PROG,
> +};
>
> -       /* success cases */
> -       {"test_read_write", NULL},
> -       {"test_data_slice", NULL},
> -       {"test_ringbuf", NULL},
> +static struct {
> +       const char *prog_name;
> +       enum test_setup_type type;
> +} runtime_tests[] = {
> +       {"test_read_write", SETUP_SYSCALL_SLEEP},
> +       {"test_data_slice", SETUP_SYSCALL_SLEEP},
> +       {"test_ringbuf", SETUP_SYSCALL_SLEEP},
> +       {"test_skb_readonly", SETUP_SKB_PROG},

nit: wouldn't it be better to add test_setup_type to dynptr_tests (and
keep fail and success cases together)? It's conceivable that you might
want different setups to test different error conditions, right?

>  };
>
>  static void verify_fail(const char *prog_name, const char *expected_err_msg)
> @@ -85,7 +100,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
>         dynptr_fail__destroy(skel);
>  }
>
> -static void verify_success(const char *prog_name)
> +static void run_tests(const char *prog_name, enum test_setup_type setup_type)
>  {
>         struct dynptr_success *skel;
>         struct bpf_program *prog;
> @@ -107,15 +122,42 @@ static void verify_success(const char *prog_name)
>         if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
>                 goto cleanup;
>
> -       link = bpf_program__attach(prog);
> -       if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
> -               goto cleanup;
> +       switch (setup_type) {
> +       case SETUP_SYSCALL_SLEEP:
> +               link = bpf_program__attach(prog);
> +               if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
> +                       goto cleanup;
>
> -       usleep(1);
> +               usleep(1);
>
> -       ASSERT_EQ(skel->bss->err, 0, "err");
> +               bpf_link__destroy(link);
> +               break;
> +       case SETUP_SKB_PROG:
> +       {
> +               int prog_fd, err;
> +               char buf[64];
> +
> +               prog_fd = bpf_program__fd(prog);
> +               if (CHECK_FAIL(prog_fd < 0))

please don't use CHECK and especially CHECK_FAIL

> +                       goto cleanup;
> +
> +               LIBBPF_OPTS(bpf_test_run_opts, topts,
> +                           .data_in = &pkt_v4,
> +                           .data_size_in = sizeof(pkt_v4),
> +                           .data_out = buf,
> +                           .data_size_out = sizeof(buf),
> +                           .repeat = 1,
> +               );

nit: LIBBPF_OPTS declares a variable, so it should be part of the variable
declaration block

>
> -       bpf_link__destroy(link);
> +               err = bpf_prog_test_run_opts(prog_fd, &topts);
> +
> +               if (!ASSERT_OK(err, "test_run"))
> +                       goto cleanup;
> +
> +               break;
> +       }
> +       }
> +       ASSERT_EQ(skel->bss->err, 0, "err");
>
>  cleanup:
>         dynptr_success__destroy(skel);
> @@ -125,14 +167,17 @@ void test_dynptr(void)
>  {
>         int i;
>
> -       for (i = 0; i < ARRAY_SIZE(dynptr_tests); i++) {
> -               if (!test__start_subtest(dynptr_tests[i].prog_name))
> +       for (i = 0; i < ARRAY_SIZE(verifier_error_tests); i++) {
> +               if (!test__start_subtest(verifier_error_tests[i].prog_name))
> +                       continue;
> +
> +               verify_fail(verifier_error_tests[i].prog_name,
> +                           verifier_error_tests[i].expected_err_msg);
> +       }
> +       for (i = 0; i < ARRAY_SIZE(runtime_tests); i++) {
> +               if (!test__start_subtest(runtime_tests[i].prog_name))
>                         continue;
>
> -               if (dynptr_tests[i].expected_err_msg)
> -                       verify_fail(dynptr_tests[i].prog_name,
> -                                   dynptr_tests[i].expected_err_msg);
> -               else
> -                       verify_success(dynptr_tests[i].prog_name);
> +               run_tests(runtime_tests[i].prog_name, runtime_tests[i].type);
>         }
>  }
> diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c b/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> new file mode 100644
> index 000000000000..ca775d126b60
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include "test_dynptr_xdp.skel.h"
> +#include "test_tcp_hdr_options.h"
> +
> +struct test_pkt {
> +       struct ipv6_packet pk6_v6;
> +       u8 options[16];
> +} __packed;
> +
> +void test_dynptr_xdp(void)
> +{
> +       struct test_dynptr_xdp *skel;
> +       char buf[128];
> +       int err;
> +
> +       skel = test_dynptr_xdp__open_and_load();
> +       if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> +               return;
> +
> +       struct test_pkt pkt = {
> +               .pk6_v6.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
> +               .pk6_v6.iph.nexthdr = IPPROTO_TCP,
> +               .pk6_v6.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
> +               .pk6_v6.tcp.urg_ptr = 123,
> +               .pk6_v6.tcp.doff = 9, /* 16 bytes of options */
> +
> +               .options = {
> +                       TCPOPT_MSS, 4, 0x05, 0xB4, TCPOPT_NOP, TCPOPT_NOP,
> +                       skel->rodata->tcp_hdr_opt_kind_tpr, 6, 0, 0, 0, 9, TCPOPT_EOL
> +               },
> +       };
> +
> +       LIBBPF_OPTS(bpf_test_run_opts, topts,
> +                   .data_in = &pkt,
> +                   .data_size_in = sizeof(pkt),
> +                   .data_out = buf,
> +                   .data_size_out = sizeof(buf),
> +                   .repeat = 3,
> +       );
> +

for topts and pkt, they should be up above with other variables
(unless we want to break off from kernel code style, which I don't
think we want)

> +       err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.xdp_ingress_v6), &topts);
> +       ASSERT_OK(err, "ipv6 test_run");
> +       ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
> +       ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval");
> +
> +       test_dynptr_xdp__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
> index c1814938a5fd..4e3f853b2d02 100644
> --- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
> +++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
> @@ -5,6 +5,7 @@
>  #include <string.h>
>  #include <linux/bpf.h>
>  #include <bpf/bpf_helpers.h>
> +#include <linux/if_ether.h>
>  #include "bpf_misc.h"
>
>  char _license[] SEC("license") = "GPL";
> @@ -622,3 +623,78 @@ int dynptr_from_mem_invalid_api(void *ctx)
>
>         return 0;
>  }
> +
> +/* The data slice is invalidated whenever a helper changes packet data */
> +SEC("?tc")
> +int skb_invalid_data_slice(struct __sk_buff *skb)
> +{
> +       struct bpf_dynptr ptr;
> +       struct ethhdr *hdr;
> +
> +       bpf_dynptr_from_skb(skb, 0, &ptr);
> +       hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
> +       if (!hdr)
> +               return SK_DROP;
> +
> +       hdr->h_proto = 12;
> +
> +       if (bpf_skb_pull_data(skb, skb->len))
> +               return SK_DROP;
> +
> +       /* this should fail */
> +       hdr->h_proto = 1;
> +
> +       return SK_PASS;
> +}
> +
> +/* The data slice is invalidated whenever a helper changes packet data */
> +SEC("?xdp")
> +int xdp_invalid_data_slice(struct xdp_md *xdp)
> +{
> +       struct bpf_dynptr ptr;
> +       struct ethhdr *hdr1, *hdr2;
> +
> +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> +       hdr1 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr1));
> +       if (!hdr1)
> +               return XDP_DROP;
> +
> +       hdr2 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr2));
> +       if (!hdr2)
> +               return XDP_DROP;
> +
> +       hdr1->h_proto = 12;
> +       hdr2->h_proto = 12;
> +
> +       if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr1)))
> +               return XDP_DROP;
> +
> +       /* this should fail */
> +       hdr2->h_proto = 1;

is there something special about having both hdr1 and hdr2? Wouldn't
this test work with just a single hdr pointer?

> +
> +       return XDP_PASS;
> +}
> +
> +/* Only supported prog type can create skb-type dynptrs */

[...]

> +       err = 1;
> +
> +       if (bpf_dynptr_from_skb(ctx, 0, &ptr))
> +               return 0;
> +       err++;
> +
> +       data = bpf_dynptr_data(&ptr, 0, 1);
> +       if (data)
> +               /* it's an error if data is not NULL since cgroup skbs
> +                * are read only
> +                */
> +               return 0;
> +       err++;
> +
> +       ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
> +       /* since cgroup skbs are read only, writes should fail */
> +       if (ret != -EINVAL)
> +               return 0;
> +
> +       err = 0;

hm, if data is NULL you'll still report success if bpf_dynptr_write
returns 0 or any other error but -EINVAL... The logic is a bit unclear
here...

> +
> +       return 0;
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> new file mode 100644
> index 000000000000..c879dfb6370a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> @@ -0,0 +1,115 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/* This logic is lifted from a real-world use case of packet parsing, used in
> + * the open source library katran, a layer 4 load balancer.
> + *
> + * This test demonstrates how to parse packet contents using dynptrs.
> + *
> + * https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
> + */
> +
> +#include <string.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +#include <linux/tcp.h>
> +#include <stdbool.h>
> +#include <linux/ipv6.h>
> +#include <linux/if_ether.h>
> +#include "test_tcp_hdr_options.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +/* Arbitrarily picked unused value from IANA TCP Option Kind Numbers */
> +const __u32 tcp_hdr_opt_kind_tpr = 0xB7;
> +/* Length of the tcp header option */
> +const __u32 tcp_hdr_opt_len_tpr = 6;
> +/* maximum number of header options to check to lookup server_id */
> +const __u32 tcp_hdr_opt_max_opt_checks = 15;
> +
> +__u32 server_id;
> +
> +static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining,
> +                        __u32 *server_id)
> +{
> +       __u8 *tcp_opt, kind, hdr_len;
> +       __u8 *data;
> +
> +       data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) +
> +                              sizeof(*server_id));
> +       if (!data)
> +               return -1;
> +
> +       kind = data[0];
> +
> +       if (kind == TCPOPT_EOL)
> +               return -1;
> +
> +       if (kind == TCPOPT_NOP) {
> +               *off += 1;
> +               /* continue to the next option */
> +               *hdr_bytes_remaining -= 1;
> +
> +               return 0;
> +       }
> +
> +       if (*hdr_bytes_remaining < 2)
> +               return -1;
> +
> +       hdr_len = data[1];
> +       if (hdr_len > *hdr_bytes_remaining)
> +               return -1;
> +
> +       if (kind == tcp_hdr_opt_kind_tpr) {
> +               if (hdr_len != tcp_hdr_opt_len_tpr)
> +                       return -1;
> +
> +               memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));

this implicitly relies on the compiler inlining memcpy, let's use
__builtin_memcpy() here instead to set a good example?

> +               return 1;
> +       }
> +
> +       *off += hdr_len;
> +       *hdr_bytes_remaining -= hdr_len;
> +
> +       return 0;
> +}
> +
> +SEC("xdp")
> +int xdp_ingress_v6(struct xdp_md *xdp)
> +{
> +       __u8 hdr_bytes_remaining;
> +       struct tcphdr *tcp_hdr;
> +       __u8 tcp_hdr_opt_len;
> +       int err = 0;
> +       __u32 off;
> +
> +       struct bpf_dynptr ptr;
> +
> +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> +
> +       off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
> +
> +       tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr));
> +       if (!tcp_hdr)
> +               return XDP_DROP;
> +
> +       tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
> +       if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
> +               return XDP_DROP;
> +
> +       hdr_bytes_remaining = tcp_hdr_opt_len;
> +
> +       off += sizeof(struct tcphdr);
> +
> +       /* max number of bytes of options in tcp header is 40 bytes */
> +       for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
> +               err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id);
> +
> +               if (err || !hdr_bytes_remaining)
> +                       break;
> +       }
> +
> +       if (!server_id)
> +               return XDP_DROP;
> +
> +       return XDP_PASS;
> +}

I'm not a networking BPF expert, but the logic of packet parsing here
looks pretty clean! Would it be possible to also backport original
code with data and data_end, both for testing but also to be able to
compare and contrast dynptr vs data/data_end approaches?
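
For comparison, the same tcp_hdr access in data/data_end style would be
roughly (sketch):

        void *data = (void *)(long)xdp->data;
        void *data_end = (void *)(long)xdp->data_end;
        struct tcphdr *tcp_hdr;

        tcp_hdr = data + sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
        /* every access needs an explicit bounds check against data_end */
        if ((void *)(tcp_hdr + 1) > data_end)
                return XDP_DROP;

where a non-NULL return from bpf_dynptr_data() replaces the manual
data_end comparison.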


> diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> index c8bc0c6947aa..1fef7868ea8b 100644
> --- a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> +++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> @@ -230,21 +230,18 @@ static __noinline bool get_packet_dst(struct real_definition **real,
>         return true;
>  }
>
> -static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
> +static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off,
>                                    struct packet_description *pckt)
>  {
>         struct icmp6hdr *icmp_hdr;
>         struct ipv6hdr *ip6h;
>
> -       icmp_hdr = data + off;
> -       if (icmp_hdr + 1 > data_end)
> +       icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*ip6h));
> +       if (!icmp_hdr)
>                 return TC_ACT_SHOT;
>         if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
>                 return TC_ACT_OK;

previously you could still return TC_ACT_OK if it's not ICMPV6_PKT_TOOBIG even if
the packet size is < sizeof(*icmp_hdr) + sizeof(*ip6h), which might have
been a bug, but the current logic will enforce that the packet is at least
sizeof(*icmp_hdr) + sizeof(*ip6h). Is that a problem?

> -       off += sizeof(struct icmp6hdr);
> -       ip6h = data + off;
> -       if (ip6h + 1 > data_end)
> -               return TC_ACT_SHOT;
> +       ip6h = (struct ipv6hdr *)(icmp_hdr + 1);
>         pckt->proto = ip6h->nexthdr;
>         pckt->flags |= F_ICMP;
>         memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16);
> @@ -252,22 +249,19 @@ static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
>         return TC_ACT_UNSPEC;
>  }
>
> -static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
> +static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off,
>                                  struct packet_description *pckt)
>  {
>         struct icmphdr *icmp_hdr;
>         struct iphdr *iph;
>
> -       icmp_hdr = data + off;
> -       if (icmp_hdr + 1 > data_end)
> +       icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*iph));
> +       if (!icmp_hdr)
>                 return TC_ACT_SHOT;
>         if (icmp_hdr->type != ICMP_DEST_UNREACH ||
>             icmp_hdr->code != ICMP_FRAG_NEEDED)
>                 return TC_ACT_OK;

similarly here, short packets can still be TC_ACT_OK in some
circumstances, while with dynptr they will be shot down early on. Not
saying this is wrong or bad, just bringing this up for you and others
to chime in if it's an ok change

> -       off += sizeof(struct icmphdr);
> -       iph = data + off;
> -       if (iph + 1 > data_end)
> -               return TC_ACT_SHOT;
> +       iph = (struct iphdr *)(icmp_hdr + 1);
>         if (iph->ihl != 5)
>                 return TC_ACT_SHOT;
>         pckt->proto = iph->protocol;
> @@ -277,13 +271,13 @@ static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
>         return TC_ACT_UNSPEC;
>  }

[...]

> -static __always_inline int handle_ipv4(struct xdp_md *xdp)
> +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
>  {
> -       void *data_end = (void *)(long)xdp->data_end;
> -       void *data = (void *)(long)xdp->data;
> +       struct bpf_dynptr new_xdp_ptr;
>         struct iptnl_info *tnl;
>         struct ethhdr *new_eth;
>         struct ethhdr *old_eth;
> -       struct iphdr *iph = data + sizeof(struct ethhdr);
> +       struct iphdr *iph;
>         __u16 *next_iph;
>         __u16 payload_len;
>         struct vip vip = {};
> @@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
>         __u32 csum = 0;
>         int i;
>
> -       if (iph + 1 > data_end)
> +       iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> +                             iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));

tcphdr_sz (20) is always bigger than udphdr_sz (8), so just use the
bigger one here? Though again, for UDP packet it might be a bit too
pessimistic to reject small packets?

> +       if (!iph)
>                 return XDP_DROP;
>
> -       dport = get_dport(iph + 1, data_end, iph->protocol);
> +       dport = get_dport(iph + 1, iph->protocol);
>         if (dport == -1)
>                 return XDP_DROP;

[...]

> -static __always_inline int handle_ipv6(struct xdp_md *xdp)
> +static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
>  {
> -       void *data_end = (void *)(long)xdp->data_end;
> -       void *data = (void *)(long)xdp->data;
> +       struct bpf_dynptr new_xdp_ptr;
>         struct iptnl_info *tnl;
>         struct ethhdr *new_eth;
>         struct ethhdr *old_eth;
> -       struct ipv6hdr *ip6h = data + sizeof(struct ethhdr);
> +       struct ipv6hdr *ip6h;
>         __u16 payload_len;
>         struct vip vip = {};
>         int dport;
>
> -       if (ip6h + 1 > data_end)
> +       ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> +                              ipv6hdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));

ditto, there is no dynamism here, verifier actually enforces that this
value is statically known, I think this example will create false
assumptions if written this way

> +       if (!ip6h)
>                 return XDP_DROP;
>
> -       dport = get_dport(ip6h + 1, data_end, ip6h->nexthdr);
> +       dport = get_dport(ip6h + 1, ip6h->nexthdr);
>         if (dport == -1)
>                 return XDP_DROP;
>

[...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-07-26 18:47 ` [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
  2022-07-26 19:44   ` Zvi Effron
  2022-08-01 17:58   ` Andrii Nakryiko
@ 2022-08-01 19:12   ` Alexei Starovoitov
  2022-08-02 22:21     ` Joanne Koong
  2 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-08-01 19:12 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
>
> -static __always_inline int handle_ipv4(struct xdp_md *xdp)
> +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
>  {
> -       void *data_end = (void *)(long)xdp->data_end;
> -       void *data = (void *)(long)xdp->data;
> +       struct bpf_dynptr new_xdp_ptr;
>         struct iptnl_info *tnl;
>         struct ethhdr *new_eth;
>         struct ethhdr *old_eth;
> -       struct iphdr *iph = data + sizeof(struct ethhdr);
> +       struct iphdr *iph;
>         __u16 *next_iph;
>         __u16 payload_len;
>         struct vip vip = {};
> @@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
>         __u32 csum = 0;
>         int i;
>
> -       if (iph + 1 > data_end)
> +       iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> +                             iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
> +       if (!iph)
>                 return XDP_DROP;

dynptr based xdp/skb access looks neat.
Maybe in addition to bpf_dynptr_data() we can add helper(s)
that return skb/xdp_md from dynptr?
This way the code will be passing dynptr only and there will
be no need to pass around 'struct xdp_md *xdp' (like this function).

Separately please keep the existing tests instead of converting them.
Either ifdef data/data_end vs dynptr style or copy paste
the whole test into a new .c file. Whichever is cleaner.
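
The ifdef style could look roughly like this (illustrative sketch;
USE_DYNPTR is just a placeholder name):

#ifdef USE_DYNPTR
        iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz, iphdr_sz);
        if (!iph)
                return XDP_DROP;
#else
        iph = data + ethhdr_sz;
        if (iph + 1 > data_end)
                return XDP_DROP;
#endif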

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 17:52         ` Joanne Koong
@ 2022-08-01 19:38           ` Martin KaFai Lau
  2022-08-01 21:16             ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-01 19:38 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > Since we are on bpf_dynptr_write, what is the reason
> > on limiting it to the skb_headlen() ?  Not implying one
> > way is better than another.  I would like to understand the reason
> > behind it since it is not clear in the commit message.
> For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> may be writes that pull the skb, so any existing data slices to the
> skb must be invalidated. However, in the verifier we can't detect when
> the data slice should be invalidated vs. when it shouldn't (eg
> detecting when a write goes into the paged area vs when the write is
> only in the head). If the prog wants to write into the paged area, I
> think the only way it can work is if it pulls the data first with
> bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> the commit message in v2
Note that the current verifier unconditionally invalidates PTR_TO_PACKET
after bpf_skb_store_bytes().  Potentially the same could be done for
other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
behavior cannot be changed later, so I want to raise this possibility here
just in case it wasn't considered before.

Thinking from the existing bpf_skb_{load,store}_bytes() and skb->data perspective.
If the user changes the skb by directly using skb->data to avoid calling
load_bytes()/store_bytes(), the user will do the necessary bpf_skb_pull_data()
before reading/writing the skb->data.  If load_bytes()+store_bytes() is used instead,
it would be hard to reason why the earlier bpf_skb_load_bytes() can load a particular
byte but [may] need to make an extra bpf_skb_pull_data() call before it can use
bpf_skb_store_bytes() to store a modified byte at the same offset.
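
e.g. this pattern works today at any in-bounds 'off', whether the byte
sits in the head or in a frag (sketch):

        __u8 byte;

        if (bpf_skb_load_bytes(skb, off, &byte, 1))
                return TC_ACT_SHOT;
        byte ^= 0x1;    /* modify the loaded byte */
        if (bpf_skb_store_bytes(skb, off, &byte, 1, 0))
                return TC_ACT_SHOT;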

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 19:38           ` Martin KaFai Lau
@ 2022-08-01 21:16             ` Joanne Koong
  2022-08-01 22:14               ` Andrii Nakryiko
  2022-08-01 22:32               ` Martin KaFai Lau
  0 siblings, 2 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-01 21:16 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > Since we are on bpf_dynptr_write, what is the reason
> > > on limiting it to the skb_headlen() ?  Not implying one
> > > way is better than another.  I would like to understand the reason
> > > behind it since it is not clear in the commit message.
> > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > may be writes that pull the skb, so any existing data slices to the
> > skb must be invalidated. However, in the verifier we can't detect when
> > the data slice should be invalidated vs. when it shouldn't (eg
> > detecting when a write goes into the paged area vs when the write is
> > only in the head). If the prog wants to write into the paged area, I
> > think the only way it can work is if it pulls the data first with
> > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > the commit message in v2
> Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> after bpf_skb_store_bytes().  Potentially the same could be done for
> other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> behavior cannot be changed later, so I want to raise this possibility here
> just in case it wasn't considered before.

Thanks for raising this possibility. To me, it seems more intuitive
from the user standpoint to have bpf_dynptr_write() on a paged area
fail (even if bpf_dynptr_read() on that same offset succeeds) than to
have bpf_dynptr_write() always invalidate all dynptr slices related to
that skb. I think most writes will be to the data in the head area,
which seems unfortunate that bpf_dynptr_writes to the head area would
invalidate the dynptr slices regardless.

What are your thoughts? Do you think you prefer having
bpf_dynptr_write() always work regardless of where the data is? If so,
I'm happy to make that change for v2 :)

>
> Thinking from the existing bpf_skb_{load,store}_bytes() and skb->data perspective.
> If the user changes the skb by directly using skb->data to avoid calling
> load_bytes()/store_bytes(), the user will do the necessary bpf_skb_pull_data()
> before reading/writing the skb->data.  If load_bytes()+store_bytes() is used instead,
> it would be hard to reason why the earlier bpf_skb_load_bytes() can load a particular
> byte but [may] need to make an extra bpf_skb_pull_data() call before it can use
> bpf_skb_store_bytes() to store a modified byte at the same offset.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
                     ` (2 preceding siblings ...)
  2022-07-28 23:39   ` Martin KaFai Lau
@ 2022-08-01 22:11   ` Andrii Nakryiko
  2022-08-02  0:15     ` Joanne Koong
  2022-08-01 23:33   ` Jakub Kicinski
  2022-08-03  6:37   ` Martin KaFai Lau
  5 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-01 22:11 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).
>
> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only (writes and data slices are not permitted). For reads on the
> dynptr, this includes reading into data in the non-linear paged buffers
> but for writes and data slices, if the data is in a paged buffer, the
> user must first call bpf_skb_pull_data to pull the data into the linear
> portion.
>
> Additionally, any helper calls that change the underlying packet buffer
> (eg bpf_skb_pull_data) invalidates any data slices of the associated
> dynptr.
>
> Right now, skb dynptrs can only be constructed from skbs that are
> the bpf program context - as such, there does not need to be any
> reference tracking or release on skb dynptrs.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            |  8 ++++-
>  include/linux/filter.h         |  4 +++
>  include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
>  kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
>  kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
>  net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
>  tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
>  7 files changed, 229 insertions(+), 17 deletions(-)
>

[...]

> +       type = bpf_dynptr_get_type(dst);
> +
> +       if (flags) {
> +               if (type == BPF_DYNPTR_TYPE_SKB) {
> +                       if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
> +                               return -EINVAL;
> +               } else {
> +                       return -EINVAL;
> +               }
> +       }
> +
> +       if (type == BPF_DYNPTR_TYPE_SKB) {
> +               struct sk_buff *skb = dst->data;
> +
> +               /* if the data is paged, the caller needs to pull it first */
> +               if (dst->offset + offset + len > skb->len - skb->data_len)
> +                       return -EAGAIN;
> +
> +               return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
> +                                            flags);
> +       }

It seems like it would be cleaner to have a switch per dynptr type and
each case doing its extra error checking (like CSUM and HASH flags for
TYPE_SKB) and then performing the write operation.


memcpy can be either a catch-all default case, or perhaps it's safer
to explicitly list TYPE_LOCAL and TYPE_RINGBUF to do memcpy, and then
default should WARN() and return error?
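
i.e. something like (untested):

        switch (type) {
        case BPF_DYNPTR_TYPE_SKB:
                if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
                        return -EINVAL;
                /* ... skb bounds check + __bpf_skb_store_bytes() ... */
                break;
        case BPF_DYNPTR_TYPE_LOCAL:
        case BPF_DYNPTR_TYPE_RINGBUF:
                if (flags)
                        return -EINVAL;
                memcpy(dst->data + dst->offset + offset, src, len);
                break;
        default:
                WARN_ONCE(1, "bpf: unknown dynptr type\n");
                return -EFAULT;
        }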

> +
>         memcpy(dst->data + dst->offset + offset, src, len);
>
>         return 0;
> @@ -1555,6 +1594,7 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
>
>  BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
>  {
> +       enum bpf_dynptr_type type;
>         int err;
>
>         if (!ptr->data)
> @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>         if (bpf_dynptr_is_rdonly(ptr))
>                 return 0;
>
> +       type = bpf_dynptr_get_type(ptr);
> +
> +       if (type == BPF_DYNPTR_TYPE_SKB) {
> +               struct sk_buff *skb = ptr->data;
> +
> +               /* if the data is paged, the caller needs to pull it first */
> +               if (ptr->offset + offset + len > skb->len - skb->data_len)
> +                       return 0;
> +
> +               return (unsigned long)(skb->data + ptr->offset + offset);
> +       }
> +
>         return (unsigned long)(ptr->data + ptr->offset + offset);

Similarly, all these dynptr helpers effectively dispatch different
implementations based on dynptr type. I think switch is most
appropriate for this.

>  }
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 0d523741a543..0838653eeb4e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -263,6 +263,7 @@ struct bpf_call_arg_meta {
>         u32 subprogno;
>         struct bpf_map_value_off_desc *kptr_off_desc;
>         u8 uninit_dynptr_regno;
> +       enum bpf_dynptr_type type;
>  };
>

[...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 21:16             ` Joanne Koong
@ 2022-08-01 22:14               ` Andrii Nakryiko
  2022-08-01 22:32               ` Martin KaFai Lau
  1 sibling, 0 replies; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-01 22:14 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Martin KaFai Lau, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Mon, Aug 1, 2022 at 2:16 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > Since we are on bpf_dynptr_write, what is the reason
> > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > way is better than another.  I would like to understand the reason
> > > > behind it since it is not clear in the commit message.
> > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > may be writes that pull the skb, so any existing data slices to the
> > > skb must be invalidated. However, in the verifier we can't detect when
> > > the data slice should be invalidated vs. when it shouldn't (eg
> > > detecting when a write goes into the paged area vs when the write is
> > > only in the head). If the prog wants to write into the paged area, I
> > > think the only way it can work is if it pulls the data first with
> > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > the commit message in v2
> > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > after bpf_skb_store_bytes().  Potentially the same could be done for
> > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > behavior cannot be changed later, so I want to raise this possibility here
> > just in case it wasn't considered before.
>
> Thanks for raising this possibility. To me, it seems more intuitive
> from the user standpoint to have bpf_dynptr_write() on a paged area
> fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> have bpf_dynptr_write() always invalidate all dynptr slices related to
> that skb. I think most writes will be to the data in the head area,
> which seems unfortunate that bpf_dynptr_writes to the head area would
> invalidate the dynptr slices regardless.

+1. Given bpf_skb_store_bytes() is a more powerful superset of
bpf_dynptr_write(), I'd keep bpf_dynptr_write() in such a form as to
play nicely with bpf_dynptr_data() pointers.

>
> What are your thoughts? Do you think you prefer having
> bpf_dynptr_write() always work regardless of where the data is? If so,
> I'm happy to make that change for v2 :)
>
> >
> > Thinking from the existing bpf_skb_{load,store}_bytes() and skb->data perspective.
> > If the user changes the skb by directly using skb->data to avoid calling
> > load_bytes()/store_bytes(), the user will do the necessary bpf_skb_pull_data()
> > before reading/writing the skb->data.  If load_bytes()+store_bytes() is used instead,
> > it would be hard to reason why the earlier bpf_skb_load_bytes() can load a particular
> > byte but [may] need to make an extra bpf_skb_pull_data() call before it can use
> > bpf_skb_store_bytes() to store a modified byte at the same offset.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 21:16             ` Joanne Koong
  2022-08-01 22:14               ` Andrii Nakryiko
@ 2022-08-01 22:32               ` Martin KaFai Lau
  2022-08-01 22:58                 ` Andrii Nakryiko
  1 sibling, 1 reply; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-01 22:32 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > Since we are on bpf_dynptr_write, what is the reason
> > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > way is better than another.  I would like to understand the reason
> > > > behind it since it is not clear in the commit message.
> > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > may be writes that pull the skb, so any existing data slices to the
> > > skb must be invalidated. However, in the verifier we can't detect when
> > > the data slice should be invalidated vs. when it shouldn't (eg
> > > detecting when a write goes into the paged area vs when the write is
> > > only in the head). If the prog wants to write into the paged area, I
> > > think the only way it can work is if it pulls the data first with
> > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > the commit message in v2
> > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > after bpf_skb_store_bytes().  Potentially the same could be done for
> > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > behavior cannot be changed later, so I want to raise this possibility here
> > just in case it wasn't considered before.
> 
> Thanks for raising this possibility. To me, it seems more intuitive
> from the user standpoint to have bpf_dynptr_write() on a paged area
> fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> have bpf_dynptr_write() always invalidate all dynptr slices related to
> that skb. I think most writes will be to the data in the head area,
> which seems unfortunate that bpf_dynptr_writes to the head area would
> invalidate the dynptr slices regardless.
> 
> What are your thoughts? Do you think you prefer having
> bpf_dynptr_write() always work regardless of where the data is? If so,
> I'm happy to make that change for v2 :)
Yeah, it sounds like an optimization to avoid unnecessarily
invalidating the sliced data.

To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo will
be used, considering there is usually a pkt read before a pkt write in
the pkt modification use case.  If I got far enough to have a sliced data pointer
that satisfies what I need for reading, I would try to avoid making an extra call
to dynptr_write() to modify it.

I would prefer the user to have similar expectations (no need to worry about
pkt layout) between dynptr_read() and dynptr_write(), and also a similar
experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().  Otherwise, it is
just an unnecessary rule for the user to remember while there is no clear
benefit given how rarely this optimization would matter.

I won't insist though.  User can always stay with the bpf_skb_load_bytes()
and bpf_skb_store_bytes() to avoid worrying about the skb layout.

> >
> > Thinking from the existing bpf_skb_{load,store}_bytes() and skb->data perspective.
> > If the user changes the skb by directly using skb->data to avoid calling
> > load_bytes()/store_bytes(), the user will do the necessary bpf_skb_pull_data()
> > before reading/writing the skb->data.  If load_bytes()+store_bytes() is used instead,
> > it would be hard to reason why the earlier bpf_skb_load_bytes() can load a particular
> > byte but [may] need to make an extra bpf_skb_pull_data() call before it can use
> > bpf_skb_store_bytes() to store a modified byte at the same offset.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 22:32               ` Martin KaFai Lau
@ 2022-08-01 22:58                 ` Andrii Nakryiko
  2022-08-01 23:23                   ` Martin KaFai Lau
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-01 22:58 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 1, 2022 at 3:33 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> > On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > > Since we are on bpf_dynptr_write, what is the reason
> > > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > > way is better than another.  I would like to understand the reason
> > > > > behind it since it is not clear in the commit message.
> > > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > > may be writes that pull the skb, so any existing data slices to the
> > > > skb must be invalidated. However, in the verifier we can't detect when
> > > > the data slice should be invalidated vs. when it shouldn't (eg
> > > > detecting when a write goes into the paged area vs when the write is
> > > > only in the head). If the prog wants to write into the paged area, I
> > > > think the only way it can work is if it pulls the data first with
> > > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > > the commit message in v2
> > > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > > after bpf_skb_store_bytes().  Potentially the same could be done for
> > > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > > behavior cannot be changed later, so I want to raise this possibility here
> > > just in case it wasn't considered before.
> >
> > Thanks for raising this possibility. To me, it seems more intuitive
> > from the user standpoint to have bpf_dynptr_write() on a paged area
> > fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> > have bpf_dynptr_write() always invalidate all dynptr slices related to
> > that skb. I think most writes will be to the data in the head area,
> > which seems unfortunate that bpf_dynptr_writes to the head area would
> > invalidate the dynptr slices regardless.
> >
> > What are your thoughts? Do you think you prefer having
> > bpf_dynptr_write() always work regardless of where the data is? If so,
> > I'm happy to make that change for v2 :)
> Yeah, it sounds like an optimization to avoid unnecessarily
> invalidating the sliced data.
>
> To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo will
> be used, considering there is usually a pkt read before a pkt write in
> the pkt modification use case.  If I got far enough to have a sliced data pointer
> that satisfies what I need for reading, I would try to avoid making an extra call
> to dynptr_write() to modify it.
>
> I would prefer the user to have similar expectations (no need to worry about
> pkt layout) between dynptr_read() and dynptr_write(), and also a similar
> experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().  Otherwise, it is
> just an unnecessary rule for the user to remember while there is no clear
> benefit given how rarely this optimization would matter.
>

Are you saying that bpf_dynptr_read() shouldn't read from non-linear
part of skb (and thus match more restrictive bpf_dynptr_write), or are
you saying you'd rather have bpf_dynptr_write() write into non-linear
part but invalidate bpf_dynptr_data() pointers?

I guess I agree about consistency and that it seems like in practice
you'd use bpf_dynptr_data() to work with headers and stuff like that
at known locations, and then if you need to modify the rest of payload
you'd do either bpf_skb_load_bytes()/bpf_skb_store_bytes() or
bpf_dynptr_read()/bpf_dynptr_write() which would invalidate
bpf_dynptr_data() pointers (but that would be ok by that time).


> I won't insist though.  User can always stay with the bpf_skb_load_bytes()
> and bpf_skb_store_bytes() to avoid worrying about the skb layout.
>
> > >
> > > Thinking from the existing bpf_skb_{load,store}_bytes() and skb->data perspective.
> > > If the user changes the skb by directly using skb->data to avoid calling
> > > load_bytes()/store_bytes(), the user will do the necessary bpf_skb_pull_data()
> > > before reading/writing the skb->data.  If load_bytes()+store_bytes() is used instead,
> > > it would be hard to reason why the earlier bpf_skb_load_bytes() can load a particular
> > > byte but [may] need to make an extra bpf_skb_pull_data() call before it can use
> > > bpf_skb_store_bytes() to store a modified byte at the same offset.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 22:58                 ` Andrii Nakryiko
@ 2022-08-01 23:23                   ` Martin KaFai Lau
  2022-08-02  0:56                     ` Martin KaFai Lau
  0 siblings, 1 reply; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-01 23:23 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 01, 2022 at 03:58:41PM -0700, Andrii Nakryiko wrote:
> On Mon, Aug 1, 2022 at 3:33 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> > > On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > > > Since we are on bpf_dynptr_write, what is the reason
> > > > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > > > way is better than another.  I would like to understand the reason
> > > > > > behind it since it is not clear in the commit message.
> > > > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > > > may be writes that pull the skb, so any existing data slices to the
> > > > > skb must be invalidated. However, in the verifier we can't detect when
> > > > > the data slice should be invalidated vs. when it shouldn't (eg
> > > > > detecting when a write goes into the paged area vs when the write is
> > > > > only in the head). If the prog wants to write into the paged area, I
> > > > > think the only way it can work is if it pulls the data first with
> > > > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > > > the commit message in v2
> > > > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > > > after bpf_skb_store_bytes().  Potentially the same could be done for
> > > > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > > > behavior cannot be changed later, so I want to raise this possibility here
> > > > just in case it wasn't considered before.
> > >
> > > Thanks for raising this possibility. To me, it seems more intuitive
> > > from the user standpoint to have bpf_dynptr_write() on a paged area
> > > fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> > > have bpf_dynptr_write() always invalidate all dynptr slices related to
> > > that skb. I think most writes will be to the data in the head area,
> > > which seems unfortunate that bpf_dynptr_writes to the head area would
> > > invalidate the dynptr slices regardless.
> > >
> > > What are your thoughts? Do you think you prefer having
> > > bpf_dynptr_write() always work regardless of where the data is? If so,
> > > I'm happy to make that change for v2 :)
> > Yeah, it sounds like an optimization to avoid unnecessarily
> > invalidating the sliced data.
> >
> > To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo will
> > be used, considering there is usually a pkt read before a pkt write in
> > the pkt modification use case.  If I got far enough to have a sliced data pointer
> > that satisfies what I need for reading, I would try to avoid making an extra call
> > to dynptr_write() to modify it.
> >
> > I would prefer the user to have similar expectations (no need to worry about
> > pkt layout) between dynptr_read() and dynptr_write(), and also a similar
> > experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().  Otherwise, it is
> > just an unnecessary rule for the user to remember while there is no clear
> > benefit given how rarely this optimization would matter.
> >
> 
> Are you saying that bpf_dynptr_read() shouldn't read from non-linear
> part of skb (and thus match more restrictive bpf_dynptr_write), or are
> you saying you'd rather have bpf_dynptr_write() write into non-linear
> part but invalidate bpf_dynptr_data() pointers?
The latter.  Read and write without worrying about the skb layout.

Also, if the prog needs to call a helper to write, it knows the bytes are
not in the data pointer.  Then it needs to bpf_skb_pull_data() before
it can call write.  However, after bpf_skb_pull_data(), why would the prog
need to call the write helper instead of directly getting a new
data pointer and writing to it?  If the prog needs to write many
bytes, a write helper may then help.
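
For illustration, a minimal sketch (not from the patchset) of the
pull-then-direct-write pattern described above, assuming the
bpf_dynptr_from_skb()/bpf_dynptr_data() API proposed in this series;
the 100-byte pull length is made up:

/* Sketch only: after bpf_skb_pull_data(), the target byte is in the
 * linear area, so a fresh data slice can be written to directly
 * instead of going through a write helper.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int write_after_pull(struct __sk_buff *skb)
{
        struct bpf_dynptr ptr;
        __u8 *byte;

        /* linearize the first 100 bytes so they land in the head */
        if (bpf_skb_pull_data(skb, 100))
                return SK_DROP;

        bpf_dynptr_from_skb(skb, 0, &ptr);

        byte = bpf_dynptr_data(&ptr, 99, 1);
        if (!byte)
                return SK_DROP;
        *byte = 0xab;   /* direct write, no helper call */

        return SK_PASS;
}

char _license[] SEC("license") = "GPL";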

> 
> I guess I agree about consistency and that it seems like in practice
> you'd use bpf_dynptr_data() to work with headers and stuff like that
> at known locations, and then if you need to modify the rest of payload
> you'd do either bpf_skb_load_bytes()/bpf_skb_store_bytes() or
> bpf_dynptr_read()/bpf_dynptr_write() which would invalidate
> bpf_dynptr_data() pointers (but that would be ok by that time).
imo, read, write and then go back to read is less common.
writing bytes without first reading them is also less common.

> 
> 
> > I won't insist though.  User can always stay with the bpf_skb_load_bytes()
> > and bpf_skb_store_bytes() to avoid worrying about the skb layout.
> >
> > > >
> > > > Thinking from the existing bpf_skb_{load,store}_bytes() and skb->data perspective.
> > > > If the user changes the skb by directly using skb->data to avoid calling
> > > > load_bytes()/store_bytes(), the user will do the necessary bpf_skb_pull_data()
> > > > before reading/writing the skb->data.  If load_bytes()+store_bytes() is used instead,
> > > > it would be hard to reason why the earlier bpf_skb_load_bytes() can load a particular
> > > > byte but [may] need to make an extra bpf_skb_pull_data() call before it can use
> > > > bpf_skb_store_bytes() to store a modified byte at the same offset.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
                     ` (3 preceding siblings ...)
  2022-08-01 22:11   ` Andrii Nakryiko
@ 2022-08-01 23:33   ` Jakub Kicinski
  2022-08-02  2:12     ` Joanne Koong
  2022-08-03  6:37   ` Martin KaFai Lau
  5 siblings, 1 reply; 52+ messages in thread
From: Jakub Kicinski @ 2022-08-01 23:33 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

(consider cross-posting network-related stuff to netdev@)

On Tue, 26 Jul 2022 11:47:04 -0700 Joanne Koong wrote:
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).

Is there really a need for dynptr_from_{skb,xdp} to be different
function IDs? I was hoping this work would improve portability of
networking BPF programs across the hooks.

> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only (writes and data slices are not permitted). For reads on the
> dynptr, this includes reading into data in the non-linear paged buffers
> but for writes and data slices, if the data is in a paged buffer, the
> user must first call bpf_skb_pull_data to pull the data into the linear
> portion.
> 
> Additionally, any helper calls that change the underlying packet buffer
> (eg bpf_skb_pull_data) invalidates any data slices of the associated
> dynptr.

Grepping the verifier did not help me find that, would you mind
pointing me to the code?

> Right now, skb dynptrs can only be constructed from skbs that are
> the bpf program context - as such, there does not need to be any
> reference tracking or release on skb dynptrs.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 22:11   ` Andrii Nakryiko
@ 2022-08-02  0:15     ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-02  0:15 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, andrii, daniel, ast

On Mon, Aug 1, 2022 at 3:11 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
> >
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only (writes and data slices are not permitted). For reads on the
> > dynptr, this includes reading into data in the non-linear paged buffers
> > but for writes and data slices, if the data is in a paged buffer, the
> > user must first call bpf_skb_pull_data to pull the data into the linear
> > portion.
> >
> > Additionally, any helper calls that change the underlying packet buffer
> > (eg bpf_skb_pull_data) invalidates any data slices of the associated
> > dynptr.
> >
> > Right now, skb dynptrs can only be constructed from skbs that are
> > the bpf program context - as such, there does not need to be any
> > reference tracking or release on skb dynptrs.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/bpf.h            |  8 ++++-
> >  include/linux/filter.h         |  4 +++
> >  include/uapi/linux/bpf.h       | 42 ++++++++++++++++++++++++--
> >  kernel/bpf/helpers.c           | 54 +++++++++++++++++++++++++++++++++-
> >  kernel/bpf/verifier.c          | 43 +++++++++++++++++++++++----
> >  net/core/filter.c              | 53 ++++++++++++++++++++++++++++++---
> >  tools/include/uapi/linux/bpf.h | 42 ++++++++++++++++++++++++--
> >  7 files changed, 229 insertions(+), 17 deletions(-)
> >
>
> [...]
>
> > +       type = bpf_dynptr_get_type(dst);
> > +
> > +       if (flags) {
> > +               if (type == BPF_DYNPTR_TYPE_SKB) {
> > +                       if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
> > +                               return -EINVAL;
> > +               } else {
> > +                       return -EINVAL;
> > +               }
> > +       }
> > +
> > +       if (type == BPF_DYNPTR_TYPE_SKB) {
> > +               struct sk_buff *skb = dst->data;
> > +
> > +               /* if the data is paged, the caller needs to pull it first */
> > +               if (dst->offset + offset + len > skb->len - skb->data_len)
> > +                       return -EAGAIN;
> > +
> > +               return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
> > +                                            flags);
> > +       }
>
> It seems like it would be cleaner to have a switch per dynptr type and
> each case doing its extra error checking (like CSUM and HASH flags for
> TYPE_SKB) and then performing write operation.
>
>
> memcpy can be either a catch-all default case, or perhaps it's safer
> to explicitly list TYPE_LOCAL and TYPE_RINGBUF to do memcpy, and then
> default should WARN() and return error?

Sounds great, I will make these changes (and the one below) for v2
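
For reference, a rough sketch of the switch-based dispatch being
suggested, reusing the checks from the v1 hunk quoted above (type,
dst, src, offset, len and flags are the helper's locals/arguments;
this is not the actual v2 code):

        /* inside BPF_CALL_5(bpf_dynptr_write, ...), sketch only */
        switch (type) {
        case BPF_DYNPTR_TYPE_SKB: {
                struct sk_buff *skb = dst->data;

                if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
                        return -EINVAL;
                /* if the data is paged, the caller needs to pull it first */
                if (dst->offset + offset + len > skb->len - skb->data_len)
                        return -EAGAIN;
                return __bpf_skb_store_bytes(skb, dst->offset + offset, src,
                                             len, flags);
        }
        case BPF_DYNPTR_TYPE_LOCAL:
        case BPF_DYNPTR_TYPE_RINGBUF:
                if (flags)
                        return -EINVAL;
                memcpy(dst->data + dst->offset + offset, src, len);
                return 0;
        default:
                WARN_ONCE(1, "unknown dynptr type %d\n", type);
                return -EFAULT;
        }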

>
> > +
> >         memcpy(dst->data + dst->offset + offset, src, len);
> >
> >         return 0;
> > @@ -1555,6 +1594,7 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
> >
> >  BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
> >  {
> > +       enum bpf_dynptr_type type;
> >         int err;
> >
> >         if (!ptr->data)
> > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> >         if (bpf_dynptr_is_rdonly(ptr))
> >                 return 0;
> >
> > +       type = bpf_dynptr_get_type(ptr);
> > +
> > +       if (type == BPF_DYNPTR_TYPE_SKB) {
> > +               struct sk_buff *skb = ptr->data;
> > +
> > +               /* if the data is paged, the caller needs to pull it first */
> > +               if (ptr->offset + offset + len > skb->len - skb->data_len)
> > +                       return 0;
> > +
> > +               return (unsigned long)(skb->data + ptr->offset + offset);
> > +       }
> > +
> >         return (unsigned long)(ptr->data + ptr->offset + offset);
>
> Similarly, all these dynptr helpers effectively dispatch different
> implementations based on dynptr type. I think switch is most
> appropriate for this.
>
> >  }
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 0d523741a543..0838653eeb4e 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -263,6 +263,7 @@ struct bpf_call_arg_meta {
> >         u32 subprogno;
> >         struct bpf_map_value_off_desc *kptr_off_desc;
> >         u8 uninit_dynptr_regno;
> > +       enum bpf_dynptr_type type;
> >  };
> >
>
> [...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 23:23                   ` Martin KaFai Lau
@ 2022-08-02  0:56                     ` Martin KaFai Lau
  2022-08-02  3:51                       ` Andrii Nakryiko
  0 siblings, 1 reply; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-02  0:56 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 01, 2022 at 04:23:16PM -0700, Martin KaFai Lau wrote:
> On Mon, Aug 01, 2022 at 03:58:41PM -0700, Andrii Nakryiko wrote:
> > On Mon, Aug 1, 2022 at 3:33 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> > > > On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > > > > Since we are on bpf_dynptr_write, what is the reason
> > > > > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > > > > way is better than another.  I would like to understand the reason
> > > > > > > behind it since it is not clear in the commit message.
> > > > > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > > > > may be writes that pull the skb, so any existing data slices to the
> > > > > > skb must be invalidated. However, in the verifier we can't detect when
> > > > > > the data slice should be invalidated vs. when it shouldn't (eg
> > > > > > detecting when a write goes into the paged area vs when the write is
> > > > > > only in the head). If the prog wants to write into the paged area, I
> > > > > > think the only way it can work is if it pulls the data first with
> > > > > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > > > > the commit message in v2
> > > > > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > > > > after bpf_skb_store_bytes().  Potentially the same could be done for
> > > > > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > > > > behavior cannot be changed later, so I want to raise this possibility here
> > > > > just in case it wasn't considered before.
> > > >
> > > > Thanks for raising this possibility. To me, it seems more intuitive
> > > > from the user standpoint to have bpf_dynptr_write() on a paged area
> > > > fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> > > > have bpf_dynptr_write() always invalidate all dynptr slices related to
> > > > that skb. I think most writes will be to the data in the head area,
> > > > so it seems unfortunate that bpf_dynptr_writes to the head area would
> > > > invalidate the dynptr slices regardless.
> > > >
> > > > What are your thoughts? Do you think you prefer having
> > > > bpf_dynptr_write() always work regardless of where the data is? If so,
> > > > I'm happy to make that change for v2 :)
> > > Yeah, it sounds like an optimization to avoid unnecessarily
> > > invalidating the sliced data.
> > >
> > > To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo
> > > will be used, considering there is usually a pkt read before a pkt write in
> > > the pkt modification use case.  If I got that far as to have a sliced data
> > > pointer that satisfies what I need for reading, I would try to avoid making
> > > an extra call to dynptr_write() to modify it.
> > >
> > > I would prefer that users can have similar expectations (no need to worry
> > > about pkt layout) for both dynptr_read() and dynptr_write(), and also a
> > > similar experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().
> > > Otherwise, these are just unnecessary rules for users to remember, with no
> > > clear benefit from this optimization.
> > >
> > 
> > Are you saying that bpf_dynptr_read() shouldn't read from non-linear
> > part of skb (and thus match more restrictive bpf_dynptr_write), or are
> > you saying you'd rather have bpf_dynptr_write() write into non-linear
> > part but invalidate bpf_dynptr_data() pointers?
> The latter.  Read and write without worrying about the skb layout.
> 
> Also, if the prog needs to call a helper to write, it knows the bytes are
> not in the data pointer.  Then it needs to bpf_skb_pull_data() before
> it can call write.  However, after bpf_skb_pull_data(), why would the prog
> need to call the write helper instead of directly getting a new
> data pointer and writing to it?  If the prog needs to write many
> bytes, a write helper may then help.
After another thought: other than the non-linear handling, where
bpf_skb_store_bytes() / dynptr_write() is more useful is in
the 'BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH' flags.

That said, my preference is still to have the same expectation on
non-linear data for both dynptr_read() and dynptr_write().  Considering
the user can fall back to using bpf_skb_load_bytes() and
bpf_skb_store_bytes(), I am fine with the current patch also.
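
To make the flags point concrete, here is a sketch of a dynptr write
that asks the helper to fix up the checksum, assuming bpf_dynptr_write()
keeps the bpf_skb_store_bytes() flag semantics from this patch; the
offset is illustrative:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int rewrite_payload(struct __sk_buff *skb)
{
        __u8 bytes[4] = { 0xde, 0xad, 0xbe, 0xef };
        struct bpf_dynptr ptr;

        bpf_dynptr_from_skb(skb, 0, &ptr);

        /* 54 = 14 (eth) + 20 (ipv4) + 20 (tcp): first payload bytes.
         * The helper recomputes the checksum over the stored bytes and
         * drops the stale skb->hash.
         */
        if (bpf_dynptr_write(&ptr, 54, bytes, sizeof(bytes),
                             BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
                return SK_DROP;

        return SK_PASS;
}

char _license[] SEC("license") = "GPL";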

> 
> > 
> > I guess I agree about consistency and that it seems like in practice
> > you'd use bpf_dynptr_data() to work with headers and stuff like that
> > at known locations, and then if you need to modify the rest of payload
> > you'd do either bpf_skb_load_bytes()/bpf_skb_store_bytes() or
> > bpf_dynptr_read()/bpf_dynptr_write() which would invalidate
> > bpf_dynptr_data() pointers (but that would be ok by that time).
> imo, read, write and then go back to read is less common.
> writing bytes without first reading them is also less common.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-01 23:33   ` Jakub Kicinski
@ 2022-08-02  2:12     ` Joanne Koong
  2022-08-04 21:55       ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-02  2:12 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: bpf, andrii, daniel, ast

On Mon, Aug 1, 2022 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> (consider cross-posting network-related stuff to netdev@)

Great, I will start cc-ing netdev@

>
> On Tue, 26 Jul 2022 11:47:04 -0700 Joanne Koong wrote:
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
>
> Is there really a need for dynptr_from_{skb,xdp} to be different
> function IDs? I was hoping this work would improve portability of
> networking BPF programs across the hooks.

Awesome, I like this idea of having just one generic API named
something like bpf_dynptr_from_packet. I'll add this for v2!

>
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only (writes and data slices are not permitted). For reads on the
> > dynptr, this includes reading into data in the non-linear paged buffers
> > but for writes and data slices, if the data is in a paged buffer, the
> > user must first call bpf_skb_pull_data to pull the data into the linear
> > portion.
> >
> > Additionally, any helper calls that change the underlying packet buffer
> > (eg bpf_skb_pull_data) invalidates any data slices of the associated
> > dynptr.
>
> Grepping the verifier did not help me find that, would you mind
> pointing me to the code?

The base reg type of a skb data slice will be PTR_TO_PACKET - this
gets set in this patch in check_helper_call() in verifier.c:

+ if (func_id == BPF_FUNC_dynptr_data &&
+    meta.type == BPF_DYNPTR_TYPE_SKB)
+ regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;

Anytime there is a helper call that changes the underlying packet
buffer [0], the verifier iterates through the registers and marks all
PTR_TO_PACKET reg types as unknown, which invalidates them. The dynptr
data slice will be invalidated since its base reg type is
PTR_TO_PACKET

The stack trace is:
   check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown()


I will add this explanation to the commit message for v2 since it is non-obvious


[0] https://elixir.bootlin.com/linux/latest/source/kernel/bpf/verifier.c#L7143

[1] https://elixir.bootlin.com/linux/latest/source/kernel/bpf/verifier.c#L6489
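
For comparison, the same invalidation already applies to direct packet
pointers; a minimal sketch (not from the patchset) of the re-derive
pattern the verifier forces:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("tc")
int pkt_ptr_invalidation(struct __sk_buff *skb)
{
        void *data = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        struct ethhdr *eth = data;

        if ((void *)(eth + 1) > data_end)
                return SK_DROP;

        /* may change the underlying buffer: the verifier marks every
         * PTR_TO_PACKET register, including eth, as unknown
         */
        if (bpf_skb_pull_data(skb, skb->len))
                return SK_DROP;

        /* a load through the old eth would now be rejected, so the
         * pointers have to be re-derived and re-checked
         */
        data = (void *)(long)skb->data;
        data_end = (void *)(long)skb->data_end;
        eth = data;
        if ((void *)(eth + 1) > data_end)
                return SK_DROP;

        return eth->h_proto ? SK_PASS : SK_DROP;
}

char _license[] SEC("license") = "GPL";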


>
> > Right now, skb dynptrs can only be constructed from skbs that are
> > the bpf program context - as such, there does not need to be any
> > reference tracking or release on skb dynptrs.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-02  0:56                     ` Martin KaFai Lau
@ 2022-08-02  3:51                       ` Andrii Nakryiko
  2022-08-02  4:53                         ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-02  3:51 UTC (permalink / raw)
  To: Martin KaFai Lau, Jakub Kicinski
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov, Networking

On Mon, Aug 1, 2022 at 5:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Mon, Aug 01, 2022 at 04:23:16PM -0700, Martin KaFai Lau wrote:
> > On Mon, Aug 01, 2022 at 03:58:41PM -0700, Andrii Nakryiko wrote:
> > > On Mon, Aug 1, 2022 at 3:33 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> > > > > On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > > > > > Since we are on bpf_dynptr_write, what is the reason
> > > > > > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > > > > > way is better than another.  I would like to understand the reason
> > > > > > > > behind it since it is not clear in the commit message.
> > > > > > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > > > > > may be writes that pull the skb, so any existing data slices to the
> > > > > > > skb must be invalidated. However, in the verifier we can't detect when
> > > > > > > the data slice should be invalidated vs. when it shouldn't (eg
> > > > > > > detecting when a write goes into the paged area vs when the write is
> > > > > > > only in the head). If the prog wants to write into the paged area, I
> > > > > > > think the only way it can work is if it pulls the data first with
> > > > > > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > > > > > the commit message in v2
> > > > > > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > > > > > after bpf_skb_store_bytes().  Potentially the same could be done for
> > > > > > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > > > > > behavior cannot be changed later, so I want to raise this possibility here
> > > > > > just in case it wasn't considered before.
> > > > >
> > > > > Thanks for raising this possibility. To me, it seems more intuitive
> > > > > from the user standpoint to have bpf_dynptr_write() on a paged area
> > > > > fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> > > > > have bpf_dynptr_write() always invalidate all dynptr slices related to
> > > > > that skb. I think most writes will be to the data in the head area,
> > > > > so it seems unfortunate that bpf_dynptr_writes to the head area would
> > > > > invalidate the dynptr slices regardless.
> > > > >
> > > > > What are your thoughts? Do you think you prefer having
> > > > > bpf_dynptr_write() always work regardless of where the data is? If so,
> > > > > I'm happy to make that change for v2 :)
> > > > Yeah, it sounds like an optimization to avoid unnecessarily
> > > > invalidating the sliced data.
> > > >
> > > > To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo
> > > > will be used, considering there is usually a pkt read before a pkt write in
> > > > the pkt modification use case.  If I got that far as to have a sliced data
> > > > pointer that satisfies what I need for reading, I would try to avoid making
> > > > an extra call to dynptr_write() to modify it.
> > > >
> > > > I would prefer that users can have similar expectations (no need to worry
> > > > about pkt layout) for both dynptr_read() and dynptr_write(), and also a
> > > > similar experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().
> > > > Otherwise, these are just unnecessary rules for users to remember, with no
> > > > clear benefit from this optimization.
> > > >
> > >
> > > Are you saying that bpf_dynptr_read() shouldn't read from non-linear
> > > part of skb (and thus match more restrictive bpf_dynptr_write), or are
> > > you saying you'd rather have bpf_dynptr_write() write into non-linear
> > > part but invalidate bpf_dynptr_data() pointers?
> > The latter.  Read and write without worrying about the skb layout.
> >
> > Also, if the prog needs to call a helper to write, it knows the bytes are
> > not in the data pointer.  Then it needs to bpf_skb_pull_data() before
> > it can call write.  However, after bpf_skb_pull_data(), why would the prog
> > need to call the write helper instead of directly getting a new
> > data pointer and writing to it?  If the prog needs to write many
> > bytes, a write helper may then help.
> After another thought: other than the non-linear handling, where
> bpf_skb_store_bytes() / dynptr_write() is more useful is in
> the 'BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH' flags.
>
> That said, my preference is still to have the same expectation on
> non-linear data for both dynptr_read() and dynptr_write().  Considering
> the user can fall back to using bpf_skb_load_bytes() and
> bpf_skb_store_bytes(), I am fine with the current patch also.
>

Honestly, I don't have any specific preference, because I don't have
much experience writing networking BPF :)

But considering Jakub's point about trying to unify skb/xdp dynptrs,
while I can see how we might have symmetrical dynptr_{read,write}()
for the skb case (because you can pull the skb), I believe this is not
possible with XDP (e.g., the multi-buffer one), so bpf_dynptr_write()
would always be more limited for the XDP case.

Or maybe it is possible for XDP and I'm totally wrong here? I'm happy
to be educated about this!

> >
> > >
> > > I guess I agree about consistency and that it seems like in practice
> > > you'd use bpf_dynptr_data() to work with headers and stuff like that
> > > at known locations, and then if you need to modify the rest of payload
> > > you'd do either bpf_skb_load_bytes()/bpf_skb_store_bytes() or
> > > bpf_dynptr_read()/bpf_dynptr_write() which would invalidate
> > > bpf_dynptr_data() pointers (but that would be ok by that time).
> > imo, read, write and then go back to read is less common.
> > writing bytes without first reading them is also less common.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-02  3:51                       ` Andrii Nakryiko
@ 2022-08-02  4:53                         ` Joanne Koong
  2022-08-02  5:14                           ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-02  4:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Martin KaFai Lau, Jakub Kicinski, bpf, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Networking

On Mon, Aug 1, 2022 at 8:51 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Aug 1, 2022 at 5:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Mon, Aug 01, 2022 at 04:23:16PM -0700, Martin KaFai Lau wrote:
> > > On Mon, Aug 01, 2022 at 03:58:41PM -0700, Andrii Nakryiko wrote:
> > > > On Mon, Aug 1, 2022 at 3:33 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> > > > > > On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > > > > > > Since we are on bpf_dynptr_write, what is the reason
> > > > > > > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > > > > > > way is better than another.  I would like to understand the reason
> > > > > > > > > behind it since it is not clear in the commit message.
> > > > > > > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > > > > > > may be writes that pull the skb, so any existing data slices to the
> > > > > > > > skb must be invalidated. However, in the verifier we can't detect when
> > > > > > > > the data slice should be invalidated vs. when it shouldn't (eg
> > > > > > > > detecting when a write goes into the paged area vs when the write is
> > > > > > > > only in the head). If the prog wants to write into the paged area, I
> > > > > > > > think the only way it can work is if it pulls the data first with
> > > > > > > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > > > > > > the commit message in v2
> > > > > > > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > > > > > > after bpf_skb_store_bytes().  Potentially the same could be done for
> > > > > > > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > > > > > > behavior cannot be changed later, so I want to raise this possibility here
> > > > > > > just in case it wasn't considered before.
> > > > > >
> > > > > > Thanks for raising this possibility. To me, it seems more intuitive
> > > > > > from the user standpoint to have bpf_dynptr_write() on a paged area
> > > > > > fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> > > > > > have bpf_dynptr_write() always invalidate all dynptr slices related to
> > > > > > that skb. I think most writes will be to the data in the head area,
> > > > > > so it seems unfortunate that bpf_dynptr_writes to the head area would
> > > > > > invalidate the dynptr slices regardless.
> > > > > >
> > > > > > What are your thoughts? Do you think you prefer having
> > > > > > bpf_dynptr_write() always work regardless of where the data is? If so,
> > > > > > I'm happy to make that change for v2 :)
> > > > > Yeah, it sounds like an optimization to avoid unnecessarily
> > > > > invalidating the sliced data.
> > > > >
> > > > > To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo
> > > > > will be used, considering there is usually a pkt read before a pkt write in
> > > > > the pkt modification use case.  If I got that far as to have a sliced data
> > > > > pointer that satisfies what I need for reading, I would try to avoid making
> > > > > an extra call to dynptr_write() to modify it.
> > > > >
> > > > > I would prefer that users can have similar expectations (no need to worry
> > > > > about pkt layout) for both dynptr_read() and dynptr_write(), and also a
> > > > > similar experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().
> > > > > Otherwise, these are just unnecessary rules for users to remember, with no
> > > > > clear benefit from this optimization.
> > > > >
> > > >
> > > > Are you saying that bpf_dynptr_read() shouldn't read from non-linear
> > > > part of skb (and thus match more restrictive bpf_dynptr_write), or are
> > > > you saying you'd rather have bpf_dynptr_write() write into non-linear
> > > > part but invalidate bpf_dynptr_data() pointers?
> > > The latter.  Read and write without worrying about the skb layout.
> > >
> > > Also, if the prog needs to call a helper to write, it knows the bytes are
> > > not in the data pointer.  Then it needs to bpf_skb_pull_data() before
> > > it can call write.  However, after bpf_skb_pull_data(), why would the prog
> > > need to call the write helper instead of directly getting a new
> > > data pointer and writing to it?  If the prog needs to write many
> > > bytes, a write helper may then help.
> > After another thought: other than the non-linear handling, where
> > bpf_skb_store_bytes() / dynptr_write() is more useful is in
> > the 'BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH' flags.
> >
> > That said, my preference is still to have the same expectation on
> > non-linear data for both dynptr_read() and dynptr_write().  Considering
> > the user can fall back to using bpf_skb_load_bytes() and
> > bpf_skb_store_bytes(), I am fine with the current patch also.
> >
>
> Honestly, I don't have any specific preference, because I don't have
> much experience writing networking BPF :)
>
> But considering Jakub's point about trying to unify skb/xdp dynptrs,
> while I can see how we might have symmetrical dynptr_{read,write}()
> for the skb case (because you can pull the skb), I believe this is not
> possible with XDP (e.g., the multi-buffer one), so bpf_dynptr_write()
> would always be more limited for the XDP case.
>
> Or maybe it is possible for XDP and I'm totally wrong here? I'm happy
> to be educated about this!

My understanding is that it's possible for XDP because the data in the
frags is mapped (eg we can use skb_frag_address() to get the address
and then copy into it with direct memcpys [0]), whereas skb frags are
unmapped (eg access into the frag requires kmapping [1]).

Maybe one solution is to add a function that does the mapping + write
to a skb frag without pulling it to the head. This would allow
bpf_dynptr_write to all data without needing to invalidate any dynptr
slices. But I don't know whether this is compatible with recomputing
the checksum or not, maybe the written data needs to be mapped (and
hence part of head) so that it can be used to compute the checksum [2]
- I'll read up some more on the checksumming code.
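
For reference, a rough, untested sketch of the kind of frag walk the
code at [0] does, showing why a plain memcpy is enough on the xdp side:

#include <linux/skbuff.h>
#include <net/xdp.h>

/* Sketch only: write len bytes from src into an xdp_buff at offset
 * off, spilling from the linear head into the (already mapped) frags.
 */
static void xdp_store_bytes_sketch(struct xdp_buff *xdp, unsigned long off,
                                   void *src, unsigned long len)
{
        unsigned long headlen = xdp->data_end - xdp->data;
        struct skb_shared_info *sinfo;
        unsigned int i;

        if (off < headlen) {
                unsigned long n = min(len, headlen - off);

                memcpy(xdp->data + off, src, n);
                src += n;
                len -= n;
                off += n;
        }
        if (!len || !xdp_buff_has_frags(xdp))
                return;

        off -= headlen;
        sinfo = xdp_get_shared_info_from_buff(xdp);
        for (i = 0; len && i < sinfo->nr_frags; i++) {
                skb_frag_t *frag = &sinfo->frags[i];
                unsigned long size = skb_frag_size(frag);
                unsigned long n;

                if (off >= size) {
                        off -= size;
                        continue;
                }
                n = min(len, size - off);
                /* frag pages are mapped, no kmap needed */
                memcpy(skb_frag_address(frag) + off, src, n);
                src += n;
                len -= n;
                off = 0;
        }
}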

I like your point Martin that if people are using bpf_dynptr_write,
then they probably aren't using data slices much anyways so it
wouldn't be too inconvenient that their slices are invalidated (eg if
they are using bpf_dynptr_write it's to write into the skb frag, at
which point they would need to call pull before bpf_dynptr_write,
which would lead to the same scenario where the data slices are
invalidated). My main concern was that slices would be invalidated for
bpf_dynptr_writes on data in the head area, but you're right that that
shouldn't be too likely since they'd just be using a direct data slice
access instead to read/write. I'll change it so that bpf_dynptr_write
always succeeds and it'll always invalidate the data slices for v2.

[0] https://elixir.bootlin.com/linux/v5.19/source/net/core/filter.c#L3846
[1] https://elixir.bootlin.com/linux/v5.19/source/net/core/skbuff.c#L2367
[2] https://elixir.bootlin.com/linux/v5.19/source/include/linux/skbuff.h#L3839

>
> > >
> > > >
> > > > I guess I agree about consistency and that it seems like in practice
> > > > you'd use bpf_dynptr_data() to work with headers and stuff like that
> > > > at known locations, and then if you need to modify the rest of payload
> > > > you'd do either bpf_skb_load_bytes()/bpf_skb_store_bytes() or
> > > > bpf_dynptr_read()/bpf_dynptr_write() which would invalidate
> > > > bpf_dynptr_data() pointers (but that would be ok by that time).
> > > imo, read, write and then go back to read is less common.
> > > writing bytes without first reading them is also less common.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-02  4:53                         ` Joanne Koong
@ 2022-08-02  5:14                           ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-02  5:14 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Martin KaFai Lau, Jakub Kicinski, bpf, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov, Networking

On Mon, Aug 1, 2022 at 9:53 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 1, 2022 at 8:51 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Aug 1, 2022 at 5:56 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Mon, Aug 01, 2022 at 04:23:16PM -0700, Martin KaFai Lau wrote:
> > > > On Mon, Aug 01, 2022 at 03:58:41PM -0700, Andrii Nakryiko wrote:
> > > > > On Mon, Aug 1, 2022 at 3:33 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 01, 2022 at 02:16:23PM -0700, Joanne Koong wrote:
> > > > > > > On Mon, Aug 1, 2022 at 12:38 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 01, 2022 at 10:52:14AM -0700, Joanne Koong wrote:
> > > > > > > > > > Since we are on bpf_dynptr_write, what is the reason
> > > > > > > > > > on limiting it to the skb_headlen() ?  Not implying one
> > > > > > > > > > way is better than another.  I would like to understand the reason
> > > > > > > > > > behind it since it is not clear in the commit message.
> > > > > > > > > For bpf_dynptr_write, if we don't limit it to skb_headlen() then there
> > > > > > > > > may be writes that pull the skb, so any existing data slices to the
> > > > > > > > > skb must be invalidated. However, in the verifier we can't detect when
> > > > > > > > > the data slice should be invalidated vs. when it shouldn't (eg
> > > > > > > > > detecting when a write goes into the paged area vs when the write is
> > > > > > > > > only in the head). If the prog wants to write into the paged area, I
> > > > > > > > > think the only way it can work is if it pulls the data first with
> > > > > > > > > bpf_skb_pull_data before calling bpf_dynptr_write. I will add this to
> > > > > > > > > the commit message in v2
> > > > > > > > Note that the current verifier unconditionally invalidates PTR_TO_PACKET
> > > > > > > > after bpf_skb_store_bytes().  Potentially the same could be done for
> > > > > > > > other new helpers like bpf_dynptr_write().  I think this bpf_dynptr_write()
> > > > > > > > behavior cannot be changed later, so I want to raise this possibility here
> > > > > > > > just in case it wasn't considered before.
> > > > > > >
> > > > > > > Thanks for raising this possibility. To me, it seems more intuitive
> > > > > > > from the user standpoint to have bpf_dynptr_write() on a paged area
> > > > > > > fail (even if bpf_dynptr_read() on that same offset succeeds) than to
> > > > > > > have bpf_dynptr_write() always invalidate all dynptr slices related to
> > > > > > > that skb. I think most writes will be to the data in the head area,
> > > > > > > so it seems unfortunate that bpf_dynptr_writes to the head area would
> > > > > > > invalidate the dynptr slices regardless.
> > > > > > >
> > > > > > > What are your thoughts? Do you think you prefer having
> > > > > > > bpf_dynptr_write() always work regardless of where the data is? If so,
> > > > > > > I'm happy to make that change for v2 :)
> > > > > > Yeah, it sounds like an optimization to avoid unnecessarily
> > > > > > invalidating the sliced data.
> > > > > >
> > > > > > To be honest, I am not sure how often the dynptr_data()+dynptr_write() combo
> > > > > > will be used, considering there is usually a pkt read before a pkt write in
> > > > > > the pkt modification use case.  If I got that far as to have a sliced data
> > > > > > pointer that satisfies what I need for reading, I would try to avoid making
> > > > > > an extra call to dynptr_write() to modify it.
> > > > > >
> > > > > > I would prefer that users can have similar expectations (no need to worry
> > > > > > about pkt layout) for both dynptr_read() and dynptr_write(), and also a
> > > > > > similar experience to bpf_skb_load_bytes() and bpf_skb_store_bytes().
> > > > > > Otherwise, these are just unnecessary rules for users to remember, with no
> > > > > > clear benefit from this optimization.
> > > > > >
> > > > >
> > > > > Are you saying that bpf_dynptr_read() shouldn't read from non-linear
> > > > > part of skb (and thus match more restrictive bpf_dynptr_write), or are
> > > > > you saying you'd rather have bpf_dynptr_write() write into non-linear
> > > > > part but invalidate bpf_dynptr_data() pointers?
> > > > The latter.  Read and write without worrying about the skb layout.
> > > >
> > > > Also, if the prog needs to call a helper to write, it knows the bytes are
> > > > not in the data pointer.  Then it needs to bpf_skb_pull_data() before
> > > > it can call write.  However, after bpf_skb_pull_data(), why would the prog
> > > > need to call the write helper instead of directly getting a new
> > > > data pointer and writing to it?  If the prog needs to write many
> > > > bytes, a write helper may then help.
> > > After another thought: other than the non-linear handling, where
> > > bpf_skb_store_bytes() / dynptr_write() is more useful is in
> > > the 'BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH' flags.
> > >
> > > That said, my preference is still to have the same expectation on
> > > non-linear data for both dynptr_read() and dynptr_write().  Considering
> > > the user can fall back to using bpf_skb_load_bytes() and
> > > bpf_skb_store_bytes(), I am fine with the current patch also.
> > >
> >
> > Honestly, I don't have any specific preference, because I don't have
> > much experience writing networking BPF :)
> >
> > But considering Jakub's point about trying to unify skb/xdp dynptrs,
> > while I can see how we might have symmetrical dynptr_{read,write}()
> > for the skb case (because you can pull the skb), I believe this is not
> > possible with XDP (e.g., the multi-buffer one), so bpf_dynptr_write()
> > would always be more limited for the XDP case.
> >
> > Or maybe it is possible for XDP and I'm totally wrong here? I'm happy
> > to be educated about this!
>
> My understanding is that it's possible for XDP because the data in the
> frags is mapped (eg we can use skb_frag_address() to get the address
> and then copy into it with direct memcpys [0]), whereas skb frags are
> unmapped (eg access into the frag requires kmapping [1]).
>
> Maybe one solution is to add a function that does the mapping + write
> to a skb frag without pulling it to the head. This would allow
> bpf_dynptr_write to all data without needing to invalidate any dynptr
> slices. But I don't know whether this is compatible with recomputing
> the checksum or not, maybe the written data needs to be mapped (and
> hence part of head) so that it can be used to compute the checksum [2]
> - I'll read up some more on the checksumming code.
>
> I like your point Martin that if people are using bpf_dynptr_write,
> then they probably aren't using data slices much anyways so it
> wouldn't be too inconvenient that their slices are invalidated (eg if
> they are using bpf_dynptr_write it's to write into the skb frag, at
> which point they would need to call pull before bpf_dynptr_write,
> which would lead to the same scenario where the data slices are
> invalidated). My main concern was that slices would be invalidated for
> bpf_dynptr_writes on data in the head area, but you're right that that
> shouldn't be too likely since they'd just be using a direct data slice
> access instead to read/write. I'll change it so that bpf_dynptr_write
> always succeeds and it'll always invalidate the data slices for v2.

On second thought, for v2 I plan to combine xdp and skb into one generic
function ("bpf_dynptr_from_packet") per Jakub's suggestion. In that case,
for consistency, I think it'd be better if we don't invalidate the
slices, since it would be confusing if bpf_dynptr_write invalidates
slices for skb-type progs but not xdp-type ones. I think
bpf_dynptr_write returning an error for writes into the frag area for
skb-type progs but not xdp-type progs is less jarring than dynptr
slices being invalidated for skb-type progs but not xdp-type progs.
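
A sketch of how that could look from a program, using the
bpf_dynptr_from_packet name floated above (hypothetical: no such helper
exists yet, and the signature is invented to mirror bpf_dynptr_from_skb):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int parse_eth(struct xdp_md *ctx)
{
        struct bpf_dynptr ptr;
        struct ethhdr *eth;

        /* hypothetical helper: the same call would work in tc progs */
        bpf_dynptr_from_packet(ctx, 0, &ptr);

        eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
        if (!eth)
                return XDP_DROP;

        return eth->h_proto == bpf_htons(ETH_P_IP) ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";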

>
> [0] https://elixir.bootlin.com/linux/v5.19/source/net/core/filter.c#L3846
> [1] https://elixir.bootlin.com/linux/v5.19/source/net/core/skbuff.c#L2367
> [2] https://elixir.bootlin.com/linux/v5.19/source/include/linux/skbuff.h#L3839
>
> >
> > > >
> > > > >
> > > > > I guess I agree about consistency and that it seems like in practice
> > > > > you'd use bpf_dynptr_data() to work with headers and stuff like that
> > > > > at known locations, and then if you need to modify the rest of payload
> > > > > you'd do either bpf_skb_load_bytes()/bpf_skb_store_bytes() or
> > > > > bpf_dynptr_read()/bpf_dynptr_write() which would invalidate
> > > > > bpf_dynptr_data() pointers (but that would be ok by that time).
> > > > imo, read, write and then go back to read is less common.
> > > > writing bytes without first reading them is also less common.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-01 19:12   ` Alexei Starovoitov
@ 2022-08-02 22:21     ` Joanne Koong
  2022-08-04 21:46       ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-02 22:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 1, 2022 at 12:12 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> >
> > -static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
> >  {
> > -       void *data_end = (void *)(long)xdp->data_end;
> > -       void *data = (void *)(long)xdp->data;
> > +       struct bpf_dynptr new_xdp_ptr;
> >         struct iptnl_info *tnl;
> >         struct ethhdr *new_eth;
> >         struct ethhdr *old_eth;
> > -       struct iphdr *iph = data + sizeof(struct ethhdr);
> > +       struct iphdr *iph;
> >         __u16 *next_iph;
> >         __u16 payload_len;
> >         struct vip vip = {};
> > @@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> >         __u32 csum = 0;
> >         int i;
> >
> > -       if (iph + 1 > data_end)
> > +       iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> > +                             iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
> > +       if (!iph)
> >                 return XDP_DROP;
>
> dynptr based xdp/skb access looks neat.
> Maybe in addition to bpf_dynptr_data() we can add helper(s)
> that return skb/xdp_md from dynptr?
> This way the code will be passing dynptr only and there will
> be no need to pass around 'struct xdp_md *xdp' (like this function).

Great idea! I'll add this to v2.

>
> Separately please keep the existing tests instead of converting them.
> Either ifdef data/data_end vs dynptr style or copy paste
> the whole test into a new .c file. Whichever is cleaner.

Will do for v2.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-01 17:58   ` Andrii Nakryiko
@ 2022-08-02 22:56     ` Joanne Koong
  2022-08-03  0:53       ` Andrii Nakryiko
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-02 22:56 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Mon, Aug 1, 2022 at 10:58 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Test skb and xdp dynptr functionality in the following ways:
> >
> > 1) progs/test_xdp.c
> >    * Change existing test to use dynptrs to parse xdp data
> >
> >      There were no noticeable differences in user + system time between
> >      the original version vs. using dynptrs. Averaging the time for 10
> >      runs (run using "time ./test_progs -t xdp_bpf2bpf"):
> >          original version: 0.0449 sec
> >          with dynptrs: 0.0429 sec
> >
> > 2) progs/test_l4lb_noinline.c
> >    * Change existing test to use dynptrs to parse skb data
> >
> >      There were no noticeable differences in user + system time between
> >      the original version vs. using dynptrs. Averaging the time for 10
> >      runs (run using "time ./test_progs -t l4lb_all/l4lb_noinline"):
> >          original version: 0.0502 sec
> >          with dynptrs: 0.055 sec
> >
> >      For number of processed verifier instructions:
> >          original version: 6284 insns
> >          with dynptrs: 2538 insns
> >
> > 3) progs/test_dynptr_xdp.c
> >    * Add sample code for parsing tcp hdr opt lookup using dynptrs.
> >      This logic is lifted from a real-world use case of packet parsing in
> >      katran [0], a layer 4 load balancer
> >
> > 4) progs/dynptr_success.c
> >    * Add test case "test_skb_readonly" for testing attempts at writes /
> >      data slices on a prog type with read-only skb ctx.
> >
> > 5) progs/dynptr_fail.c
> >    * Add test cases "skb_invalid_data_slice" and
> >      "xdp_invalid_data_slice" for testing that helpers that modify the
> >      underlying packet buffer automatically invalidate the associated
> >      data slice.
> >    * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
> >      that prog types that do not support bpf_dynptr_from_skb/xdp don't
> >      have access to the API.
> >
> > [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  .../testing/selftests/bpf/prog_tests/dynptr.c |  85 ++++++++++---
> >  .../selftests/bpf/prog_tests/dynptr_xdp.c     |  49 ++++++++
> >  .../testing/selftests/bpf/progs/dynptr_fail.c |  76 ++++++++++++
> >  .../selftests/bpf/progs/dynptr_success.c      |  32 +++++
> >  .../selftests/bpf/progs/test_dynptr_xdp.c     | 115 ++++++++++++++++++
> >  .../selftests/bpf/progs/test_l4lb_noinline.c  |  71 +++++------
> >  tools/testing/selftests/bpf/progs/test_xdp.c  |  95 +++++++--------
> >  .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
> >  8 files changed, 416 insertions(+), 108 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > index bcf80b9f7c27..c40631f33c7b 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > @@ -2,6 +2,7 @@
> >  /* Copyright (c) 2022 Facebook */
> >
> >  #include <test_progs.h>
> > +#include <network_helpers.h>
> >  #include "dynptr_fail.skel.h"
> >  #include "dynptr_success.skel.h"
> >
> > @@ -11,8 +12,8 @@ static char obj_log_buf[1048576];
> >  static struct {
> >         const char *prog_name;
> >         const char *expected_err_msg;
> > -} dynptr_tests[] = {
> > -       /* failure cases */
> > +} verifier_error_tests[] = {
> > +       /* these cases should trigger a verifier error */
> >         {"ringbuf_missing_release1", "Unreleased reference id=1"},
> >         {"ringbuf_missing_release2", "Unreleased reference id=2"},
> >         {"ringbuf_missing_release_callback", "Unreleased reference id"},
> > @@ -42,11 +43,25 @@ static struct {
> >         {"release_twice_callback", "arg 1 is an unacquired reference"},
> >         {"dynptr_from_mem_invalid_api",
> >                 "Unsupported reg type fp for bpf_dynptr_from_mem data"},
> > +       {"skb_invalid_data_slice", "invalid mem access 'scalar'"},
> > +       {"xdp_invalid_data_slice", "invalid mem access 'scalar'"},
> > +       {"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb"},
> > +       {"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp"},
> > +};
> > +
> > +enum test_setup_type {
> > +       SETUP_SYSCALL_SLEEP,
> > +       SETUP_SKB_PROG,
> > +};
> >
> > -       /* success cases */
> > -       {"test_read_write", NULL},
> > -       {"test_data_slice", NULL},
> > -       {"test_ringbuf", NULL},
> > +static struct {
> > +       const char *prog_name;
> > +       enum test_setup_type type;
> > +} runtime_tests[] = {
> > +       {"test_read_write", SETUP_SYSCALL_SLEEP},
> > +       {"test_data_slice", SETUP_SYSCALL_SLEEP},
> > +       {"test_ringbuf", SETUP_SYSCALL_SLEEP},
> > +       {"test_skb_readonly", SETUP_SKB_PROG},
>
> nit: wouldn't it be better to add test_setup_type to dynptr_tests (and
> keep fail and success cases together)? It's conceivable that you might
> want different setups to test different error conditions, right?

Yeah! I originally separated them out because the success tests don't
have an error message while the fail ones do, and the fail ones don't
have a setup (I don't think we'll need any custom userspace setup for
those since we're checking for verifier failures at prog load time)
while the success ones do. But I can combine them into one table so
that it's simpler. I will do this in v2.
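
A sketch of the combined table, assuming a NULL expected_err_msg marks
a success case (just the direction agreed on here, not the actual v2
code):

static struct {
        const char *prog_name;
        const char *expected_err_msg; /* NULL for success cases */
        enum test_setup_type type;    /* ignored for load-time failures */
} dynptr_tests[] = {
        {"ringbuf_missing_release1", "Unreleased reference id=1", SETUP_SYSCALL_SLEEP},
        {"skb_invalid_data_slice", "invalid mem access 'scalar'", SETUP_SYSCALL_SLEEP},
        {"test_read_write", NULL, SETUP_SYSCALL_SLEEP},
        {"test_skb_readonly", NULL, SETUP_SKB_PROG},
};
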
>
> >  };
> >
> >  static void verify_fail(const char *prog_name, const char *expected_err_msg)
> > @@ -85,7 +100,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
> >         dynptr_fail__destroy(skel);
> >  }
> >
> > -static void verify_success(const char *prog_name)
> > +static void run_tests(const char *prog_name, enum test_setup_type setup_type)
> >  {
> >         struct dynptr_success *skel;
> >         struct bpf_program *prog;
> > @@ -107,15 +122,42 @@ static void verify_success(const char *prog_name)
> >         if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
> >                 goto cleanup;
> >
> > -       link = bpf_program__attach(prog);
> > -       if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
> > -               goto cleanup;
> > +       switch (setup_type) {
> > +       case SETUP_SYSCALL_SLEEP:
> > +               link = bpf_program__attach(prog);
> > +               if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
> > +                       goto cleanup;
> >
> > -       usleep(1);
> > +               usleep(1);
> >
> > -       ASSERT_EQ(skel->bss->err, 0, "err");
> > +               bpf_link__destroy(link);
> > +               break;
> > +       case SETUP_SKB_PROG:
> > +       {
> > +               int prog_fd, err;
> > +               char buf[64];
> > +
> > +               prog_fd = bpf_program__fd(prog);
> > +               if (CHECK_FAIL(prog_fd < 0))
>
> please don't use CHECK and especially CHECK_FAIL
>
> > +                       goto cleanup;
> > +
> > +               LIBBPF_OPTS(bpf_test_run_opts, topts,
> > +                           .data_in = &pkt_v4,
> > +                           .data_size_in = sizeof(pkt_v4),
> > +                           .data_out = buf,
> > +                           .data_size_out = sizeof(buf),
> > +                           .repeat = 1,
> > +               );
>
> nit: LIBBPF_OPTS declares variable, so should be part of variable
> declaration block
>
> >
> > -       bpf_link__destroy(link);
> > +               err = bpf_prog_test_run_opts(prog_fd, &topts);
> > +
> > +               if (!ASSERT_OK(err, "test_run"))
> > +                       goto cleanup;
> > +
> > +               break;
> > +       }
> > +       }
> > +       ASSERT_EQ(skel->bss->err, 0, "err");
> >
> >  cleanup:
> >         dynptr_success__destroy(skel);
> > @@ -125,14 +167,17 @@ void test_dynptr(void)
> >  {
> >         int i;
> >
> > -       for (i = 0; i < ARRAY_SIZE(dynptr_tests); i++) {
> > -               if (!test__start_subtest(dynptr_tests[i].prog_name))
> > +       for (i = 0; i < ARRAY_SIZE(verifier_error_tests); i++) {
> > +               if (!test__start_subtest(verifier_error_tests[i].prog_name))
> > +                       continue;
> > +
> > +               verify_fail(verifier_error_tests[i].prog_name,
> > +                           verifier_error_tests[i].expected_err_msg);
> > +       }
> > +       for (i = 0; i < ARRAY_SIZE(runtime_tests); i++) {
> > +               if (!test__start_subtest(runtime_tests[i].prog_name))
> >                         continue;
> >
> > -               if (dynptr_tests[i].expected_err_msg)
> > -                       verify_fail(dynptr_tests[i].prog_name,
> > -                                   dynptr_tests[i].expected_err_msg);
> > -               else
> > -                       verify_success(dynptr_tests[i].prog_name);
> > +               run_tests(runtime_tests[i].prog_name, runtime_tests[i].type);
> >         }
> >  }
> > diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c b/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> > new file mode 100644
> > index 000000000000..ca775d126b60
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> > @@ -0,0 +1,49 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <test_progs.h>
> > +#include <network_helpers.h>
> > +#include "test_dynptr_xdp.skel.h"
> > +#include "test_tcp_hdr_options.h"
> > +
> > +struct test_pkt {
> > +       struct ipv6_packet pk6_v6;
> > +       u8 options[16];
> > +} __packed;
> > +
> > +void test_dynptr_xdp(void)
> > +{
> > +       struct test_dynptr_xdp *skel;
> > +       char buf[128];
> > +       int err;
> > +
> > +       skel = test_dynptr_xdp__open_and_load();
> > +       if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
> > +               return;
> > +
> > +       struct test_pkt pkt = {
> > +               .pk6_v6.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
> > +               .pk6_v6.iph.nexthdr = IPPROTO_TCP,
> > +               .pk6_v6.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
> > +               .pk6_v6.tcp.urg_ptr = 123,
> > +               .pk6_v6.tcp.doff = 9, /* 16 bytes of options */
> > +
> > +               .options = {
> > +                       TCPOPT_MSS, 4, 0x05, 0xB4, TCPOPT_NOP, TCPOPT_NOP,
> > +                       skel->rodata->tcp_hdr_opt_kind_tpr, 6, 0, 0, 0, 9, TCPOPT_EOL
> > +               },
> > +       };
> > +
> > +       LIBBPF_OPTS(bpf_test_run_opts, topts,
> > +                   .data_in = &pkt,
> > +                   .data_size_in = sizeof(pkt),
> > +                   .data_out = buf,
> > +                   .data_size_out = sizeof(buf),
> > +                   .repeat = 3,
> > +       );
> > +
>
> for topts and pkt, they should be up above with other variables
> (unless we want to break off from kernel code style, which I didn't
> think we want)
>
> > +       err = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.xdp_ingress_v6), &topts);
> > +       ASSERT_OK(err, "ipv6 test_run");
> > +       ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
> > +       ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval");
> > +
> > +       test_dynptr_xdp__destroy(skel);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
> > index c1814938a5fd..4e3f853b2d02 100644
> > --- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
> > +++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
> > @@ -5,6 +5,7 @@
> >  #include <string.h>
> >  #include <linux/bpf.h>
> >  #include <bpf/bpf_helpers.h>
> > +#include <linux/if_ether.h>
> >  #include "bpf_misc.h"
> >
> >  char _license[] SEC("license") = "GPL";
> > @@ -622,3 +623,78 @@ int dynptr_from_mem_invalid_api(void *ctx)
> >
> >         return 0;
> >  }
> > +
> > +/* The data slice is invalidated whenever a helper changes packet data */
> > +SEC("?tc")
> > +int skb_invalid_data_slice(struct __sk_buff *skb)
> > +{
> > +       struct bpf_dynptr ptr;
> > +       struct ethhdr *hdr;
> > +
> > +       bpf_dynptr_from_skb(skb, 0, &ptr);
> > +       hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
> > +       if (!hdr)
> > +               return SK_DROP;
> > +
> > +       hdr->h_proto = 12;
> > +
> > +       if (bpf_skb_pull_data(skb, skb->len))
> > +               return SK_DROP;
> > +
> > +       /* this should fail */
> > +       hdr->h_proto = 1;
> > +
> > +       return SK_PASS;
> > +}
> > +
> > +/* The data slice is invalidated whenever a helper changes packet data */
> > +SEC("?xdp")
> > +int xdp_invalid_data_slice(struct xdp_md *xdp)
> > +{
> > +       struct bpf_dynptr ptr;
> > +       struct ethhdr *hdr1, *hdr2;
> > +
> > +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> > +       hdr1 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr1));
> > +       if (!hdr1)
> > +               return XDP_DROP;
> > +
> > +       hdr2 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr2));
> > +       if (!hdr2)
> > +               return XDP_DROP;
> > +
> > +       hdr1->h_proto = 12;
> > +       hdr2->h_proto = 12;
> > +
> > +       if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr1)))
> > +               return XDP_DROP;
> > +
> > +       /* this should fail */
> > +       hdr2->h_proto = 1;
>
> is there something special about having both hdr1 and hdr2? Wouldn't
> this test work with just single hdr pointer?

Yes, this test would work with just a single hdr pointer (which is
what skb_invalid_data_slice does) but I wanted to ensure that this
also works in the case of multiple data slices. If you think this is
unnecessary / adds clutter, I can remove hdr2.

>
> > +
> > +       return XDP_PASS;
> > +}
> > +
> > +/* Only supported prog type can create skb-type dynptrs */
>
> [...]
>
> > +       err = 1;
> > +
> > +       if (bpf_dynptr_from_skb(ctx, 0, &ptr))
> > +               return 0;
> > +       err++;
> > +
> > +       data = bpf_dynptr_data(&ptr, 0, 1);
> > +       if (data)
> > +               /* it's an error if data is not NULL since cgroup skbs
> > +                * are read only
> > +                */
> > +               return 0;
> > +       err++;
> > +
> > +       ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
> > +       /* since cgroup skbs are read only, writes should fail */
> > +       if (ret != -EINVAL)
> > +               return 0;
> > +
> > +       err = 0;
>
> hm, if data is NULL you'll still report success if bpf_dynptr_write
> returns 0 or any other error but -EINVAL... The logic is a bit unclear
> here...
>
If data is NULL and bpf_dynptr_write returns 0 or any other error
besides -EINVAL, I think we report a failure here (err is set to a
non-zero value, which userspace checks against)
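
The userspace side then just asserts the global is 0 after the test
run, roughly (a sketch, assuming err is a plain global exposed via the
skeleton's bss):

	ASSERT_EQ(skel->bss->err, 0, "err");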

> > +
> > +       return 0;
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > new file mode 100644
> > index 000000000000..c879dfb6370a
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > @@ -0,0 +1,115 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/* This logic is lifted from a real-world use case of packet parsing, used in
> > + * the open source library katran, a layer 4 load balancer.
> > + *
> > + * This test demonstrates how to parse packet contents using dynptrs.
> > + *
> > + * https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
> > + */
> > +
> > +#include <string.h>
> > +#include <linux/bpf.h>
> > +#include <bpf/bpf_helpers.h>
> > +#include <linux/tcp.h>
> > +#include <stdbool.h>
> > +#include <linux/ipv6.h>
> > +#include <linux/if_ether.h>
> > +#include "test_tcp_hdr_options.h"
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +/* Arbitrarily picked unused value from IANA TCP Option Kind Numbers */
> > +const __u32 tcp_hdr_opt_kind_tpr = 0xB7;
> > +/* Length of the tcp header option */
> > +const __u32 tcp_hdr_opt_len_tpr = 6;
> > +/* maximum number of header options to check to lookup server_id */
> > +const __u32 tcp_hdr_opt_max_opt_checks = 15;
> > +
> > +__u32 server_id;
> > +
> > +static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining,
> > +                        __u32 *server_id)
> > +{
> > +       __u8 *tcp_opt, kind, hdr_len;
> > +       __u8 *data;
> > +
> > +       data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) +
> > +                              sizeof(*server_id));
> > +       if (!data)
> > +               return -1;
> > +
> > +       kind = data[0];
> > +
> > +       if (kind == TCPOPT_EOL)
> > +               return -1;
> > +
> > +       if (kind == TCPOPT_NOP) {
> > +               *off += 1;
> > +               /* continue to the next option */
> > +               *hdr_bytes_remaining -= 1;
> > +
> > +               return 0;
> > +       }
> > +
> > +       if (*hdr_bytes_remaining < 2)
> > +               return -1;
> > +
> > +       hdr_len = data[1];
> > +       if (hdr_len > *hdr_bytes_remaining)
> > +               return -1;
> > +
> > +       if (kind == tcp_hdr_opt_kind_tpr) {
> > +               if (hdr_len != tcp_hdr_opt_len_tpr)
> > +                       return -1;
> > +
> > +               memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
>
> this implicitly relies on compiler inlining memcpy, let's use
> __builtin_memcpy() here instead to set a good example?

Sounds good, I will change this for v2. Should memcpys in bpf progs
always use __builtin_memcpy or is it on a case-by-case basis where if
the size is small enough, then you use it?
>
> > +               return 1;
> > +       }
> > +
> > +       *off += hdr_len;
> > +       *hdr_bytes_remaining -= hdr_len;
> > +
> > +       return 0;
> > +}
> > +
> > +SEC("xdp")
> > +int xdp_ingress_v6(struct xdp_md *xdp)
> > +{
> > +       __u8 hdr_bytes_remaining;
> > +       struct tcphdr *tcp_hdr;
> > +       __u8 tcp_hdr_opt_len;
> > +       int err = 0;
> > +       __u32 off;
> > +
> > +       struct bpf_dynptr ptr;
> > +
> > +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> > +
> > +       off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
> > +
> > +       tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr));
> > +       if (!tcp_hdr)
> > +               return XDP_DROP;
> > +
> > +       tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
> > +       if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
> > +               return XDP_DROP;
> > +
> > +       hdr_bytes_remaining = tcp_hdr_opt_len;
> > +
> > +       off += sizeof(struct tcphdr);
> > +
> > +       /* max number of bytes of options in tcp header is 40 bytes */
> > +       for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
> > +               err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id);
> > +
> > +               if (err || !hdr_bytes_remaining)
> > +                       break;
> > +       }
> > +
> > +       if (!server_id)
> > +               return XDP_DROP;
> > +
> > +       return XDP_PASS;
> > +}
>
> I'm not a networking BPF expert, but the logic of packet parsing here
> looks pretty clean! Would it be possible to also backport original
> code with data and data_end, both for testing but also to be able to
> compare and contrast dynptr vs data/data_end approaches?
>

Yeah, good idea! I'll backport the original code from katran for v2.

>
> > diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> > index c8bc0c6947aa..1fef7868ea8b 100644
> > --- a/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> > +++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline.c
> > @@ -230,21 +230,18 @@ static __noinline bool get_packet_dst(struct real_definition **real,
> >         return true;
> >  }
> >
> > -static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
> > +static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off,
> >                                    struct packet_description *pckt)
> >  {
> >         struct icmp6hdr *icmp_hdr;
> >         struct ipv6hdr *ip6h;
> >
> > -       icmp_hdr = data + off;
> > -       if (icmp_hdr + 1 > data_end)
> > +       icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*ip6h));
> > +       if (!icmp_hdr)
> >                 return TC_ACT_SHOT;
> >         if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
> >                 return TC_ACT_OK;
>
> previously you could still return TC_ACT_OK if it's not
> ICMPV6_PKT_TOOBIG even if packet size is < sizeof(*icmp_hdr) +
> sizeof(*ip6h), which might have been a bug, but the current logic will
> enforce that the packet is at least sizeof(*icmp_hdr) + sizeof(*ip6h).
> Is that a problem?

Good point, I think it's best to maintain the original behavior. I'll
fix this (and the one below) for v2.
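
Probably by splitting it into two data slices, so that the TC_ACT_OK
path doesn't require room for the inner header. A rough, untested
sketch:

	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr));
	if (!icmp_hdr)
		return TC_ACT_SHOT;
	if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
		return TC_ACT_OK;

	ip6h = bpf_dynptr_data(skb_ptr, off + sizeof(*icmp_hdr), sizeof(*ip6h));
	if (!ip6h)
		return TC_ACT_SHOT;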

>
> > -       off += sizeof(struct icmp6hdr);
> > -       ip6h = data + off;
> > -       if (ip6h + 1 > data_end)
> > -               return TC_ACT_SHOT;
> > +       ip6h = (struct ipv6hdr *)(icmp_hdr + 1);
> >         pckt->proto = ip6h->nexthdr;
> >         pckt->flags |= F_ICMP;
> >         memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16);
> > @@ -252,22 +249,19 @@ static __noinline int parse_icmpv6(void *data, void *data_end, __u64 off,
> >         return TC_ACT_UNSPEC;
> >  }
> >
> > -static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
> > +static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off,
> >                                  struct packet_description *pckt)
> >  {
> >         struct icmphdr *icmp_hdr;
> >         struct iphdr *iph;
> >
> > -       icmp_hdr = data + off;
> > -       if (icmp_hdr + 1 > data_end)
> > +       icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr) + sizeof(*iph));
> > +       if (!icmp_hdr)
> >                 return TC_ACT_SHOT;
> >         if (icmp_hdr->type != ICMP_DEST_UNREACH ||
> >             icmp_hdr->code != ICMP_FRAG_NEEDED)
> >                 return TC_ACT_OK;
>
> similarly here, short packets can still be TC_ACT_OK in some
> circumstances, while with dynptr they will be shot down early on. Not
> saying this is wrong or bad, just bringing this up for you and others
> to chime in if it's an ok change
>
> > -       off += sizeof(struct icmphdr);
> > -       iph = data + off;
> > -       if (iph + 1 > data_end)
> > -               return TC_ACT_SHOT;
> > +       iph = (struct iphdr *)(icmp_hdr + 1);
> >         if (iph->ihl != 5)
> >                 return TC_ACT_SHOT;
> >         pckt->proto = iph->protocol;
> > @@ -277,13 +271,13 @@ static __noinline int parse_icmp(void *data, void *data_end, __u64 off,
> >         return TC_ACT_UNSPEC;
> >  }
>
> [...]
>
> > -static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
> >  {
> > -       void *data_end = (void *)(long)xdp->data_end;
> > -       void *data = (void *)(long)xdp->data;
> > +       struct bpf_dynptr new_xdp_ptr;
> >         struct iptnl_info *tnl;
> >         struct ethhdr *new_eth;
> >         struct ethhdr *old_eth;
> > -       struct iphdr *iph = data + sizeof(struct ethhdr);
> > +       struct iphdr *iph;
> >         __u16 *next_iph;
> >         __u16 payload_len;
> >         struct vip vip = {};
> > @@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> >         __u32 csum = 0;
> >         int i;
> >
> > -       if (iph + 1 > data_end)
> > +       iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> > +                             iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
>
> tcphdr_sz (20) is always bigger than udphdr_sz (8), so just use the
> bigger one here? Though again, for UDP packet it might be a bit too
> pessimistic to reject small packets?

Yeah, maybe the best way here is to first check whether data_end -
data is less than the size of ethhdr + iphdr + tcphdr: if it is, we
must use the udphdr size, otherwise we can use the tcphdr size. I'll
change it (and the one below) to this for v2.
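
i.e. keep each bpf_dynptr_data() length constant and branch on the
packet length up front, roughly (a sketch, reusing the *_sz constants
from the test):

	if (xdp->data_end - xdp->data < ethhdr_sz + iphdr_sz + tcphdr_sz)
		iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz, iphdr_sz + udphdr_sz);
	else
		iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz, iphdr_sz + tcphdr_sz);
	if (!iph)
		return XDP_DROP;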

>
> > +       if (!iph)
> >                 return XDP_DROP;
> >
> > -       dport = get_dport(iph + 1, data_end, iph->protocol);
> > +       dport = get_dport(iph + 1, iph->protocol);
> >         if (dport == -1)
> >                 return XDP_DROP;
>
> [...]
>
> > -static __always_inline int handle_ipv6(struct xdp_md *xdp)
> > +static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
> >  {
> > -       void *data_end = (void *)(long)xdp->data_end;
> > -       void *data = (void *)(long)xdp->data;
> > +       struct bpf_dynptr new_xdp_ptr;
> >         struct iptnl_info *tnl;
> >         struct ethhdr *new_eth;
> >         struct ethhdr *old_eth;
> > -       struct ipv6hdr *ip6h = data + sizeof(struct ethhdr);
> > +       struct ipv6hdr *ip6h;
> >         __u16 payload_len;
> >         struct vip vip = {};
> >         int dport;
> >
> > -       if (ip6h + 1 > data_end)
> > +       ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> > +                              ipv6hdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
>
> ditto, there is no dynamism here, verifier actually enforces that this
> value is statically known, I think this example will create false
> assumptions if written this way
>
> > +       if (!ip6h)
> >                 return XDP_DROP;
> >
> > -       dport = get_dport(ip6h + 1, data_end, ip6h->nexthdr);
> > +       dport = get_dport(ip6h + 1, ip6h->nexthdr);
> >         if (dport == -1)
> >                 return XDP_DROP;
> >
>
> [...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-02 22:56     ` Joanne Koong
@ 2022-08-03  0:53       ` Andrii Nakryiko
  2022-08-03 16:11         ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-03  0:53 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Tue, Aug 2, 2022 at 3:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 1, 2022 at 10:58 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > Test skb and xdp dynptr functionality in the following ways:
> > >
> > > 1) progs/test_xdp.c
> > >    * Change existing test to use dynptrs to parse xdp data
> > >
> > >      There were no noticeable differences in user + system time between
> > >      the original version vs. using dynptrs. Averaging the time for 10
> > >      runs (run using "time ./test_progs -t xdp_bpf2bpf"):
> > >          original version: 0.0449 sec
> > >          with dynptrs: 0.0429 sec
> > >
> > > 2) progs/test_l4lb_noinline.c
> > >    * Change existing test to use dynptrs to parse skb data
> > >
> > >      There were no noticeable differences in user + system time between
> > >      the original version vs. using dynptrs. Averaging the time for 10
> > >      runs (run using "time ./test_progs -t l4lb_all/l4lb_noinline"):
> > >          original version: 0.0502 sec
> > >          with dynptrs: 0.055 sec
> > >
> > >      For number of processed verifier instructions:
> > >          original version: 6284 insns
> > >          with dynptrs: 2538 insns
> > >
> > > 3) progs/test_dynptr_xdp.c
> > >    * Add sample code for tcp hdr opt lookup using dynptrs.
> > >      This logic is lifted from a real-world use case of packet parsing in
> > >      katran [0], a layer 4 load balancer
> > >
> > > 4) progs/dynptr_success.c
> > >    * Add test case "test_skb_readonly" for testing attempts at writes /
> > >      data slices on a prog type with read-only skb ctx.
> > >
> > > 5) progs/dynptr_fail.c
> > >    * Add test cases "skb_invalid_data_slice" and
> > >      "xdp_invalid_data_slice" for testing that helpers that modify the
> > >      underlying packet buffer automatically invalidate the associated
> > >      data slice.
> > >    * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
> > >      that prog types that do not support bpf_dynptr_from_skb/xdp don't
> > >      have access to the API.
> > >
> > > [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > >  .../testing/selftests/bpf/prog_tests/dynptr.c |  85 ++++++++++---
> > >  .../selftests/bpf/prog_tests/dynptr_xdp.c     |  49 ++++++++
> > >  .../testing/selftests/bpf/progs/dynptr_fail.c |  76 ++++++++++++
> > >  .../selftests/bpf/progs/dynptr_success.c      |  32 +++++
> > >  .../selftests/bpf/progs/test_dynptr_xdp.c     | 115 ++++++++++++++++++
> > >  .../selftests/bpf/progs/test_l4lb_noinline.c  |  71 +++++------
> > >  tools/testing/selftests/bpf/progs/test_xdp.c  |  95 +++++++--------
> > >  .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
> > >  8 files changed, 416 insertions(+), 108 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > > index bcf80b9f7c27..c40631f33c7b 100644
> > > --- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > > +++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > > @@ -2,6 +2,7 @@
> > >  /* Copyright (c) 2022 Facebook */
> > >
> > >  #include <test_progs.h>
> > > +#include <network_helpers.h>
> > >  #include "dynptr_fail.skel.h"
> > >  #include "dynptr_success.skel.h"
> > >
> > > @@ -11,8 +12,8 @@ static char obj_log_buf[1048576];
> > >  static struct {
> > >         const char *prog_name;
> > >         const char *expected_err_msg;
> > > -} dynptr_tests[] = {
> > > -       /* failure cases */
> > > +} verifier_error_tests[] = {
> > > +       /* these cases should trigger a verifier error */
> > >         {"ringbuf_missing_release1", "Unreleased reference id=1"},
> > >         {"ringbuf_missing_release2", "Unreleased reference id=2"},
> > >         {"ringbuf_missing_release_callback", "Unreleased reference id"},
> > > @@ -42,11 +43,25 @@ static struct {
> > >         {"release_twice_callback", "arg 1 is an unacquired reference"},
> > >         {"dynptr_from_mem_invalid_api",
> > >                 "Unsupported reg type fp for bpf_dynptr_from_mem data"},
> > > +       {"skb_invalid_data_slice", "invalid mem access 'scalar'"},
> > > +       {"xdp_invalid_data_slice", "invalid mem access 'scalar'"},
> > > +       {"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb"},
> > > +       {"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp"},
> > > +};
> > > +
> > > +enum test_setup_type {
> > > +       SETUP_SYSCALL_SLEEP,
> > > +       SETUP_SKB_PROG,
> > > +};
> > >
> > > -       /* success cases */
> > > -       {"test_read_write", NULL},
> > > -       {"test_data_slice", NULL},
> > > -       {"test_ringbuf", NULL},
> > > +static struct {
> > > +       const char *prog_name;
> > > +       enum test_setup_type type;
> > > +} runtime_tests[] = {
> > > +       {"test_read_write", SETUP_SYSCALL_SLEEP},
> > > +       {"test_data_slice", SETUP_SYSCALL_SLEEP},
> > > +       {"test_ringbuf", SETUP_SYSCALL_SLEEP},
> > > +       {"test_skb_readonly", SETUP_SKB_PROG},
> >
> > nit: wouldn't it be better to add test_setup_type to dynptr_tests (and
> > keep fail and success cases together)? It's conceivable that you might
> > want different setups to test different error conditions, right?
>
> Yeah! I originally separated it out because the success tests don't
> have an error message while the fail ones do, and fail ones don't have
> a setup (I don't think we'll need any custom userspace setup for those
> since we're checking for verifier failures at prog load time) and the
> success ones do. But I can combine them into 1 so that it's simpler. I
> will do this in v2.

great, thanks! you might actually need custom setup for SKB vs XDP
programs if you are unifying bpf_dynptr_from_packet?
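
e.g. just extend the setup enum with an xdp variant (hypothetical
name, only needed if xdp programs get runtime tests here):

enum test_setup_type {
	SETUP_SYSCALL_SLEEP,
	SETUP_SKB_PROG,
	SETUP_XDP_PROG,	/* hypothetical xdp setup */
};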

> >
> > >  };
> > >
> > >  static void verify_fail(const char *prog_name, const char *expected_err_msg)
> > > @@ -85,7 +100,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
> > >         dynptr_fail__destroy(skel);
> > >  }
> > >

[...]

> > > +/* The data slice is invalidated whenever a helper changes packet data */
> > > +SEC("?xdp")
> > > +int xdp_invalid_data_slice(struct xdp_md *xdp)
> > > +{
> > > +       struct bpf_dynptr ptr;
> > > +       struct ethhdr *hdr1, *hdr2;
> > > +
> > > +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> > > +       hdr1 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr1));
> > > +       if (!hdr1)
> > > +               return XDP_DROP;
> > > +
> > > +       hdr2 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr2));
> > > +       if (!hdr2)
> > > +               return XDP_DROP;
> > > +
> > > +       hdr1->h_proto = 12;
> > > +       hdr2->h_proto = 12;
> > > +
> > > +       if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr1)))
> > > +               return XDP_DROP;
> > > +
> > > +       /* this should fail */
> > > +       hdr2->h_proto = 1;
> >
> > is there something special about having both hdr1 and hdr2? Wouldn't
> > this test work with just single hdr pointer?
>
> > Yes, this test would work with just a single hdr pointer (which is
> what skb_invalid_data_slice does) but I wanted to ensure that this
> also works in the case of multiple data slices. If you think this is
> unnecessary / adds clutter, I can remove hdr2.


I think testing two pointers isn't the point, so I'd keep the test
minimal. It seems like testing two pointers should be in a success
test, to prove it works, rather than as some side effect of an
expected-to-fail test, no?

>
> >
> > > +
> > > +       return XDP_PASS;
> > > +}
> > > +
> > > +/* Only supported prog type can create skb-type dynptrs */
> >
> > [...]
> >
> > > +       err = 1;
> > > +
> > > +       if (bpf_dynptr_from_skb(ctx, 0, &ptr))
> > > +               return 0;
> > > +       err++;
> > > +
> > > +       data = bpf_dynptr_data(&ptr, 0, 1);
> > > +       if (data)
> > > +               /* it's an error if data is not NULL since cgroup skbs
> > > +                * are read only
> > > +                */
> > > +               return 0;
> > > +       err++;
> > > +
> > > +       ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
> > > +       /* since cgroup skbs are read only, writes should fail */
> > > +       if (ret != -EINVAL)
> > > +               return 0;
> > > +
> > > +       err = 0;
> >
> > hm, if data is NULL you'll still report success if bpf_dynptr_write
> > returns 0 or any other error but -EINVAL... The logic is a bit unclear
> > here...
> >
> If data is NULL and bpf_dynptr_write returns 0 or any other error
> besides -EINVAL, I think we report a failure here (err is set to a
> non-zero value, which userspace checks against)

oh, ok, I read it backwards. I find this "stateful increasing error
number" pattern very confusing. Why not write it more
straightforwardly as:

if (bpf_dynptr_from_skb(...)) {
    err = 1;
    return 0;
}

data = bpf_dynptr_data(...);
if (data) {
    err = 2;
    return 0;
}

ret = bpf_dynptr_write(...);
if (ret != -EINVAL) {
    err = 3;
    return 0;
}

/* all tests passed */
err = 0;
return 0;

?

>
> > > +
> > > +       return 0;
> > > +}
> > > diff --git a/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > > new file mode 100644
> > > index 000000000000..c879dfb6370a
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > > @@ -0,0 +1,115 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +

[...]

> > > +       hdr_len = data[1];
> > > +       if (hdr_len > *hdr_bytes_remaining)
> > > +               return -1;
> > > +
> > > +       if (kind == tcp_hdr_opt_kind_tpr) {
> > > +               if (hdr_len != tcp_hdr_opt_len_tpr)
> > > +                       return -1;
> > > +
> > > +               memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
> >
> > this implicitly relies on compiler inlining memcpy, let's use
> > __builtin_memcpy() here instead to set a good example?
>
> Sounds good, I will change this for v2. Should memcpys in bpf progs
> always use __builtin_memcpy or is it on a case-by-case basis where if
> the size is small enough, then you use it?

__builtin_memcpy() is best. When we write just "memcpy()" we still
rely on the compiler to optimize that into __builtin_memcpy(),
because there is no memcpy() implementation (we'd get an unrecognized
extern error if the compiler actually emitted a call to memcpy()).
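
So either call __builtin_memcpy() directly, or pin memcpy to the
builtin once at the top of the prog, e.g. (a sketch; I haven't checked
whether any of our shared headers already define this):

#ifndef memcpy
#define memcpy(dest, src, n) __builtin_memcpy((dest), (src), (n))
#endif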

> >
> > > +               return 1;
> > > +       }
> > > +
> > > +       *off += hdr_len;
> > > +       *hdr_bytes_remaining -= hdr_len;
> > > +
> > > +       return 0;
> > > +}
> > > +

[...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
                     ` (4 preceding siblings ...)
  2022-08-01 23:33   ` Jakub Kicinski
@ 2022-08-03  6:37   ` Martin KaFai Lau
  5 siblings, 0 replies; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-03  6:37 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59a217ca2dfd..0730cd198a7f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5241,11 +5241,22 @@ union bpf_attr {
>   *	Description
>   *		Write *len* bytes from *src* into *dst*, starting from *offset*
>   *		into *dst*.
> - *		*flags* is currently unused.
> + *
> + *		*flags* must be 0 except for skb-type dynptrs.
> + *
> + *		For skb-type dynptrs:
> + *		    *  if *offset* + *len* extends into the skb's paged buffers, the user
> + *		       should manually pull the skb with bpf_skb_pull and then try again.
bpf_skb_pull_data().

Probably needs formatting like **bpf_skb_pull_data**\ ()

> + *
> + *		    *  *flags* are a combination of **BPF_F_RECOMPUTE_CSUM** (automatically
> + *			recompute the checksum for the packet after storing the bytes) and
> + *			**BPF_F_INVALIDATE_HASH** (set *skb*\ **->hash**, *skb*\
> + *			**->swhash** and *skb*\ **->l4hash** to 0).
>   *	Return
>   *		0 on success, -E2BIG if *offset* + *len* exceeds the length
>   *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> - *		is a read-only dynptr or if *flags* is not 0.
> + *		is a read-only dynptr or if *flags* is not correct, -EAGAIN if for
> + *		skb-type dynptrs the write extends into the skb's paged buffers.
May also mention that the other negative errors are similar to
bpf_skb_store_bytes()'s instead of listing them one-by-one here.

>   *
>   * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
>   *	Description
> @@ -5253,10 +5264,19 @@ union bpf_attr {
>   *
>   *		*len* must be a statically known value. The returned data slice
>   *		is invalidated whenever the dynptr is invalidated.
> + *
> + *		For skb-type dynptrs:
> + *		    * if *offset* + *len* extends into the skb's paged buffers,
> + *		      the user should manually pull the skb with bpf_skb_pull and then
same here. bpf_skb_pull_data().

> + *		      try again.
> + *
> + *		    * the data slice is automatically invalidated anytime a
> + *		      helper call that changes the underlying packet buffer
> + *		      (eg bpf_skb_pull) is called.
>   *	Return
>   *		Pointer to the underlying dynptr data, NULL if the dynptr is
>   *		read-only, if the dynptr is invalid, or if the offset and length
> - *		is out of bounds.
> + *		is out of bounds or in a paged buffer for skb-type dynptrs.
>   *
>   * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>   *	Description
> @@ -5331,6 +5351,21 @@ union bpf_attr {
>   *		**-EACCES** if the SYN cookie is not valid.
>   *
>   *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
> + *
> + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
> + *	Description
> + *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
> + *		context. Depending on program type, the dynptr may be read-only,
> + *		in which case trying to obtain a direct data slice to it through
> + *		bpf_dynptr_data will return an error.
> + *
> + *		Calls that change the *skb*'s underlying packet buffer
> + *		(eg bpf_skb_pull_data) do not invalidate the dynptr, but they do
> + *		invalidate any data slices associated with the dynptr.
> + *
> + *		*flags* is currently unused, it must be 0 for now.
> + *	Return
> + *		0 on success or -EINVAL if flags is not 0.
>   */

[ ... ]

> @@ -1528,15 +1544,38 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
>  BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
>  	   u32, len, u64, flags)
>  {
> +	enum bpf_dynptr_type type;
>  	int err;
>  
> -	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> +	if (!dst->data || bpf_dynptr_is_rdonly(dst))
>  		return -EINVAL;
>  
>  	err = bpf_dynptr_check_off_len(dst, offset, len);
>  	if (err)
>  		return err;
>  
> +	type = bpf_dynptr_get_type(dst);
> +
> +	if (flags) {
> +		if (type == BPF_DYNPTR_TYPE_SKB) {
> +			if (flags & ~(BPF_F_RECOMPUTE_CSUM | BPF_F_INVALIDATE_HASH))
nit.  The flags are the same as __bpf_skb_store_bytes()'s.
__bpf_skb_store_bytes() can reject them as well instead of duplicating
the test here.

> +				return -EINVAL;
> +		} else {
> +			return -EINVAL;
> +		}
> +	}
> +
> +	if (type == BPF_DYNPTR_TYPE_SKB) {
> +		struct sk_buff *skb = dst->data;
> +
> +		/* if the data is paged, the caller needs to pull it first */
> +		if (dst->offset + offset + len > skb->len - skb->data_len)
> +			return -EAGAIN;
> +
> +		return __bpf_skb_store_bytes(skb, dst->offset + offset, src, len,
> +					     flags);
> +	}
> +

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-03  0:53       ` Andrii Nakryiko
@ 2022-08-03 16:11         ` Joanne Koong
  2022-08-04 18:45           ` Alexei Starovoitov
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-03 16:11 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Tue, Aug 2, 2022 at 5:53 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Aug 2, 2022 at 3:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Aug 1, 2022 at 10:58 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > Test skb and xdp dynptr functionality in the following ways:
> > > >
> > > > 1) progs/test_xdp.c
> > > >    * Change existing test to use dynptrs to parse xdp data
> > > >
> > > >      There were no noticeable differences in user + system time between
> > > >      the original version vs. using dynptrs. Averaging the time for 10
> > > >      runs (run using "time ./test_progs -t xdp_bpf2bpf"):
> > > >          original version: 0.0449 sec
> > > >          with dynptrs: 0.0429 sec
> > > >
> > > > 2) progs/test_l4lb_noinline.c
> > > >    * Change existing test to use dynptrs to parse skb data
> > > >
> > > >      There were no noticeable differences in user + system time between
> > > >      the original version vs. using dynptrs. Averaging the time for 10
> > > >      runs (run using "time ./test_progs -t l4lb_all/l4lb_noinline"):
> > > >          original version: 0.0502 sec
> > > >          with dynptrs: 0.055 sec
> > > >
> > > >      For number of processed verifier instructions:
> > > >          original version: 6284 insns
> > > >          with dynptrs: 2538 insns
> > > >
> > > > 3) progs/test_dynptr_xdp.c
> > > >    * Add sample code for tcp hdr opt lookup using dynptrs.
> > > >      This logic is lifted from a real-world use case of packet parsing in
> > > >      katran [0], a layer 4 load balancer
> > > >
> > > > 4) progs/dynptr_success.c
> > > >    * Add test case "test_skb_readonly" for testing attempts at writes /
> > > >      data slices on a prog type with read-only skb ctx.
> > > >
> > > > 5) progs/dynptr_fail.c
> > > >    * Add test cases "skb_invalid_data_slice" and
> > > >      "xdp_invalid_data_slice" for testing that helpers that modify the
> > > >      underlying packet buffer automatically invalidate the associated
> > > >      data slice.
> > > >    * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
> > > >      that prog types that do not support bpf_dynptr_from_skb/xdp don't
> > > >      have access to the API.
> > > >
> > > > [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > >  .../testing/selftests/bpf/prog_tests/dynptr.c |  85 ++++++++++---
> > > >  .../selftests/bpf/prog_tests/dynptr_xdp.c     |  49 ++++++++
> > > >  .../testing/selftests/bpf/progs/dynptr_fail.c |  76 ++++++++++++
> > > >  .../selftests/bpf/progs/dynptr_success.c      |  32 +++++
> > > >  .../selftests/bpf/progs/test_dynptr_xdp.c     | 115 ++++++++++++++++++
> > > >  .../selftests/bpf/progs/test_l4lb_noinline.c  |  71 +++++------
> > > >  tools/testing/selftests/bpf/progs/test_xdp.c  |  95 +++++++--------
> > > >  .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
> > > >  8 files changed, 416 insertions(+), 108 deletions(-)
> > > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/dynptr_xdp.c
> > > >  create mode 100644 tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > > >
> > > > diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > > > index bcf80b9f7c27..c40631f33c7b 100644
> > > > --- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > > > +++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
> > > > @@ -2,6 +2,7 @@
> > > >  /* Copyright (c) 2022 Facebook */
> > > >
> > > >  #include <test_progs.h>
> > > > +#include <network_helpers.h>
> > > >  #include "dynptr_fail.skel.h"
> > > >  #include "dynptr_success.skel.h"
> > > >
> > > > @@ -11,8 +12,8 @@ static char obj_log_buf[1048576];
> > > >  static struct {
> > > >         const char *prog_name;
> > > >         const char *expected_err_msg;
> > > > -} dynptr_tests[] = {
> > > > -       /* failure cases */
> > > > +} verifier_error_tests[] = {
> > > > +       /* these cases should trigger a verifier error */
> > > >         {"ringbuf_missing_release1", "Unreleased reference id=1"},
> > > >         {"ringbuf_missing_release2", "Unreleased reference id=2"},
> > > >         {"ringbuf_missing_release_callback", "Unreleased reference id"},
> > > > @@ -42,11 +43,25 @@ static struct {
> > > >         {"release_twice_callback", "arg 1 is an unacquired reference"},
> > > >         {"dynptr_from_mem_invalid_api",
> > > >                 "Unsupported reg type fp for bpf_dynptr_from_mem data"},
> > > > +       {"skb_invalid_data_slice", "invalid mem access 'scalar'"},
> > > > +       {"xdp_invalid_data_slice", "invalid mem access 'scalar'"},
> > > > +       {"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb"},
> > > > +       {"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp"},
> > > > +};
> > > > +
> > > > +enum test_setup_type {
> > > > +       SETUP_SYSCALL_SLEEP,
> > > > +       SETUP_SKB_PROG,
> > > > +};
> > > >
> > > > -       /* success cases */
> > > > -       {"test_read_write", NULL},
> > > > -       {"test_data_slice", NULL},
> > > > -       {"test_ringbuf", NULL},
> > > > +static struct {
> > > > +       const char *prog_name;
> > > > +       enum test_setup_type type;
> > > > +} runtime_tests[] = {
> > > > +       {"test_read_write", SETUP_SYSCALL_SLEEP},
> > > > +       {"test_data_slice", SETUP_SYSCALL_SLEEP},
> > > > +       {"test_ringbuf", SETUP_SYSCALL_SLEEP},
> > > > +       {"test_skb_readonly", SETUP_SKB_PROG},
> > >
> > > nit: wouldn't it be better to add test_setup_type to dynptr_tests (and
> > > keep fail and success cases together)? It's conceivable that you might
> > > want different setups to test different error conditions, right?
> >
> > Yeah! I originally separated it out because the success tests don't
> > have an error message while the fail ones do, and fail ones don't have
> > a setup (I don't think we'll need any custom userspace setup for those
> > since we're checking for verifier failures at prog load time) and the
> > success ones do. But I can combine them into 1 so that it's simpler. I
> > will do this in v2.
>
> great, thanks! you might actually need custom setup for SKB vs XDP
> programs if you are unifying bpf_dynptr_from_packet?

Yes I think so for the success cases (for the fail cases, I think just
having SEC("xdp") and SEC("tc") is sufficient)

>
> > >
> > > >  };
> > > >
> > > >  static void verify_fail(const char *prog_name, const char *expected_err_msg)
> > > > @@ -85,7 +100,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
> > > >         dynptr_fail__destroy(skel);
> > > >  }
> > > >
>
> [...]
>
> > > > +/* The data slice is invalidated whenever a helper changes packet data */
> > > > +SEC("?xdp")
> > > > +int xdp_invalid_data_slice(struct xdp_md *xdp)
> > > > +{
> > > > +       struct bpf_dynptr ptr;
> > > > +       struct ethhdr *hdr1, *hdr2;
> > > > +
> > > > +       bpf_dynptr_from_xdp(xdp, 0, &ptr);
> > > > +       hdr1 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr1));
> > > > +       if (!hdr1)
> > > > +               return XDP_DROP;
> > > > +
> > > > +       hdr2 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr2));
> > > > +       if (!hdr2)
> > > > +               return XDP_DROP;
> > > > +
> > > > +       hdr1->h_proto = 12;
> > > > +       hdr2->h_proto = 12;
> > > > +
> > > > +       if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr1)))
> > > > +               return XDP_DROP;
> > > > +
> > > > +       /* this should fail */
> > > > +       hdr2->h_proto = 1;
> > >
> > > is there something special about having both hdr1 and hdr2? Wouldn't
> > > this test work with just single hdr pointer?
> >
> > Yes, this test would work with just a single hdr pointer (which is
> > what skb_invalid_data_slice does) but I wanted to ensure that this
> > also works in the case of multiple data slices. If you think this is
> > unnecessary / adds clutter, I can remove hdr2.
>
>
> I think testing two pointers isn't the point, so I'd keep the test
> minimal. It seems like testing two pointers should be in a success
> test, to prove it works, rather than as some side effect of an
> expected-to-fail test, no?

My intention was to test that both pointers stop working (eg that when
you have multiple data slices, changing the packet buffer should
invalidate all of them, so that any attempt to access any slice
fails), so testing that would have to stay in the verifier_fail test.
But I agree this might end up being more confusing than helpful, so I
will just remove hdr2 for v2 :)
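
If coverage for two live slices still seems worth having, it could be
a small success test instead, roughly (a sketch: the same prog minus
the adjust_head and the write-after-invalidate):

	hdr1 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr1));
	hdr2 = bpf_dynptr_data(&ptr, 0, sizeof(*hdr2));
	if (!hdr1 || !hdr2)
		return XDP_DROP;
	/* no packet-mutating helper in between, so both slices stay valid */
	hdr1->h_proto = 12;
	hdr2->h_proto = 12;
	return XDP_PASS;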

>
> >
> > >
> > > > +
> > > > +       return XDP_PASS;
> > > > +}
> > > > +
> > > > +/* Only supported prog type can create skb-type dynptrs */
> > >
> > > [...]
> > >
> > > > +       err = 1;
> > > > +
> > > > +       if (bpf_dynptr_from_skb(ctx, 0, &ptr))
> > > > +               return 0;
> > > > +       err++;
> > > > +
> > > > +       data = bpf_dynptr_data(&ptr, 0, 1);
> > > > +       if (data)
> > > > +               /* it's an error if data is not NULL since cgroup skbs
> > > > +                * are read only
> > > > +                */
> > > > +               return 0;
> > > > +       err++;
> > > > +
> > > > +       ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
> > > > +       /* since cgroup skbs are read only, writes should fail */
> > > > +       if (ret != -EINVAL)
> > > > +               return 0;
> > > > +
> > > > +       err = 0;
> > >
> > > hm, if data is NULL you'll still report success if bpf_dynptr_write
> > > returns 0 or any other error but -EINVAL... The logic is a bit unclear
> > > here...
> > >
> > If data is NULL and bpf_dynptr_write returns 0 or any other error
> > besides -EINVAL, I think we report a failure here (err is set to a
> > non-zero value, which userspace checks against)
>
> oh, ok, I read it backwards. I find this "stateful increasing error
> number" pattern very confusing. Why not write it more
> straightforwardly as:
>
> if (bpf_dynptr_from_skb(...)) {
>     err = 1;
>     return 0;
> }
>
> data = bpf_dynptr_data(...);
> if (data) {
>     err = 2;
>     return 0;
> }
>
> ret = bpf_dynptr_write(...);
> if (ret != -EINVAL) {
>     err = 3;
>     return 0;
> }
>
> /* all tests passed */
> err = 0;
> return 0;
>
> ?
>

My thinking was that err++ would be more robust (eg if you add another
case to the code, then you don't have to go down and fix all the err
numbers to adjust them by 1). But you're right, I think this just ends
up being confusing / non-intuitive to read. I will change it to the
more explicit way for v2

> >
> > > > +
> > > > +       return 0;
> > > > +}
> > > > diff --git a/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > > > new file mode 100644
> > > > index 000000000000..c879dfb6370a
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/bpf/progs/test_dynptr_xdp.c
> > > > @@ -0,0 +1,115 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +
>
> [...]
>
> > > > +       hdr_len = data[1];
> > > > +       if (hdr_len > *hdr_bytes_remaining)
> > > > +               return -1;
> > > > +
> > > > +       if (kind == tcp_hdr_opt_kind_tpr) {
> > > > +               if (hdr_len != tcp_hdr_opt_len_tpr)
> > > > +                       return -1;
> > > > +
> > > > +               memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
> > >
> > > this implicitly relies on compiler inlining memcpy, let's use
> > > __builtin_memcpy() here instead to set a good example?
> >
> > Sounds good, I will change this for v2. Should memcpys in bpf progs
> > always use __builtin_memcpy or is it on a case-by-case basis where if
> > the size is small enough, then you use it?
>
> __builtin_memcpy() is best. When we write just "memcpy()" we still
> rely on the compiler to optimize that into __builtin_memcpy(),
> because there is no memcpy() implementation (we'd get an unrecognized
> extern error if the compiler actually emitted a call to memcpy()).

Ohh I see, thanks for the explanation!

I am going to do some selftests cleanup this week, so I'll change the
other usages of memcpy() to __builtin_memcpy() as part of that clean
up.

>
> > >
> > > > +               return 1;
> > > > +       }
> > > > +
> > > > +       *off += hdr_len;
> > > > +       *hdr_bytes_remaining -= hdr_len;
> > > > +
> > > > +       return 0;
> > > > +}
> > > > +
>
> [...]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-07-29 21:39       ` Martin KaFai Lau
  2022-08-01 17:52         ` Joanne Koong
@ 2022-08-03 20:29         ` Joanne Koong
  2022-08-03 20:36           ` Andrii Nakryiko
                             ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-03 20:29 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Fri, Jul 29, 2022 at 2:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Fri, Jul 29, 2022 at 01:26:31PM -0700, Joanne Koong wrote:
> > On Thu, Jul 28, 2022 at 4:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > >
> > > On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> > > > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> > > >       if (bpf_dynptr_is_rdonly(ptr))
> > > Is it possible to allow data slice for rdonly dynptr-skb?
> > > and depends on the may_access_direct_pkt_data() check in the verifier.
> >
> > Ooh great idea. This should be very simple to do, since the data slice
> > that gets returned is assigned as PTR_TO_PACKET. So any stx operations
> > on it will by default go through the may_access_direct_pkt_data()
> > check. I'll add this for v2.
> It will be great.  Out of all three helpers (bpf_dynptr_read/write/data),
> bpf_dynptr_data will be the useful one to parse the header data (e.g. tcp-hdr-opt)
> that has runtime variable length because bpf_dynptr_data() can take a non-const
> 'offset' argument.  It is useful to get a consistent usage across all bpf
> prog types that are either read-only or read-write of the skb.
>
> >
> > >
> > > >               return 0;
> > > >
> > > > +     type = bpf_dynptr_get_type(ptr);
> > > > +
> > > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > > +             struct sk_buff *skb = ptr->data;
> > > > +
> > > > +             /* if the data is paged, the caller needs to pull it first */
> > > > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > > > +                     return 0;
> > > > +
> > > > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > > > +     }
> > > > +
> > > >       return (unsigned long)(ptr->data + ptr->offset + offset);
> > > >  }
> > >
> > > [ ... ]
> > >
> > > > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > > > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > > > +                                    struct bpf_call_arg_meta *meta)
> > > >  {
> > > >       struct bpf_func_state *state = func(env, reg);
> > > >       int spi = get_spi(reg->off);
> > > >
> > > > -     return state->stack[spi].spilled_ptr.id;
> > > > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> > > >  }
> > > >
> > > >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > >                               case DYNPTR_TYPE_RINGBUF:
> > > >                                       err_extra = "ringbuf ";
> > > >                                       break;
> > > > +                             case DYNPTR_TYPE_SKB:
> > > > +                                     err_extra = "skb ";
> > > > +                                     break;
> > > >                               default:
> > > >                                       break;
> > > >                               }
> > > > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> > > >                                       return -EFAULT;
> > > >                               }
> > > > -                             /* Find the id of the dynptr we're tracking the reference of */
> > > > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > > > +                             /* Find the id and the type of the dynptr we're tracking
> > > > +                              * the reference of.
> > > > +                              */
> > > > +                             stack_slot_get_dynptr_info(env, reg, meta);
> > > >                       }
> > > >               }
> > > >               break;
> > > > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> > > >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> > > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > > > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > +             else
> > > > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > check_packet_access() uses range.
> > > It took me a while to figure out that range and mem_size are in a union.
> > > Mentioning here in case someone has similar question.
> > For v2, I'll add this as a comment in the code or I'll include
> > "regs[BPF_REG_0].range = meta.mem_size" explicitly to make it more
> > obvious :)
> 'regs[BPF_REG_0].range = meta.mem_size' would be great.  No strong
> opinion here.
>
> > >
> > > >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> > > >               const struct btf_type *t;
> > > > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > >                       goto patch_call_imm;
> > > >               }
> > > >
> > > > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > > > +                     else
> > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > > > +                     insn_buf[1] = *insn;
> > > > +                     cnt = 2;
> > > > +
> > > > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > > +                     if (!new_prog)
> > > > +                             return -ENOMEM;
> > > > +
> > > > +                     delta += cnt - 1;
> > > > +                     env->prog = new_prog;
> > > > +                     prog = new_prog;
> > > > +                     insn = new_prog->insnsi + i + delta;
> > > > +                     goto patch_call_imm;
> > > > +             }
> > > Have you considered rejecting bpf_dynptr_write()
> > > at prog load time?
> > It's possible to reject bpf_dynptr_write() at prog load time but would
> > require adding tracking in the verifier for whether a dynptr is
> > read-only or not. Do you think it's better to reject it at load time
> > instead of returning NULL at runtime?
> The check_helper_call above seems to know 'meta.type == BPF_DYNPTR_TYPE_SKB'.
> Together with may_access_direct_pkt_data(), would it be enough?
> Then no need to do patching for BPF_FUNC_dynptr_from_skb here.

Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
to be patched regardless in order to set the rd-only flag in the
metadata for the dynptr. There will be other helper functions that
write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
probe read user with dynptrs, ...) so I think it's more scalable if we
reject these writes at runtime through the rd-only flag in the
metadata, than for the verifier to custom-case that any helper funcs
that write into dynptrs will need to get dynptr type + do
may_access_direct_pkt_data() if it's type skb or xdp. The
inconsistency between not rd-only in metadata vs. rd-only in verifier
might be a little confusing as well.
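
Concretely, the patched-in hidden arg would just tell the helper
whether the prog is read-only w.r.t. pkt data, and the helper would
mark the dynptr, roughly (a sketch; the set-rdonly helper name is
illustrative):

BPF_CALL_4(bpf_dynptr_from_skb, struct sk_buff *, skb, u64, flags,
	   struct bpf_dynptr_kern *, ptr, u32, is_rdonly)
{
	/* ... flags checks ... */
	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
	if (is_rdonly)
		bpf_dynptr_set_rdonly(ptr);	/* illustrative name */

	return 0;
}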

For these reasons, I'm leaning more towards having bpf_dynptr_write()
and other dynptr write helper funcs be rejected at runtime instead of
prog load time, but I'm eager to hear what you prefer.

What are your thoughts?

>
> Since we are on bpf_dynptr_write, what is the reason
> for limiting it to skb_headlen()?  Not implying one
> way is better than another.  Would like to understand the reason
> behind it since it is not clear in the commit message.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-03 20:29         ` Joanne Koong
@ 2022-08-03 20:36           ` Andrii Nakryiko
  2022-08-03 20:56           ` Martin KaFai Lau
  2022-08-03 23:25           ` Jakub Kicinski
  2 siblings, 0 replies; 52+ messages in thread
From: Andrii Nakryiko @ 2022-08-03 20:36 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Martin KaFai Lau, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Wed, Aug 3, 2022 at 1:29 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Fri, Jul 29, 2022 at 2:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Fri, Jul 29, 2022 at 01:26:31PM -0700, Joanne Koong wrote:
> > > On Thu, Jul 28, 2022 at 4:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> > > > > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> > > > >       if (bpf_dynptr_is_rdonly(ptr))
> > > > Is it possible to allow data slice for rdonly dynptr-skb?
> > > > and depends on the may_access_direct_pkt_data() check in the verifier.
> > >
> > > Ooh great idea. This should be very simple to do, since the data slice
> > > that gets returned is assigned as PTR_TO_PACKET. So any stx operations
> > > on it will by default go through the may_access_direct_pkt_data()
> > > check. I'll add this for v2.
> > It will be great.  Out of all three helpers (bpf_dynptr_read/write/data),
> > bpf_dynptr_data will be the useful one to parse the header data (e.g. tcp-hdr-opt)
> > that has runtime variable length because bpf_dynptr_data() can take a non-const
> > 'offset' argument.  It is useful to get a consistent usage across all bpf
> > prog types that are either read-only or read-write of the skb.
> >
> > >
> > > >
> > > > >               return 0;
> > > > >
> > > > > +     type = bpf_dynptr_get_type(ptr);
> > > > > +
> > > > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > > > +             struct sk_buff *skb = ptr->data;
> > > > > +
> > > > > +             /* if the data is paged, the caller needs to pull it first */
> > > > > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > > > > +                     return 0;
> > > > > +
> > > > > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > > > > +     }
> > > > > +
> > > > >       return (unsigned long)(ptr->data + ptr->offset + offset);
> > > > >  }
> > > >
> > > > [ ... ]
> > > >
> > > > > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > > > > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > > > > +                                    struct bpf_call_arg_meta *meta)
> > > > >  {
> > > > >       struct bpf_func_state *state = func(env, reg);
> > > > >       int spi = get_spi(reg->off);
> > > > >
> > > > > -     return state->stack[spi].spilled_ptr.id;
> > > > > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > > > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> > > > >  }
> > > > >
> > > > >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > >                               case DYNPTR_TYPE_RINGBUF:
> > > > >                                       err_extra = "ringbuf ";
> > > > >                                       break;
> > > > > +                             case DYNPTR_TYPE_SKB:
> > > > > +                                     err_extra = "skb ";
> > > > > +                                     break;
> > > > >                               default:
> > > > >                                       break;
> > > > >                               }
> > > > > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> > > > >                                       return -EFAULT;
> > > > >                               }
> > > > > -                             /* Find the id of the dynptr we're tracking the reference of */
> > > > > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > > > > +                             /* Find the id and the type of the dynptr we're tracking
> > > > > +                              * the reference of.
> > > > > +                              */
> > > > > +                             stack_slot_get_dynptr_info(env, reg, meta);
> > > > >                       }
> > > > >               }
> > > > >               break;
> > > > > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > > >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> > > > >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> > > > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > > > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > > > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > > > > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > > +             else
> > > > > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > check_packet_access() uses range.
> > > > It took me a while to figure out that range and mem_size are in a union.
> > > > Mentioning it here in case someone has a similar question.
> > > For v2, I'll add this as a comment in the code or I'll include
> > > "regs[BPF_REG_0].range = meta.mem_size" explicitly to make it more
> > > obvious :)
> > 'regs[BPF_REG_0].range = meta.mem_size' would be great.  No strong
> > opinion here.
> >
> > > >
> > > > >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> > > > >               const struct btf_type *t;
> > > > > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > > >                       goto patch_call_imm;
> > > > >               }
> > > > >
> > > > > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > > > > +                     else
> > > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > > > > +                     insn_buf[1] = *insn;
> > > > > +                     cnt = 2;
> > > > > +
> > > > > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > > > +                     if (!new_prog)
> > > > > +                             return -ENOMEM;
> > > > > +
> > > > > +                     delta += cnt - 1;
> > > > > +                     env->prog = new_prog;
> > > > > +                     prog = new_prog;
> > > > > +                     insn = new_prog->insnsi + i + delta;
> > > > > +                     goto patch_call_imm;
> > > > > +             }
> > > > Have you considered rejecting bpf_dynptr_write()
> > > > at prog load time?
> > > It's possible to reject bpf_dynptr_write() at prog load time but would
> > > require adding tracking in the verifier for whether a dynptr is
> > > read-only or not. Do you think it's better to reject it at load time
> > > instead of returning NULL at runtime?
> > The check_helper_call above seems to know 'meta.type == BPF_DYNPTR_TYPE_SKB'.
> > Together with may_access_direct_pkt_data(), would it be enough?
> > Then no need to do patching for BPF_FUNC_dynptr_from_skb here.
>
> Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
> to be patched regardless in order to set the rd-only flag in the
> metadata for the dynptr. There will be other helper functions that
> write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
> probe read user with dynptrs, ...) so I think it's more scalable if we
> reject these writes at runtime through the rd-only flag in the
> metadata, than for the verifier to custom-case that any helper funcs
> that write into dynptrs will need to get dynptr type + do
> may_access_direct_pkt_data() if it's type skb or xdp. The
> inconsistency between not rd-only in metadata vs. rd-only in verifier
> might be a little confusing as well.
>
> For these reasons, I'm leaning more towards having bpf_dynptr_write()
> and other dynptr write helper funcs be rejected at runtime instead of
> prog load time, but I'm eager to hear what you prefer.
>

+1, that's sort of the point of dynptr, to move checks into
runtime and leave BPF verifier simpler

> What are your thoughts?
>
> >
> > Since we are on bpf_dynptr_write, what is the reason
> > for limiting it to skb_headlen()?  Not implying one
> > way is better than another.  I would like to understand the reason
> > behind it since it is not clear in the commit message.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-03 20:29         ` Joanne Koong
  2022-08-03 20:36           ` Andrii Nakryiko
@ 2022-08-03 20:56           ` Martin KaFai Lau
  2022-08-03 23:25           ` Jakub Kicinski
  2 siblings, 0 replies; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-03 20:56 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Wed, Aug 03, 2022 at 01:29:37PM -0700, Joanne Koong wrote:
> On Fri, Jul 29, 2022 at 2:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Fri, Jul 29, 2022 at 01:26:31PM -0700, Joanne Koong wrote:
> > > On Thu, Jul 28, 2022 at 4:39 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > >
> > > > On Tue, Jul 26, 2022 at 11:47:04AM -0700, Joanne Koong wrote:
> > > > > @@ -1567,6 +1607,18 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
> > > > >       if (bpf_dynptr_is_rdonly(ptr))
> > > > Is it possible to allow data slice for rdonly dynptr-skb?
> > > > and depends on the may_access_direct_pkt_data() check in the verifier.
> > >
> > > Ooh great idea. This should be very simple to do, since the data slice
> > > that gets returned is assigned as PTR_TO_PACKET. So any stx operations
> > > on it will by default go through the may_access_direct_pkt_data()
> > > check. I'll add this for v2.
> > It will be great.  Out of all three helpers (bpf_dynptr_read/write/data),
> > bpf_dynptr_data will be the useful one to parse the header data (e.g. tcp-hdr-opt)
> > that has runtime variable length because bpf_dynptr_data() can take a non-const
> > 'offset' argument.  It is useful to get a consistent usage across all bpf
> > prog types that are either read-only or read-write of the skb.
> >
> > >
> > > >
> > > > >               return 0;
> > > > >
> > > > > +     type = bpf_dynptr_get_type(ptr);
> > > > > +
> > > > > +     if (type == BPF_DYNPTR_TYPE_SKB) {
> > > > > +             struct sk_buff *skb = ptr->data;
> > > > > +
> > > > > +             /* if the data is paged, the caller needs to pull it first */
> > > > > +             if (ptr->offset + offset + len > skb->len - skb->data_len)
> > > > > +                     return 0;
> > > > > +
> > > > > +             return (unsigned long)(skb->data + ptr->offset + offset);
> > > > > +     }
> > > > > +
> > > > >       return (unsigned long)(ptr->data + ptr->offset + offset);
> > > > >  }
> > > >
> > > > [ ... ]
> > > >
> > > > > -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > > > > +static void stack_slot_get_dynptr_info(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
> > > > > +                                    struct bpf_call_arg_meta *meta)
> > > > >  {
> > > > >       struct bpf_func_state *state = func(env, reg);
> > > > >       int spi = get_spi(reg->off);
> > > > >
> > > > > -     return state->stack[spi].spilled_ptr.id;
> > > > > +     meta->ref_obj_id = state->stack[spi].spilled_ptr.id;
> > > > > +     meta->type = state->stack[spi].spilled_ptr.dynptr.type;
> > > > >  }
> > > > >
> > > > >  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > > @@ -6052,6 +6057,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > >                               case DYNPTR_TYPE_RINGBUF:
> > > > >                                       err_extra = "ringbuf ";
> > > > >                                       break;
> > > > > +                             case DYNPTR_TYPE_SKB:
> > > > > +                                     err_extra = "skb ";
> > > > > +                                     break;
> > > > >                               default:
> > > > >                                       break;
> > > > >                               }
> > > > > @@ -6065,8 +6073,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > > > >                                       verbose(env, "verifier internal error: multiple refcounted args in BPF_FUNC_dynptr_data");
> > > > >                                       return -EFAULT;
> > > > >                               }
> > > > > -                             /* Find the id of the dynptr we're tracking the reference of */
> > > > > -                             meta->ref_obj_id = stack_slot_get_id(env, reg);
> > > > > +                             /* Find the id and the type of the dynptr we're tracking
> > > > > +                              * the reference of.
> > > > > +                              */
> > > > > +                             stack_slot_get_dynptr_info(env, reg, meta);
> > > > >                       }
> > > > >               }
> > > > >               break;
> > > > > @@ -7406,7 +7416,11 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > > >               regs[BPF_REG_0].type = PTR_TO_TCP_SOCK | ret_flag;
> > > > >       } else if (base_type(ret_type) == RET_PTR_TO_ALLOC_MEM) {
> > > > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > > > -             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > > > +                 meta.type == BPF_DYNPTR_TYPE_SKB)
> > > > > +                     regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > > +             else
> > > > > +                     regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > check_packet_access() uses range.
> > > > It took me a while to figure out that range and mem_size are in a union.
> > > > Mentioning it here in case someone has a similar question.
> > > For v2, I'll add this as a comment in the code or I'll include
> > > "regs[BPF_REG_0].range = meta.mem_size" explicitly to make it more
> > > obvious :)
> > 'regs[BPF_REG_0].range = meta.mem_size' would be great.  No strong
> > opinion here.
> >
> > > >
> > > > >       } else if (base_type(ret_type) == RET_PTR_TO_MEM_OR_BTF_ID) {
> > > > >               const struct btf_type *t;
> > > > > @@ -14132,6 +14146,25 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > > > >                       goto patch_call_imm;
> > > > >               }
> > > > >
> > > > > +             if (insn->imm == BPF_FUNC_dynptr_from_skb) {
> > > > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, true);
> > > > > +                     else
> > > > > +                             insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, false);
> > > > > +                     insn_buf[1] = *insn;
> > > > > +                     cnt = 2;
> > > > > +
> > > > > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > > > > +                     if (!new_prog)
> > > > > +                             return -ENOMEM;
> > > > > +
> > > > > +                     delta += cnt - 1;
> > > > > +                     env->prog = new_prog;
> > > > > +                     prog = new_prog;
> > > > > +                     insn = new_prog->insnsi + i + delta;
> > > > > +                     goto patch_call_imm;
> > > > > +             }
> > > > Have you considered rejecting bpf_dynptr_write()
> > > > at prog load time?
> > > It's possible to reject bpf_dynptr_write() at prog load time but would
> > > require adding tracking in the verifier for whether a dynptr is
> > > read-only or not. Do you think it's better to reject it at load time
> > > instead of returning NULL at runtime?
> > The check_helper_call above seems to know 'meta.type == BPF_DYNPTR_TYPE_SKB'.
> > Together with may_access_direct_pkt_data(), would it be enough?
> > Then no need to do patching for BPF_FUNC_dynptr_from_skb here.
> 
> Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
> to be patched regardless in order to set the rd-only flag in the
> metadata for the dynptr. There will be other helper functions that
> write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
> probe read user with dynptrs, ...) so I think it's more scalable if we
> reject these writes at runtime through the rd-only flag in the
> metadata, than for the verifier to custom-case that any helper funcs
> that write into dynptrs will need to get dynptr type + do
> may_access_direct_pkt_data() if it's type skb or xdp. The
> inconsistency between not rd-only in metadata vs. rd-only in verifier
> might be a little confusing as well.
> 
> For these reasons, I'm leaning more towards having bpf_dynptr_write()
> and other dynptr write helper funcs be rejected at runtime instead of
> prog load time, but I'm eager to hear what you prefer.
> 
> What are your thoughts?
Sure, as long as bpf_dynptr_data() is not restricted by the rdonly dynptr.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-03 20:29         ` Joanne Koong
  2022-08-03 20:36           ` Andrii Nakryiko
  2022-08-03 20:56           ` Martin KaFai Lau
@ 2022-08-03 23:25           ` Jakub Kicinski
  2022-08-04  1:05             ` Joanne Koong
                               ` (2 more replies)
  2 siblings, 3 replies; 52+ messages in thread
From: Jakub Kicinski @ 2022-08-03 23:25 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Martin KaFai Lau, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Wed, 3 Aug 2022 13:29:37 -0700 Joanne Koong wrote:
> Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
> to be patched regardless in order to set the rd-only flag in the
> metadata for the dynptr. There will be other helper functions that
> write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
> probe read user with dynptrs, ...) so I think it's more scalable if we
> reject these writes at runtime through the rd-only flag in the
> metadata, than for the verifier to custom-case that any helper funcs
> that write into dynptrs will need to get dynptr type + do
> may_access_direct_pkt_data() if it's type skb or xdp. The
> inconsistency between not rd-only in metadata vs. rd-only in verifier
> might be a little confusing as well.
> 
> For these reasons, I'm leaning more towards having bpf_dynptr_write()
> and other dynptr write helper funcs be rejected at runtime instead of
> prog load time, but I'm eager to hear what you prefer.
> 
> What are your thoughts?

Oh. I thought dynptrs are an extension of the discussion we had about
creating a skb_header_pointer()-like abstraction but it sounds like 
we veered quite far off that track at some point :(

The point of skb_header_pointer() is to expose the chunk of the packet
pointed to by [skb, offset, len] as a linear buffer. Potentially copying
it out to a stack buffer *IFF* the header is not contiguous inside the
skb head, which should very rarely happen.

Here it seems we return an error so that user must pull if the data is
not linear, which is defeating the purpose. The user of
skb_header_pointer() wants to avoid the copy while _reliably_ getting 
a contiguous pointer. Plus pulling in the header may be far more
expensive than a small copy to the stack.

The pointer returned by skb_header_pointer is writable, but it's not
guaranteed that the writes go to the packet, they may go to the
on-stack buffer, so the caller must do some sort of:

	if (data_ptr == stack_buf)
		skb_store_bits(...);

Which we were thinking of wrapping in some sort of flush operation.
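
Spelled out, the kernel-side pattern looks roughly like this (just a
sketch; 'off' and the tcphdr are illustrative):

    struct tcphdr _th, *th;

    th = skb_header_pointer(skb, off, sizeof(_th), &_th);
    if (!th)
        return -EINVAL;     /* [off, off + len) is outside skb->len */

    th->check = 0;          /* may hit the packet or the stack copy */
    if (th == &_th)         /* write landed in the stack buffer */
        skb_store_bits(skb, off, &_th, sizeof(_th));    /* flush it */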

If I'm reading this right, dynptrs as implemented here do not provide
such semantics, am I confused in thinking that this is a continuation
of the XDP multi-buff discussion? Is it a completely separate thing
and we'll still need a header_pointer like helper?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-03 23:25           ` Jakub Kicinski
@ 2022-08-04  1:05             ` Joanne Koong
  2022-08-04  1:34               ` Jakub Kicinski
  2022-08-04  1:27             ` Martin KaFai Lau
  2022-08-04 22:58             ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-04  1:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Martin KaFai Lau, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Wed, Aug 3, 2022 at 4:25 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 3 Aug 2022 13:29:37 -0700 Joanne Koong wrote:
> > Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
> > to be patched regardless in order to set the rd-only flag in the
> > metadata for the dynptr. There will be other helper functions that
> > write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
> > probe read user with dynptrs, ...) so I think it's more scalable if we
> > reject these writes at runtime through the rd-only flag in the
> > metadata, than for the verifier to custom-case that any helper funcs
> > that write into dynptrs will need to get dynptr type + do
> > may_access_direct_pkt_data() if it's type skb or xdp. The
> > inconsistency between not rd-only in metadata vs. rd-only in verifier
> > might be a little confusing as well.
> >
> > For these reasons, I'm leaning more towards having bpf_dynptr_write()
> > and other dynptr write helper funcs be rejected at runtime instead of
> > prog load time, but I'm eager to hear what you prefer.
> >
> > What are your thoughts?
>
> Oh. I thought dynptrs are an extension of the discussion we had about
> creating a skb_header_pointer()-like abstraction but it sounds like
> we veered quite far off that track at some point :(

I think the problem is that the skb may be cloned, so a write into any
portion of the paged data requires a pull. If it weren't for this,
then we could do the write and checksumming without pulling (eg kmap
the page, get the csum_partial of the bytes you'll write over, do the
write, get the csum_partial of the bytes you just wrote, then kunmap,
then update skb->csum to be skb->csum - csum of the bytes you wrote
over + csum of the bytes you wrote). I think we would even be able to
provide a direct data slice to non-contiguous pages without needing
the additional copy to a stack buffer (eg kmap the non-contiguous
pages to a contiguous virtual address that we pass back to the bpf
program, and then when the bpf program is finished do the cleanup for
the mappings).
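
A rough sketch of the checksum part (kernel side, assuming
CHECKSUM_COMPLETE; the kmap/kunmap and error handling are omitted):

    __wsum old, new;

    old = csum_partial(dst, len, 0);    /* bytes being replaced */
    memcpy(dst, src, len);
    new = csum_partial(dst, len, 0);    /* bytes just written */
    skb->csum = csum_add(csum_sub(skb->csum, old), new);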

Three ideas I'm thinking through as a possible solution:
1) Enforce that the skb is always uncloned for skb-type bpf progs (we
currently do this just for the skb head, see bpf_unclone_prologue()),
but I'm not sure if the trade-off (pulling all the packet data, even
if it won't be used) is acceptable.

2) Don't support cloned skbs for bpf_dynptr_write/data and don't do
any pulling. If the prog wants to use bpf_dynptr_write/data, then they
have to pull it first

3) (uglier than #1 and #2) For bpf_dynptr_write()s, pull if the write
is to a paged area and the skb is cloned, otherwise write to the paged
area without pulling; if we do this, then we always have to invalidate
all data slices associated with the skb (even for writes to the head)
since at prog load time, the verifier doesn't know if the pull happens
or not. For bpf_dynptr_data()s, follow the same policy.

I'm leaning towards 2. What are your thoughts?

>
> The point of skb_header_pointer() is to expose the chunk of the packet
> pointed to by [skb, offset, len] as a linear buffer. Potentially copying
> it out to a stack buffer *IFF* the header is not contiguous inside the
> skb head, which should very rarely happen.
>
> Here it seems we return an error so that user must pull if the data is
> not linear, which is defeating the purpose. The user of
> skb_header_pointer() wants to avoid the copy while _reliably_ getting
> a contiguous pointer. Plus pulling in the header may be far more
> expensive than a small copy to the stack.
>
> The pointer returned by skb_header_pointer is writable, but it's not
> guaranteed that the writes go to the packet, they may go to the
> on-stack buffer, so the caller must do some sort of:
>
>         if (data_ptr == stack_buf)
>                 skb_store_bits(...);
>
> Which we were thinking of wrapping in some sort of flush operation.
>
> If I'm reading this right, dynptrs as implemented here do not provide
> such semantics, am I confused in thinking that this is a continuation
> of the XDP multi-buff discussion? Is it a completely separate thing
> and we'll still need a header_pointer like helper?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-03 23:25           ` Jakub Kicinski
  2022-08-04  1:05             ` Joanne Koong
@ 2022-08-04  1:27             ` Martin KaFai Lau
  2022-08-04  1:44               ` Jakub Kicinski
  2022-08-04 22:58             ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 52+ messages in thread
From: Martin KaFai Lau @ 2022-08-04  1:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Wed, Aug 03, 2022 at 04:25:40PM -0700, Jakub Kicinski wrote:
> The point of skb_header_pointer() is to expose the chunk of the packet
> pointed to by [skb, offset, len] as a linear buffer. Potentially copying
> it out to a stack buffer *IFF* the header is not contiguous inside the
> skb head, which should very rarely happen.
> 
> Here it seems we return an error so that user must pull if the data is
> not linear, which is defeating the purpose. The user of
> skb_header_pointer() wants to avoid the copy while _reliably_ getting 
> a contiguous pointer. Plus pulling in the header may be far more
> expensive than a small copy to the stack.
> 
> The pointer returned by skb_header_pointer is writable, but it's not
> guaranteed that the writes go to the packet, they may go to the
> on-stack buffer, so the caller must do some sort of:
> 
> 	if (data_ptr == stack_buf)
> 		skb_store_bits(...);
> 
> Which we were thinking of wrapping in some sort of flush operation.
Curious about the idea.  I don't know whether this should be a dynptr
helper or a specific pkt helper though.

The idea is to have the prog keep writing to a ptr (skb->data or stack_buf).
When the prog is done, call a bpf helper to flush.  The helper
decides if it needs to flush from stack_buf to skb and
will take care of the cloned skb ?

> 
> If I'm reading this right, dynptrs as implemented here do not provide
> such semantics, am I confused in thinking that this is a continuation
> of the XDP multi-buff discussion? Is it a completely separate thing
> and we'll still need a header_pointer like helper?
Can you share a pointer to the XDP multi-buff discussion?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-04  1:05             ` Joanne Koong
@ 2022-08-04  1:34               ` Jakub Kicinski
  2022-08-04  3:44                 ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Jakub Kicinski @ 2022-08-04  1:34 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Martin KaFai Lau, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Wed, 3 Aug 2022 18:05:44 -0700 Joanne Koong wrote:
> I think the problem is that the skb may be cloned, so a write into any
> portion of the paged data requires a pull. If it weren't for this,
> then we could do the write and checksumming without pulling (eg kmap
> the page, get the csum_partial of the bytes you'll write over, do the
> write, get the csum_partial of the bytes you just wrote, then kunmap,
> then update skb->csum to be skb->csum - csum of the bytes you wrote
> over + csum of the bytes you wrote). I think we would even be able to
> provide a direct data slice to non-contiguous pages without needing
> the additional copy to a stack buffer (eg kmap the non-contiguous
> pages to a contiguous virtual address that we pass back to the bpf
> program, and then when the bpf program is finished do the cleanup for
> the mappings).

The whole read/write/data concept is not a great match for packet
parsing. Primary use for packet parsing is that you want to read
a header and not have to deal with frags or pulling. In that case
you should get a direct pointer or a copy on the stack, transparently.

Maybe before I go on talking nonsense I should read up on what dynptr
is and what it's trying to achieve. Stan says its like unique_ptr in
C++ which tells me.. nothing :)

$ git grep dynptr -- Documentation/
$

Any pointers?

> Three ideas I'm thinking through as a possible solution:
> 1) Enforce that the skb is always uncloned for skb-type bpf progs (we
> currently do this just for the skb head, see bpf_unclone_prologue()),
> but I'm not sure if the trade-off (pulling all the packet data, even
> if it won't be used) is acceptable.
> 
> 2) Don't support cloned skbs for bpf_dynptr_write/data and don't do
> any pulling. If the prog wants to use bpf_dynptr_write/data, then they
> have to pull it first

I think all output skbs from TCP are cloned, so that's not gonna work.

> 3) (uglier than #1 and #2) For bpf_dynptr_write()s, pull if the write
> is to a paged area and the skb is cloned, otherwise write to the paged
> area without pulling; if we do this, then we always have to invalidate
> all data slices associated with the skb (even for writes to the head)
> since at prog load time, the verifier doesn't know if the pull happens
> or not. For bpf_dynptr_data()s, follow the same policy.
> 
> I'm leaning towards 2. What are your thoughts?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-04  1:27             ` Martin KaFai Lau
@ 2022-08-04  1:44               ` Jakub Kicinski
  0 siblings, 0 replies; 52+ messages in thread
From: Jakub Kicinski @ 2022-08-04  1:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Joanne Koong, bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Wed, 3 Aug 2022 18:27:56 -0700 Martin KaFai Lau wrote:
> On Wed, Aug 03, 2022 at 04:25:40PM -0700, Jakub Kicinski wrote:
> > The point of skb_header_pointer() is to expose the chunk of the packet
> > pointed to by [skb, offset, len] as a linear buffer. Potentially copying
> > it out to a stack buffer *IFF* the header is not contiguous inside the
> > skb head, which should very rarely happen.
> > 
> > Here it seems we return an error so that user must pull if the data is
> > not linear, which is defeating the purpose. The user of
> > skb_header_pointer() wants to avoid the copy while _reliably_ getting 
> > a contiguous pointer. Plus pulling in the header may be far more
> > expensive than a small copy to the stack.
> > 
> > The pointer returned by skb_header_pointer is writable, but it's not
> > guaranteed that the writes go to the packet, they may go to the
> > on-stack buffer, so the caller must do some sort of:
> > 
> > 	if (data_ptr == stack_buf)
> > 		skb_store_bits(...);
> > 
> > Which we were thinking of wrapping in some sort of flush operation.  
> Curious about the idea.  I don't know whether this should be a dynptr
> helper or a specific pkt helper though.

Yeah, I could well have pattern-matched on the dynptr because it sounded
similar, but it's a completely different beast.

> The idea is to have the prog keep writing to a ptr (skb->data or stack_buf).

To be clear writing is a lot more rare than reading in this case.

> When the prog is done, call a bpf helper to flush.  The helper
> decides if it needs to flush from stack_buf to skb and
> will take care of the cloned skb ?

Yeah, I'd think for skb it'd just pull. Normally dealing with skbs
you'd indeed probably just pull upfront if you knew you're gonna write.
Hence saving yourself from the unnecessary trip thru the stack. But XDP
does not have strong pulling support, so if the interface must support
both then it's the lowest common denominator.

> > If I'm reading this right, dynptrs as implemented here do not provide
> > such semantics, am I confused in thinking that this is a continuation
> > of the XDP multi-buff discussion? Is it a completely separate thing
> > and we'll still need a header_pointer like helper?  
> Can you share a pointer to the XDP multi-buff discussion?

https://lore.kernel.org/all/20210916095539.4696ae27@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-04  1:34               ` Jakub Kicinski
@ 2022-08-04  3:44                 ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-04  3:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Martin KaFai Lau, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Wed, Aug 3, 2022 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 3 Aug 2022 18:05:44 -0700 Joanne Koong wrote:
> > I think the problem is that the skb may be cloned, so a write into any
> > portion of the paged data requires a pull. If it weren't for this,
> > then we could do the write and checksumming without pulling (eg kmap
> > the page, get the csum_partial of the bytes you'll write over, do the
> > write, get the csum_partial of the bytes you just wrote, then kunmap,
> > then update skb->csum to be skb->csum - csum of the bytes you wrote
> > over + csum of the bytes you wrote). I think we would even be able to
> > provide a direct data slice to non-contiguous pages without needing
> > the additional copy to a stack buffer (eg kmap the non-contiguous
> > pages to a contiguous virtual address that we pass back to the bpf
> > program, and then when the bpf program is finished do the cleanup for
> > the mappings).
>
> The whole read/write/data concept is not a great match for packet
> parsing. Primary use for packet parsing is that you want to read
> a header and not have to deal with frags or pulling. In that case
> you should get a direct pointer or a copy on the stack, transparently.

The selftests [0] includes some examples of packet parsing using
dynptrs. You're able to get a pointer to the headers (if it's in the
head) directly, or you can use bpf_dynptr_read() to read the data from
the frag into a buffer (without needing to pull; bpf_dynptr_read()
essentially just calls bpf_skb_load_bytes()).

Does this address the use cases you have in mind?

I think the pull and unclone stuff only pertains to writes into the
frags.

[0] https://lore.kernel.org/bpf/20220726184706.954822-4-joannelkoong@gmail.com/

>
> Maybe before I go on talking nonsense I should read up on what dynptr
> is and what it's trying to achieve. Stan says its like unique_ptr in
> C++ which tells me.. nothing :)
>
> $ git grep dynptr -- Documentation/
> $
>
> Any pointers?

Ooh thanks for the reminder, adding a page for the dynptr
documentation is on my to-do list

A dynptr (also known as a fat pointer in other languages) is a pointer
that stores extra metadata alongside the address it points to. In
particular, the metadata the bpf dynptr stores includes the length of
data it points to (so in the case of skb/xdp, the size of the packet),
the type of dynptr, and properties like whether it's read-only.
Dynptrs are an interface for bpf programs to dynamically access
memory safely, because the helper calls enforce that the accesses are in bounds.
For example, bpf_dynptr_read() allows reads at offsets and lengths not
statically known at compile time (in bpf_dynptr_read, the kernel uses
the metadata to check that the offset and length don't violate
memory bounds); without dynptrs, the verifier can't guarantee that the
offset or length of the read is safe since those values aren't
statically known at compile time, so for example you can't directly
read a dynamic number of bytes from a packet.
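
As a small sketch (using the helper signatures proposed in this
patchset, inside an skb prog where 'skb' is the ctx; parse_opt_len()
is a made-up stand-in for whatever computes the runtime length):

    struct bpf_dynptr ptr;
    __u8 buf[64];
    __u32 len;

    if (bpf_dynptr_from_skb(skb, 0, &ptr))
        return 0;
    len = parse_opt_len(skb);   /* runtime value, not a constant */
    if (len > sizeof(buf))
        return 0;
    /* bounds vs. the packet are enforced inside the helper */
    if (bpf_dynptr_read(buf, len, &ptr, 0))
        return 0;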

With regards to skb + xdp, the main use cases of dynptrs are to 1)
allow dynamic-size accesses of packet data and 2) allow easier and
simpler packet parsing (for example, accessing skb->data directly
requires multiple if checks for ensuring it's within bounds of
skb->data_end in order to satisfy the verifier; with the dynptr
interface, you are able to get a direct data slice and access it
without needing the checks. The selftests 3rd patch has some
demonstrations of this).
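
E.g. a data slice sketch (again with this patchset's API; the slice is
PTR_TO_PACKET, so loads/stores within it are verified, and parse_ipv4()
is hypothetical):

    struct ethhdr *eth;

    eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
    if (!eth)   /* NULL if the bytes aren't in the linear head */
        return 0;
    /* no skb->data / skb->data_end comparisons needed here */
    if (eth->h_proto == bpf_htons(ETH_P_IP))
        parse_ipv4(eth + 1);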

>
> > Three ideas I'm thinking through as a possible solution:
> > 1) Enforce that the skb is always uncloned for skb-type bpf progs (we
> > currently do this just for the skb head, see bpf_unclone_prologue()),
> > but I'm not sure if the trade-off (pulling all the packet data, even
> > if it won't be used) is acceptable.
> >
> > 2) Don't support cloned skbs for bpf_dynptr_write/data and don't do
> > any pulling. If the prog wants to use bpf_dynptr_write/data, then they
> > have to pull it first
>
> I think all output skbs from TCP are cloned, so that's not gonna work.
>
> > 3) (uglier than #1 and #2) For bpf_dynptr_write()s, pull if the write
> > is to a paged area and the skb is cloned, otherwise write to the paged
> > area without pulling; if we do this, then we always have to invalidate
> > all data slices associated with the skb (even for writes to the head)
> > since at prog load time, the verifier doesn't know if the pull happens
> > or not. For bpf_dynptr_data()s, follow the same policy.
> >
> > I'm leaning towards 2. What are your thoughts?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-03 16:11         ` Joanne Koong
@ 2022-08-04 18:45           ` Alexei Starovoitov
  2022-08-05 16:29             ` Joanne Koong
  0 siblings, 1 reply; 52+ messages in thread
From: Alexei Starovoitov @ 2022-08-04 18:45 UTC (permalink / raw)
  To: Joanne Koong
  Cc: Andrii Nakryiko, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Wed, Aug 3, 2022 at 9:11 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > __builtin_memcpy() is best. When we write just "memcpy()" we still
> > rely on compiler to actually optimizing that to __builtin_memcpy(),
> > because there is no memcpy() (we'd get unrecognized extern error if
> > compiler actually emitted call to memcpy()).
>
> Ohh I see, thanks for the explanation!
>
> I am going to do some selftests cleanup this week, so I'll change the
> other usages of memcpy() to __builtin_memcpy() as part of that clean
> up.

__builtin_memcpy might be doing single-byte copies when
alignment is not known, which is often the case when
working with packets.
If we do this cleanup, let's copy-paste cilium's memcpy
helper that does 8-byte copies.
It's much better than __builtin_memcpy.
https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h
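
Roughly the shape of it (a sketch of the idea, not cilium's actual
code; assumes len is a compile-time constant multiple of 8 so the
loop fully unrolls into 8-byte loads/stores):

    static __always_inline void memcpy_words(void *d, const void *s, int len)
    {
        int i;

        /* copy in u64-sized chunks instead of letting the compiler
         * fall back to byte copies when alignment is unknown
         */
        for (i = 0; i < len; i += 8)
            *(__u64 *)(d + i) = *(const __u64 *)(s + i);
    }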

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-02 22:21     ` Joanne Koong
@ 2022-08-04 21:46       ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-04 21:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov

On Tue, Aug 2, 2022 at 3:21 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 1, 2022 at 12:12 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jul 26, 2022 at 11:48 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > >
> > > -static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > > +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
> > >  {
> > > -       void *data_end = (void *)(long)xdp->data_end;
> > > -       void *data = (void *)(long)xdp->data;
> > > +       struct bpf_dynptr new_xdp_ptr;
> > >         struct iptnl_info *tnl;
> > >         struct ethhdr *new_eth;
> > >         struct ethhdr *old_eth;
> > > -       struct iphdr *iph = data + sizeof(struct ethhdr);
> > > +       struct iphdr *iph;
> > >         __u16 *next_iph;
> > >         __u16 payload_len;
> > >         struct vip vip = {};
> > > @@ -90,10 +90,12 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> > >         __u32 csum = 0;
> > >         int i;
> > >
> > > -       if (iph + 1 > data_end)
> > > +       iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz,
> > > +                             iphdr_sz + (tcphdr_sz > udphdr_sz ? tcphdr_sz : udphdr_sz));
> > > +       if (!iph)
> > >                 return XDP_DROP;
> >
> > dynptr based xdp/skb access looks neat.
> > Maybe in addition to bpf_dynptr_data() we can add helper(s)
> > that return skb/xdp_md from dynptr?
> > This way the code will be passing dynptr only and there will
> > be no need to pass around 'struct xdp_md *xdp' (like this function).
>
> Great idea! I'll add this to v2.

Thinking about this some more, I don't think the extra helpers will be
that useful. We'd have to add 2 custom helpers (bpf_dynptr_get_xdp +
bpf_dynptr_get_skb) and calling them would always require a null check
(since we'd return NULL if the dynptr is invalid/null). I think it'd
be faster / easier for the program to just pass in the ctx as an extra
arg in the special cases where it needs that.

>
> >
> > Separately please keep the existing tests instead of converting them.
> > Either ifdef data/data_end vs dynptr style or copy paste
> > the whole test into a new .c file. Whichever is cleaner.
>
> Will do for v2.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-02  2:12     ` Joanne Koong
@ 2022-08-04 21:55       ` Joanne Koong
  2022-08-05 23:22         ` Jakub Kicinski
  0 siblings, 1 reply; 52+ messages in thread
From: Joanne Koong @ 2022-08-04 21:55 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: bpf, andrii, daniel, ast

On Mon, Aug 1, 2022 at 7:12 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 1, 2022 at 4:33 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > (consider cross-posting network-related stuff to netdev@)
>
> Great, I will start cc-ing netdev@
>
> >
> > On Tue, 26 Jul 2022 11:47:04 -0700 Joanne Koong wrote:
> > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > benefits. One is that they allow operations on sizes that are not
> > > statically known at compile-time (eg variable-sized accesses).
> > > Another is that parsing the packet data through dynptrs (instead of
> > > through direct access of skb->data and skb->data_end) can be more
> > > ergonomic and less brittle (eg does not need manual if checking for
> > > being within bounds of data_end).
> >
> > Is there really a need for dynptr_from_{skb,xdp} to be different
> > function IDs? I was hoping this work would improve portability of
> > networking BPF programs across the hooks.
>
> Awesome, I like this idea of having just one generic API named
> something like bpf_dynptr_from_packet. I'll add this for v2!
>

Thinking about this some more, I don't think we get a lot of benefits
from combining it into one API (bpf_dynptr_from_packet) instead of 2
separate APIs (bpf_dynptr_from_skb / bpf_dynptr_from_xdp). The
bpf_dynptr_write behavior will be inconsistent (eg bpf_dynptr_write
into xdp frags will work whereas bpf_dynptr_write into skb frags will
fail). Martin also pointed out that he'd prefer bpf_dynptr_write() to
succeed for writing into frags and invalidate data slices (instead of
failing the write and always keeping data slices valid), which we
can't do if we combine xdp + skb, without always (needlessly)
invalidating xdp data slices whenever there's a write. Additionally,
in the verifier, there's no organic mapping between prog type -> prog
ctx, so we'll have to hardcode some mapping between prog type -> skb
vs. xdp ctx. I think for these reasons it makes more sense to have 2
separate APIs, instead of having 1 API that both hooks can call.

> >
> > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > read-only (writes and data slices are not permitted). For reads on the
> > > dynptr, this includes reading into data in the non-linear paged buffers
> > > but for writes and data slices, if the data is in a paged buffer, the
> > > user must first call bpf_skb_pull_data to pull the data into the linear
> > > portion.
> > >
> > > Additionally, any helper calls that change the underlying packet buffer
> > > (eg bpf_skb_pull_data) invalidates any data slices of the associated
> > > dynptr.
> >
> > Grepping the verifier did not help me find that, would you mind
> > pointing me to the code?
>
> The base reg type of a skb data slice will be PTR_TO_PACKET - this
> gets set in this patch in check_helper_call() in verifier.c:
>
> + if (func_id == BPF_FUNC_dynptr_data &&
> +    meta.type == BPF_DYNPTR_TYPE_SKB)
> + regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
>
> Anytime there is a helper call that changes the underlying packet
> buffer [0], the verifier iterates through the registers and marks all
> PTR_TO_PACKET reg types as unknown, which invalidates them. The dynptr
> data slice will be invalidated since its base reg type is
> PTR_TO_PACKET
>
> The stack trace is:
>    check_helper_call() -> clear_all_pkt_pointers() ->
> __clear_all_pkt_pointers() -> mark_reg_unknown()
>
>
> I will add this explanation to the commit message for v2 since it is non-obvious
>
>
> [0] https://elixir.bootlin.com/linux/latest/source/kernel/bpf/verifier.c#L7143
>
> [1] https://elixir.bootlin.com/linux/latest/source/kernel/bpf/verifier.c#L6489
>
>
> >
> > > Right now, skb dynptrs can only be constructed from skbs that are
> > > the bpf program context - as such, there does not need to be any
> > > reference tracking or release on skb dynptrs.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-03 23:25           ` Jakub Kicinski
  2022-08-04  1:05             ` Joanne Koong
  2022-08-04  1:27             ` Martin KaFai Lau
@ 2022-08-04 22:58             ` Kumar Kartikeya Dwivedi
  2022-08-05 23:25               ` Jakub Kicinski
  2 siblings, 1 reply; 52+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-08-04 22:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Joanne Koong, Martin KaFai Lau, bpf, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov

On Thu, 4 Aug 2022 at 01:28, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 3 Aug 2022 13:29:37 -0700 Joanne Koong wrote:
> > Thinking about this some more, I think BPF_FUNC_dynptr_from_skb needs
> > to be patched regardless in order to set the rd-only flag in the
> > metadata for the dynptr. There will be other helper functions that
> > write into dynptrs (eg memcpy with dynptrs, strncpy with dynptrs,
> > probe read user with dynptrs, ...) so I think it's more scalable if we
> > reject these writes at runtime through the rd-only flag in the
> > metadata, than for the verifier to custom-case that any helper funcs
> > that write into dynptrs will need to get dynptr type + do
> > may_access_direct_pkt_data() if it's type skb or xdp. The
> > inconsistency between not rd-only in metadata vs. rd-only in verifier
> > might be a little confusing as well.
> >
> > For these reasons, I'm leaning more towards having bpf_dynptr_write()
> > and other dynptr write helper funcs be rejected at runtime instead of
> > prog load time, but I'm eager to hear what you prefer.
> >
> > What are your thoughts?
>
> Oh. I thought dynptrs are an extension of the discussion we had about
> creating a skb_header_pointer()-like abstraction but it sounds like
> we veered quite far off that track at some point :(
>
> The point of skb_header_pointer() is to expose the chunk of the packet
> pointed to by [skb, offset, len] as a linear buffer. Potentially copying
> it out to a stack buffer *IFF* the header is not contiguous inside the
> skb head, which should very rarely happen.
>
> Here it seems we return an error so that user must pull if the data is
> not linear, which is defeating the purpose. The user of
> skb_header_pointer() wants to avoid the copy while _reliably_ getting
> a contiguous pointer. Plus pulling in the header may be far more
> expensive than a small copy to the stack.
>
> The pointer returned by skb_header_pointer is writable, but it's not
> guaranteed that the writes go to the packet, they may go to the
> on-stack buffer, so the caller must do some sort of:
>
>         if (data_ptr == stack_buf)
>                 skb_store_bits(...);
>
> Which we were thinking of wrapping in some sort of flush operation.
>
> If I'm reading this right, dynptrs as implemented here do not provide
> such semantics, am I confused in thinking that this is a continuation
> of the XDP multi-buff discussion? Is it a completely separate thing
> and we'll still need a header_pointer like helper?

When I worked on [0], I actually did it a bit like you described in
the original discussion under the xdp multi-buff thread, but I left
the other case (where data to be read resides across frag boundaries)
up to the user to handle, instead of automatically passing in a pointer
to the stack and doing the copy for them. So in my case,
xdp_load_bytes/xdp_store_bytes is the fallback if you can't get a
bpf_packet_pointer for a ctx, offset, len that you can directly
access. But this was only for XDP, not for skb.

The advantage with a dynptr is that len for the slice from
bpf_packet_pointer style helper doesn't have to be a constant, it can
be a runtime value and since it is checked at runtime anyway, the
helper's code is the same but access can be done for slices whose
length is unknown to the verifier in a safe manner. The dynptr is very
useful as the return value of such a helper.

The suggested usage was like this:

    int err = 0;
    char buf[N];

    off &= 0xffff;
    ptr = bpf_packet_pointer(ctx, off, sizeof(buf), &err);
    if (unlikely(!ptr)) {
        if (err < 0)
            return XDP_ABORTED;
        err = bpf_xdp_load_bytes(ctx, off, buf, sizeof(buf));
        if (err < 0)
            return XDP_ABORTED;
        ptr = buf;
    }
    ...
    // Do some stores and loads in [ptr, ptr + N) region
    ...
    if (unlikely(ptr == buf)) {
        err = bpf_xdp_store_bytes(ctx, off, buf, sizeof(buf));
        if (err < 0)
            return XDP_ABORTED;
    }

So the idea was the same, there is still a "flush" (in that unlikely
branch), but it is done explicitly by the user (which I found less
confusing than it being done automagically or a by a new flush helper
which will do the same thing we do here, but YMMV).

[0]: https://lore.kernel.org/bpf/20220306234311.452206-1-memxor@gmail.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
  2022-08-04 18:45           ` Alexei Starovoitov
@ 2022-08-05 16:29             ` Joanne Koong
  0 siblings, 0 replies; 52+ messages in thread
From: Joanne Koong @ 2022-08-05 16:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrii Nakryiko, bpf, Andrii Nakryiko, Daniel Borkmann,
	Alexei Starovoitov

On Thu, Aug 4, 2022 at 11:45 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Aug 3, 2022 at 9:11 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > __builtin_memcpy() is best. When we write just "memcpy()" we still
> > > rely on compiler to actually optimizing that to __builtin_memcpy(),
> > > because there is no memcpy() (we'd get unrecognized extern error if
> > > compiler actually emitted call to memcpy()).
> >
> > Ohh I see, thanks for the explanation!
> >
> > I am going to do some selftests cleanup this week, so I'll change the
> > other usages of memcpy() to __builtin_memcpy() as part of that clean
> > up.
>
> __builtin_memcpy might be doing single-byte copies when
> alignment is not known, which is often the case when
> working with packets.
> If we do this cleanup, let's copy-paste cilium's memcpy
> helper that does 8-byte copies.
> It's much better than __builtin_memcpy.
> https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h

Maybe we should have a separate patchset for changing to use cilium's
other builtins as well (eg memzero, memcmp, memmove); I'll leave the
memcpy() -> __builtin_memcpy() changes to be part of that patchset.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-04 21:55       ` Joanne Koong
@ 2022-08-05 23:22         ` Jakub Kicinski
  0 siblings, 0 replies; 52+ messages in thread
From: Jakub Kicinski @ 2022-08-05 23:22 UTC (permalink / raw)
  To: Joanne Koong; +Cc: bpf, andrii, daniel, ast

On Thu, 4 Aug 2022 14:55:14 -0700 Joanne Koong wrote:
> Thinking about this some more, I don't think we get a lot of benefits
> from combining it into one API (bpf_dynptr_from_packet) instead of 2
> separate APIs (bpf_dynptr_from_skb / bpf_dynptr_from_xdp).

Ease of use is not a big benefit? We'll continue to have people trying
to run XDP generic in prod because they wrote their program for XDP,
not tc :(

> The bpf_dynptr_write behavior will be inconsistent (eg bpf_dynptr_write
> into xdp frags will work whereas bpf_dynptr_write into skb frags will
> fail). Martin also pointed out that he'd prefer bpf_dynptr_write() to
> succeed for writing into frags and invalidate data slices (instead of
> failing the write and always keeping data slices valid), which we
> can't do if we combine xdp + skb, without always (needlessly)
> invalidating xdp data slices whenever there's a write. Additionally,
> in the verifier, there's no organic mapping between prog type -> prog
> ctx, so we'll have to hardcode some mapping between prog type -> skb
> vs. xdp ctx. I think for these reasons it makes more sense to have 2
> separate APIs, instead of having 1 API that both hooks can call.

Feels like pushing complexity onto users because the infra is not in
place. One day the infra will be in place yet the uAPI will remain
forever. But enough emails exchanged, take your pick and time will tell :)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs
  2022-08-04 22:58             ` Kumar Kartikeya Dwivedi
@ 2022-08-05 23:25               ` Jakub Kicinski
  0 siblings, 0 replies; 52+ messages in thread
From: Jakub Kicinski @ 2022-08-05 23:25 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Joanne Koong, Martin KaFai Lau, bpf, Andrii Nakryiko,
	Daniel Borkmann, Alexei Starovoitov

On Fri, 5 Aug 2022 00:58:01 +0200 Kumar Kartikeya Dwivedi wrote:
> When I worked on [0], I actually did it a bit like you described in
> the original discussion under the xdp multi-buff thread, but I left
> the other case (where data to be read resides across frag boundaries)
> up to the user to handle, instead of automatically passing in a pointer
> to the stack and doing the copy for them. So in my case,
> xdp_load_bytes/xdp_store_bytes is the fallback if you can't get a
> bpf_packet_pointer for a ctx, offset, len that you can directly
> access. But this was only for XDP, not for skb.
> 
> The advantage with a dynptr is that len for the slice from
> bpf_packet_pointer style helper doesn't have to be a constant, it can
> be a runtime value and since it is checked at runtime anyway, the
> helper's code is the same but access can be done for slices whose
> length is unknown to the verifier in a safe manner. The dynptr is very
> useful as the return value of such a helper.

I see.

> The suggested usage was like this:
> 
>     int err = 0;
>     char buf[N];
> 
>     off &= 0xffff;
>     ptr = bpf_packet_pointer(ctx, off, sizeof(buf), &err);
>     if (unlikely(!ptr)) {
>         if (err < 0)
>             return XDP_ABORTED;
>         err = bpf_xdp_load_bytes(ctx, off, buf, sizeof(buf));
>         if (err < 0)
>             return XDP_ABORTED;
>         ptr = buf;
>     }
>     ...
>     // Do some stores and loads in [ptr, ptr + N) region
>     ...
>     if (unlikely(ptr == buf)) {
>         err = bpf_xdp_store_bytes(ctx, off, buf, sizeof(buf));
>         if (err < 0)
>             return XDP_ABORTED;
>     }
> 
> So the idea was the same, there is still a "flush" (in that unlikely
> branch), but it is done explicitly by the user (which I found less
> confusing than it being done automagically or a by a new flush helper
> which will do the same thing we do here, but YMMV).

Ack, the flush is awkward to create an API for. I presume that's 
the main reason the xdp mbuf discussion wasn't fruitful.

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2022-08-05 23:25 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-26 18:47 [PATCH bpf-next v1 0/3] Add skb + xdp dynptrs Joanne Koong
2022-07-26 18:47 ` [PATCH bpf-next v1 1/3] bpf: Add skb dynptrs Joanne Koong
2022-07-27 17:13   ` sdf
2022-07-28 16:49     ` Joanne Koong
2022-07-28 17:28       ` Stanislav Fomichev
2022-07-28 17:45   ` Hao Luo
2022-07-28 18:36     ` Joanne Koong
2022-07-28 23:39   ` Martin KaFai Lau
2022-07-29 20:26     ` Joanne Koong
2022-07-29 21:39       ` Martin KaFai Lau
2022-08-01 17:52         ` Joanne Koong
2022-08-01 19:38           ` Martin KaFai Lau
2022-08-01 21:16             ` Joanne Koong
2022-08-01 22:14               ` Andrii Nakryiko
2022-08-01 22:32               ` Martin KaFai Lau
2022-08-01 22:58                 ` Andrii Nakryiko
2022-08-01 23:23                   ` Martin KaFai Lau
2022-08-02  0:56                     ` Martin KaFai Lau
2022-08-02  3:51                       ` Andrii Nakryiko
2022-08-02  4:53                         ` Joanne Koong
2022-08-02  5:14                           ` Joanne Koong
2022-08-03 20:29         ` Joanne Koong
2022-08-03 20:36           ` Andrii Nakryiko
2022-08-03 20:56           ` Martin KaFai Lau
2022-08-03 23:25           ` Jakub Kicinski
2022-08-04  1:05             ` Joanne Koong
2022-08-04  1:34               ` Jakub Kicinski
2022-08-04  3:44                 ` Joanne Koong
2022-08-04  1:27             ` Martin KaFai Lau
2022-08-04  1:44               ` Jakub Kicinski
2022-08-04 22:58             ` Kumar Kartikeya Dwivedi
2022-08-05 23:25               ` Jakub Kicinski
2022-08-01 22:11   ` Andrii Nakryiko
2022-08-02  0:15     ` Joanne Koong
2022-08-01 23:33   ` Jakub Kicinski
2022-08-02  2:12     ` Joanne Koong
2022-08-04 21:55       ` Joanne Koong
2022-08-05 23:22         ` Jakub Kicinski
2022-08-03  6:37   ` Martin KaFai Lau
2022-07-26 18:47 ` [PATCH bpf-next v1 2/3] bpf: Add xdp dynptrs Joanne Koong
2022-07-26 18:47 ` [PATCH bpf-next v1 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Joanne Koong
2022-07-26 19:44   ` Zvi Effron
2022-07-26 20:06     ` Joanne Koong
2022-08-01 17:58   ` Andrii Nakryiko
2022-08-02 22:56     ` Joanne Koong
2022-08-03  0:53       ` Andrii Nakryiko
2022-08-03 16:11         ` Joanne Koong
2022-08-04 18:45           ` Alexei Starovoitov
2022-08-05 16:29             ` Joanne Koong
2022-08-01 19:12   ` Alexei Starovoitov
2022-08-02 22:21     ` Joanne Koong
2022-08-04 21:46       ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).