* [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock
@ 2019-02-01  7:03 Martin KaFai Lau
  2019-02-01  7:03 ` [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper Martin KaFai Lau
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This series adds __sk_buff->sk, "struct bpf_tcp_sock",
BPF_FUNC_sk_fullsock and BPF_FUNC_tcp_sock.  Together, they provide
a common way to expose the members of "struct tcp_sock" and
"struct bpf_sock" for the bpf_prog to access.

The patch series first adds a bpf_sock pointer to __sk_buff
and a new helper BPF_FUNC_sk_fullsock.

It then adds BPF_FUNC_tcp_sock to get a bpf_tcp_sock
pointer from a bpf_sock pointer.

The current use case is to allow a cg_skb_bpf_prog to provide
per-cgroup traffic policing/shaping.
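
For illustration, the pieces combine roughly like the following
minimal sketch (a hypothetical cg_skb program; the helper
signatures are as introduced in this series):

int cg_skb_shaper(struct __sk_buff *skb)
{
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;

	sk = skb->sk;			/* PTR_TO_SOCK_COMMON_OR_NULL */
	if (!sk)
		return 1;

	sk = bpf_sk_fullsock(sk);	/* NULL if sk is not a fullsock */
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);		/* NULL if sk is not a tcp_sock */
	if (!tp)
		return 1;

	/* e.g. police based on tp->snd_cwnd, tp->bytes_acked, ... */

	return 1;
}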

Please see the individual patches for details.

Martin KaFai Lau (6):
  bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  bpf: Refactor sock_ops_convert_ctx_access
  bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
  bpf: Sync bpf.h to tools/
  bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to
    test_verifier
  bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock

 include/linux/bpf.h                           |  42 ++
 include/uapi/linux/bpf.h                      |  61 ++-
 kernel/bpf/verifier.c                         | 156 +++++--
 net/core/filter.c                             | 407 +++++++++++-------
 tools/include/uapi/linux/bpf.h                |  61 ++-
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 .../testing/selftests/bpf/test_sock_fields.c  | 314 ++++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 145 +++++++
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 238 ++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 12 files changed, 1233 insertions(+), 206 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

-- 
2.17.1



* [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
@ 2019-02-01  7:03 ` Martin KaFai Lau
  2019-02-02  2:38   ` Alexei Starovoitov
  2019-02-04 22:33   ` Daniel Borkmann
  2019-02-01  7:03 ` [PATCH bpf-next 2/6] bpf: Refactor sock_ops_convert_ctx_access Martin KaFai Lau
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

In the kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
before accessing the fields in sock.  For example, in __netdev_pick_tx:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	/* ... */

	struct sock *sk = skb->sk;

		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

	/* ... */

	return queue_index;
}

This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff".
A few of the convert_ctx_access() functions in filter.c have already
been accessing the skb->sk sock_common's fields,
e.g. sock_ops_convert_ctx_access().

"__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
Some of the fileds in "bpf_sock" will not be directly
accessible through the "__sk_buff->sk" pointer.  It is limited
by the new "bpf_sock_common_is_valid_access()".
e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
     are not allowed.
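
For example (a sketch only; the exact accept/reject behavior is
exercised by the verifier tests in a later patch), "family" is
backed by sock_common and can be read directly, while "type" is
a fullsock-only field:

	struct bpf_sock *sk = skb->sk;
	__u32 family;

	if (!sk)
		return 1;
	family = sk->family;	/* ok: a sock_common backed field */
	/* "sk->type" would be rejected here.  It is a fullsock-only
	 * field, so bpf_sk_fullsock() must be called first.
	 */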

The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
can be used to get a sk with all accessible fields in "bpf_sock".
This helper is added to both cg_skb and sched_(cls|act).

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_sock *sk;
	__u32 family;

	sk = skb->sk;
	if (!sk)
		return 1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return 1;

	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
		return 1;

	/* some_traffic_shaping(); */

	return 1;
}

(1) The sk is read only

(2) There is no new "struct bpf_sock_common" introduced.

(3) Future kernel sock members can be added to bpf_sock only,
    instead of being repeatedly added at multiple places as is
    currently done in bpf_sock_ops_md, bpf_sock_addr_md,
    sk_reuseport_md, etc.

(4) After "sk = skb->sk", the reg holding sk is in type
    PTR_TO_SOCK_COMMON_OR_NULL.

(5) After bpf_sk_fullsock(), the return type will be in type
    PTR_TO_SOCKET_OR_NULL which is the same as the return type of
    bpf_sk_lookup_xxx().

    However, bpf_sk_fullsock() does not take a refcnt.  The
    acquire_reference_state() currently depends only on the return
    type.  To avoid acquiring a reference here, a new
    is_acquire_function() is checked before calling
    acquire_reference_state().

(6) Hitting the WARN_ON in "release_reference_state()" no longer
    indicates an internal verifier bug.

    When reg->id is not found in state->refs[], it means the
    bpf_prog does something wrong like
    "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
    never been acquired by calling "bpf_sk_fullsock(skb->sk)".

    A -EINVAL is returned and a verbose message is printed instead
    of hitting the WARN_ON (see the sketch after this list).  A test
    is added to test_verifier in a later patch.

    Since the WARN_ON in "release_reference_state()" is no longer
    needed, "__release_reference_state()" is also folded into
    "release_reference_state()".

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      |  12 ++++
 include/uapi/linux/bpf.h |  12 +++-
 kernel/bpf/verifier.c    | 129 +++++++++++++++++++++++++++------------
 net/core/filter.c        |  43 +++++++++++++
 4 files changed, 156 insertions(+), 40 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0394f1f9213b..72022a0c442d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -162,6 +162,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_CTX,		/* pointer to context */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
+	ARG_PTR_TO_SOCK_COMMON,	/* pointer to __sk_buff->sk */
 };
 
 /* type of values returned from helper functions */
@@ -224,6 +225,8 @@ enum bpf_reg_type {
 	PTR_TO_FLOW_KEYS,	 /* reg points to bpf_flow_keys */
 	PTR_TO_SOCKET,		 /* reg points to struct bpf_sock */
 	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
+	PTR_TO_SOCK_COMMON,	 /* reg points to sock_common */
+	PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -887,6 +890,9 @@ void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 #if defined(CONFIG_NET)
+bool bpf_sock_common_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info);
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info);
 u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
@@ -895,6 +901,12 @@ u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 				struct bpf_prog *prog,
 				u32 *target_size);
 #else
+static inline bool bpf_sock_common_is_valid_access(int off, int size,
+						   enum bpf_access_type type,
+						   struct bpf_insn_access_aux *info)
+{
+	return false;
+}
 static inline bool bpf_sock_is_valid_access(int off, int size,
 					    enum bpf_access_type type,
 					    struct bpf_insn_access_aux *info)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60b99b730a41..b7a5312c4ccf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2328,6 +2328,14 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper tests if a bpf_sock is a fullsock such that
+ *		all the fields in bpf_sock can be accessed.
+ *	Return
+ *		The same pointer if *sk* is a fullsock.  Otherwise,
+ *		NULL is returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2422,7 +2430,8 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(sk_fullsock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2542,6 +2551,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8c1c21cd50b4..b4500f29dff1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -330,10 +330,17 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 	       type == PTR_TO_PACKET_META;
 }
 
+static bool type_is_sk_pointer(enum bpf_reg_type type)
+{
+	return type == PTR_TO_SOCKET ||
+		type == PTR_TO_SOCK_COMMON;
+}
+
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
-	       type == PTR_TO_SOCKET_OR_NULL;
+	       type == PTR_TO_SOCKET_OR_NULL ||
+	       type == PTR_TO_SOCK_COMMON_OR_NULL;
 }
 
 static bool type_is_refcounted(enum bpf_reg_type type)
@@ -370,6 +377,12 @@ static bool is_release_function(enum bpf_func_id func_id)
 	return func_id == BPF_FUNC_sk_release;
 }
 
+static bool is_acquire_function(enum bpf_func_id func_id)
+{
+	return func_id == BPF_FUNC_sk_lookup_tcp ||
+		func_id == BPF_FUNC_sk_lookup_udp;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
 	[NOT_INIT]		= "?",
@@ -385,6 +398,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_FLOW_KEYS]	= "flow_keys",
 	[PTR_TO_SOCKET]		= "sock",
 	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
+	[PTR_TO_SOCK_COMMON]	= "sock_common",
+	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
 };
 
 static char slot_type_char[] = {
@@ -611,13 +626,10 @@ static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
 }
 
 /* release function corresponding to acquire_reference_state(). Idempotent. */
-static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
+static int release_reference_state(struct bpf_func_state *state, int ptr_id)
 {
 	int i, last_idx;
 
-	if (!ptr_id)
-		return -EFAULT;
-
 	last_idx = state->acquired_refs - 1;
 	for (i = 0; i < state->acquired_refs; i++) {
 		if (state->refs[i].id == ptr_id) {
@@ -629,21 +641,7 @@ static int __release_reference_state(struct bpf_func_state *state, int ptr_id)
 			return 0;
 		}
 	}
-	return -EFAULT;
-}
-
-/* variation on the above for cases where we expect that there must be an
- * outstanding reference for the specified ptr_id.
- */
-static int release_reference_state(struct bpf_verifier_env *env, int ptr_id)
-{
-	struct bpf_func_state *state = cur_func(env);
-	int err;
-
-	err = __release_reference_state(state, ptr_id);
-	if (WARN_ON_ONCE(err != 0))
-		verbose(env, "verifier internal error: can't release reference\n");
-	return err;
+	return -EINVAL;
 }
 
 static int transfer_reference_state(struct bpf_func_state *dst,
@@ -1201,6 +1199,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case CONST_PTR_TO_MAP:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -1623,6 +1623,7 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
 	struct bpf_reg_state *regs = cur_regs(env);
 	struct bpf_reg_state *reg = &regs[regno];
 	struct bpf_insn_access_aux info;
+	bool valid;
 
 	if (reg->smin_value < 0) {
 		verbose(env, "R%d min value is negative, either use unsigned index or do a if (index >=0) check.\n",
@@ -1630,13 +1631,24 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
 		return -EACCES;
 	}
 
-	if (!bpf_sock_is_valid_access(off, size, t, &info)) {
-		verbose(env, "invalid bpf_sock access off=%d size=%d\n",
-			off, size);
-		return -EACCES;
+	switch (reg->type) {
+	case PTR_TO_SOCK_COMMON:
+		valid = bpf_sock_common_is_valid_access(off, size, t, &info);
+		break;
+	case PTR_TO_SOCKET:
+		valid = bpf_sock_is_valid_access(off, size, t, &info);
+		break;
+	default:
+		valid = false;
 	}
 
-	return 0;
+	if (valid)
+		return 0;
+
+	verbose(env, "R%d invalid %s access off=%d size=%d\n",
+		regno, reg_type_str[reg->type], off, size);
+
+	return -EACCES;
 }
 
 static bool __is_pointer_value(bool allow_ptr_leaks,
@@ -1662,8 +1674,14 @@ static bool is_ctx_reg(struct bpf_verifier_env *env, int regno)
 {
 	const struct bpf_reg_state *reg = reg_state(env, regno);
 
-	return reg->type == PTR_TO_CTX ||
-	       reg->type == PTR_TO_SOCKET;
+	return reg->type == PTR_TO_CTX;
+}
+
+static bool is_sk_reg(struct bpf_verifier_env *env, int regno)
+{
+	const struct bpf_reg_state *reg = reg_state(env, regno);
+
+	return type_is_sk_pointer(reg->type);
 }
 
 static bool is_pkt_reg(struct bpf_verifier_env *env, int regno)
@@ -1774,6 +1792,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_SOCKET:
 		pointer_desc = "sock ";
 		break;
+	case PTR_TO_SOCK_COMMON:
+		pointer_desc = "sock_common ";
+		break;
 	default:
 		break;
 	}
@@ -1977,11 +1998,14 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 			 * PTR_TO_PACKET[_META,_END]. In the latter
 			 * case, we know the offset is zero.
 			 */
-			if (reg_type == SCALAR_VALUE)
+			if (reg_type == SCALAR_VALUE) {
 				mark_reg_unknown(env, regs, value_regno);
-			else
+			} else {
 				mark_reg_known_zero(env, regs,
 						    value_regno);
+				if (reg_type_may_be_null(reg_type))
+					regs[value_regno].id = ++env->id_gen;
+			}
 			regs[value_regno].type = reg_type;
 		}
 
@@ -2027,9 +2051,10 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 		err = check_flow_keys_access(env, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown(env, regs, value_regno);
-	} else if (reg->type == PTR_TO_SOCKET) {
+	} else if (type_is_sk_pointer(reg->type)) {
 		if (t == BPF_WRITE) {
-			verbose(env, "cannot write into socket\n");
+			verbose(env, "R%d cannot write into %s\n",
+				regno, reg_type_str[reg->type]);
 			return -EACCES;
 		}
 		err = check_sock_access(env, regno, off, size, t);
@@ -2076,7 +2101,8 @@ static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_ins
 
 	if (is_ctx_reg(env, insn->dst_reg) ||
 	    is_pkt_reg(env, insn->dst_reg) ||
-	    is_flow_key_reg(env, insn->dst_reg)) {
+	    is_flow_key_reg(env, insn->dst_reg) ||
+	    is_sk_reg(env, insn->dst_reg)) {
 		verbose(env, "BPF_XADD stores into R%d %s is not allowed\n",
 			insn->dst_reg,
 			reg_type_str[reg_state(env, insn->dst_reg)->type]);
@@ -2258,6 +2284,11 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		err = check_ctx_reg(env, reg, regno);
 		if (err < 0)
 			return err;
+	} else if (arg_type == ARG_PTR_TO_SOCK_COMMON) {
+		expected_type = PTR_TO_SOCK_COMMON;
+		/* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */
+		if (!type_is_sk_pointer(type))
+			goto err_type;
 	} else if (arg_type == ARG_PTR_TO_SOCKET) {
 		expected_type = PTR_TO_SOCKET;
 		if (type != expected_type)
@@ -2661,7 +2692,7 @@ static int release_reference(struct bpf_verifier_env *env,
 	for (i = 0; i <= vstate->curframe; i++)
 		release_reg_references(env, vstate->frame[i], meta->ptr_id);
 
-	return release_reference_state(env, meta->ptr_id);
+	return release_reference_state(cur_func(env), meta->ptr_id);
 }
 
 static int check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
@@ -2926,8 +2957,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		}
 	} else if (is_release_function(func_id)) {
 		err = release_reference(env, &meta);
-		if (err)
+		if (err) {
+			verbose(env, "func %s#%d reference has not been acquired before\n",
+				func_id_name(func_id), func_id);
 			return err;
+		}
 	}
 
 	regs = cur_regs(env);
@@ -2974,12 +3008,19 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			regs[BPF_REG_0].id = ++env->id_gen;
 		}
 	} else if (fn->ret_type == RET_PTR_TO_SOCKET_OR_NULL) {
-		int id = acquire_reference_state(env, insn_idx);
-		if (id < 0)
-			return id;
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_SOCKET_OR_NULL;
-		regs[BPF_REG_0].id = id;
+		if (is_acquire_function(func_id)) {
+			int id = acquire_reference_state(env, insn_idx);
+
+			if (id < 0)
+				return id;
+			/* For release_reference() */
+			regs[BPF_REG_0].id = id;
+		} else {
+			/* For mark_ptr_or_null_reg() */
+			regs[BPF_REG_0].id = ++env->id_gen;
+		}
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -3239,6 +3280,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_PACKET_END:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -4472,7 +4515,10 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			}
 		} else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
 			reg->type = PTR_TO_SOCKET;
+		} else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) {
+			reg->type = PTR_TO_SOCK_COMMON;
 		}
+
 		if (is_null || !reg_is_refcounted(reg)) {
 			/* We don't need id from this point onwards anymore,
 			 * thus we should better reset it, so that state
@@ -4495,7 +4541,7 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
 	int i, j;
 
 	if (reg_is_refcounted_or_null(&regs[regno]) && is_null)
-		__release_reference_state(state, id);
+		release_reference_state(state, id);
 
 	for (i = 0; i < MAX_BPF_REG; i++)
 		mark_ptr_or_null_reg(state, &regs[i], id, is_null);
@@ -5656,6 +5702,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_FLOW_KEYS:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -5973,6 +6021,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_CTX:
 	case PTR_TO_SOCKET:
 	case PTR_TO_SOCKET_OR_NULL:
+	case PTR_TO_SOCK_COMMON:
+	case PTR_TO_SOCK_COMMON_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -6944,6 +6994,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 			convert_ctx_access = ops->convert_ctx_access;
 			break;
 		case PTR_TO_SOCKET:
+		case PTR_TO_SOCK_COMMON:
 			convert_ctx_access = bpf_sock_convert_ctx_access;
 			break;
 		default:
diff --git a/net/core/filter.c b/net/core/filter.c
index 8ce421796ac6..7f9ff0517b67 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1793,6 +1793,18 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
+{
+	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
+}
+
+static const struct bpf_func_proto bpf_sk_fullsock_proto = {
+	.func		= bpf_sk_fullsock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_SOCKET_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+};
+
 static inline int sk_skb_try_make_writable(struct sk_buff *skb,
 					   unsigned int write_len)
 {
@@ -5388,6 +5400,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	switch (func_id) {
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}
@@ -5459,6 +5473,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_skb_fib_lookup_proto;
+	case BPF_FUNC_sk_fullsock:
+		return &bpf_sk_fullsock_proto;
 #ifdef CONFIG_XFRM
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
@@ -5746,6 +5762,11 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 		if (size != sizeof(__u64))
 			return false;
 		break;
+	case offsetof(struct __sk_buff, sk):
+		if (type == BPF_WRITE || size != sizeof(__u64))
+			return false;
+		info->reg_type = PTR_TO_SOCK_COMMON_OR_NULL;
+		break;
 	default:
 		/* Only narrow read access allowed for now. */
 		if (type == BPF_WRITE) {
@@ -5932,6 +5953,21 @@ static bool __sock_filter_check_size(int off, int size,
 	return size == size_default;
 }
 
+bool bpf_sock_common_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case offsetof(struct bpf_sock, type):
+	case offsetof(struct bpf_sock, protocol):
+	case offsetof(struct bpf_sock, mark):
+	case offsetof(struct bpf_sock, priority):
+		return false;
+	default:
+		return bpf_sock_is_valid_access(off, size, type, info);
+	}
+}
+
 bool bpf_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 			      struct bpf_insn_access_aux *info)
 {
@@ -6730,6 +6766,13 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 		off += offsetof(struct qdisc_skb_cb, pkt_len);
 		*target_size = 4;
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg, off);
+		break;
+
+	case offsetof(struct __sk_buff, sk):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_buff, sk));
+		break;
 	}
 
 	return insn - insn_buf;
-- 
2.17.1



* [PATCH bpf-next 2/6] bpf: Refactor sock_ops_convert_ctx_access
  2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
  2019-02-01  7:03 ` [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper Martin KaFai Lau
@ 2019-02-01  7:03 ` Martin KaFai Lau
  2019-02-01  7:03 ` [PATCH bpf-next 3/6] bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock Martin KaFai Lau
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

The next patch will introduce a new "struct bpf_tcp_sock" which
exposes the same tcp_sock's fields already exposed in
"struct bpf_sock_ops".

This patch refactors the existing convert_ctx_access() code for
"struct bpf_sock_ops" to get it ready to be reused for
"struct bpf_tcp_sock".  The "rtt_min" is not refactored
in this patch because its handling is different from the other
fields (it is a "struct minmax" rather than a plain integer).

The SOCK_OPS_GET_TCP_SOCK_FIELD macro is new.  All other
SOCK_OPS_XXX_FIELD changes are code moves only.
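
Roughly, each SOCK_OPS_GET_TCP_SOCK_FIELD(FIELD) case emits the
same instruction sequence as the per-field cases it replaces,
i.e. in pseudo C (with dst being the destination register and
ops_kern the "struct bpf_sock_ops_kern" ctx):

	dst = ops_kern->is_fullsock;
	if (dst != 0) {
		dst = (unsigned long)ops_kern->sk;
		dst = ((struct tcp_sock *)dst)->FIELD;
	}
	/* else dst is left holding 0 (the is_fullsock value) */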

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/core/filter.c | 287 ++++++++++++++++++++--------------------------
 1 file changed, 127 insertions(+), 160 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7f9ff0517b67..b3b4b5faede4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5020,6 +5020,54 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 };
 #endif /* CONFIG_IPV6_SEG6_BPF */
 
+#define CONVERT_COMMON_TCP_SOCK_FIELDS(md_type, CONVERT)		\
+do {									\
+	switch (si->off) {						\
+	case offsetof(md_type, snd_cwnd):				\
+		CONVERT(snd_cwnd); break;				\
+	case offsetof(md_type, srtt_us):				\
+		CONVERT(srtt_us); break;				\
+	case offsetof(md_type, snd_ssthresh):				\
+		CONVERT(snd_ssthresh); break;				\
+	case offsetof(md_type, rcv_nxt):				\
+		CONVERT(rcv_nxt); break;				\
+	case offsetof(md_type, snd_nxt):				\
+		CONVERT(snd_nxt); break;				\
+	case offsetof(md_type, snd_una):				\
+		CONVERT(snd_una); break;				\
+	case offsetof(md_type, mss_cache):				\
+		CONVERT(mss_cache); break;				\
+	case offsetof(md_type, ecn_flags):				\
+		CONVERT(ecn_flags); break;				\
+	case offsetof(md_type, rate_delivered):				\
+		CONVERT(rate_delivered); break;				\
+	case offsetof(md_type, rate_interval_us):			\
+		CONVERT(rate_interval_us); break;			\
+	case offsetof(md_type, packets_out):				\
+		CONVERT(packets_out); break;				\
+	case offsetof(md_type, retrans_out):				\
+		CONVERT(retrans_out); break;				\
+	case offsetof(md_type, total_retrans):				\
+		CONVERT(total_retrans); break;				\
+	case offsetof(md_type, segs_in):				\
+		CONVERT(segs_in); break;				\
+	case offsetof(md_type, data_segs_in):				\
+		CONVERT(data_segs_in); break;				\
+	case offsetof(md_type, segs_out):				\
+		CONVERT(segs_out); break;				\
+	case offsetof(md_type, data_segs_out):				\
+		CONVERT(data_segs_out); break;				\
+	case offsetof(md_type, lost_out):				\
+		CONVERT(lost_out); break;				\
+	case offsetof(md_type, sacked_out):				\
+		CONVERT(sacked_out); break;				\
+	case offsetof(md_type, bytes_received):				\
+		CONVERT(bytes_received); break;				\
+	case offsetof(md_type, bytes_acked):				\
+		CONVERT(bytes_acked); break;				\
+	}								\
+} while (0)
+
 #ifdef CONFIG_INET
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
@@ -7124,6 +7172,85 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 	struct bpf_insn *insn = insn_buf;
 	int off;
 
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      si->dst_reg, si->src_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
+						       OBJ_FIELD),	      \
+				      si->dst_reg, si->dst_reg,		      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+	} while (0)
+
+#define SOCK_OPS_GET_TCP_SOCK_FIELD(FIELD) \
+		SOCK_OPS_GET_FIELD(FIELD, FIELD, struct tcp_sock)
+
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
+	do {								      \
+		int reg = BPF_REG_9;					      \
+		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
+			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		if (si->dst_reg == reg || si->src_reg == reg)		      \
+			reg--;						      \
+		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern,     \
+						is_fullsock),		      \
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       is_fullsock));		      \
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
+						struct bpf_sock_ops_kern, sk),\
+				      reg, si->dst_reg,			      \
+				      offsetof(struct bpf_sock_ops_kern, sk));\
+		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
+				      reg, si->src_reg,			      \
+				      offsetof(OBJ, OBJ_FIELD));	      \
+		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
+				      offsetof(struct bpf_sock_ops_kern,      \
+					       temp));			      \
+	} while (0)
+
+#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
+	do {								      \
+		if (TYPE == BPF_WRITE)					      \
+			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+		else							      \
+			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_sock_ops,
+				       SOCK_OPS_GET_TCP_SOCK_FIELD);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
 	switch (si->off) {
 	case offsetof(struct bpf_sock_ops, op) ...
 	     offsetof(struct bpf_sock_ops, replylong[3]):
@@ -7281,175 +7408,15 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
 				      FIELD_SIZEOF(struct minmax_sample, t));
 		break;
 
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 2);	      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      si->dst_reg, si->src_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,		      \
-						       OBJ_FIELD),	      \
-				      si->dst_reg, si->dst_reg,		      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-	} while (0)
-
-/* Helper macro for adding write access to tcp_sock or sock fields.
- * The macro is called with two registers, dst_reg which contains a pointer
- * to ctx (context) and src_reg which contains the value that should be
- * stored. However, we need an additional register since we cannot overwrite
- * dst_reg because it may be used later in the program.
- * Instead we "borrow" one of the other register. We first save its value
- * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
- * it at the end of the macro.
- */
-#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)			      \
-	do {								      \
-		int reg = BPF_REG_9;					      \
-		BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >		      \
-			     FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		if (si->dst_reg == reg || si->src_reg == reg)		      \
-			reg--;						      \
-		*insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern,     \
-						is_fullsock),		      \
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       is_fullsock));		      \
-		*insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);		      \
-		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(			      \
-						struct bpf_sock_ops_kern, sk),\
-				      reg, si->dst_reg,			      \
-				      offsetof(struct bpf_sock_ops_kern, sk));\
-		*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),	      \
-				      reg, si->src_reg,			      \
-				      offsetof(OBJ, OBJ_FIELD));	      \
-		*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->dst_reg,		      \
-				      offsetof(struct bpf_sock_ops_kern,      \
-					       temp));			      \
-	} while (0)
-
-#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)	      \
-	do {								      \
-		if (TYPE == BPF_WRITE)					      \
-			SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-		else							      \
-			SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);	      \
-	} while (0)
-
-	case offsetof(struct bpf_sock_ops, snd_cwnd):
-		SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, srtt_us):
-		SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
 		SOCK_OPS_GET_FIELD(bpf_sock_ops_cb_flags, bpf_sock_ops_cb_flags,
 				   struct tcp_sock);
 		break;
 
-	case offsetof(struct bpf_sock_ops, snd_ssthresh):
-		SOCK_OPS_GET_FIELD(snd_ssthresh, snd_ssthresh, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rcv_nxt):
-		SOCK_OPS_GET_FIELD(rcv_nxt, rcv_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_nxt):
-		SOCK_OPS_GET_FIELD(snd_nxt, snd_nxt, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, snd_una):
-		SOCK_OPS_GET_FIELD(snd_una, snd_una, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, mss_cache):
-		SOCK_OPS_GET_FIELD(mss_cache, mss_cache, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, ecn_flags):
-		SOCK_OPS_GET_FIELD(ecn_flags, ecn_flags, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_delivered):
-		SOCK_OPS_GET_FIELD(rate_delivered, rate_delivered,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, rate_interval_us):
-		SOCK_OPS_GET_FIELD(rate_interval_us, rate_interval_us,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, packets_out):
-		SOCK_OPS_GET_FIELD(packets_out, packets_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, retrans_out):
-		SOCK_OPS_GET_FIELD(retrans_out, retrans_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, total_retrans):
-		SOCK_OPS_GET_FIELD(total_retrans, total_retrans,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_in):
-		SOCK_OPS_GET_FIELD(segs_in, segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_in):
-		SOCK_OPS_GET_FIELD(data_segs_in, data_segs_in, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, segs_out):
-		SOCK_OPS_GET_FIELD(segs_out, segs_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, data_segs_out):
-		SOCK_OPS_GET_FIELD(data_segs_out, data_segs_out,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, lost_out):
-		SOCK_OPS_GET_FIELD(lost_out, lost_out, struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, sacked_out):
-		SOCK_OPS_GET_FIELD(sacked_out, sacked_out, struct tcp_sock);
-		break;
-
 	case offsetof(struct bpf_sock_ops, sk_txhash):
 		SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
 					  struct sock, type);
 		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_received):
-		SOCK_OPS_GET_FIELD(bytes_received, bytes_received,
-				   struct tcp_sock);
-		break;
-
-	case offsetof(struct bpf_sock_ops, bytes_acked):
-		SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
-		break;
-
 	}
 	return insn - insn_buf;
 }
-- 
2.17.1



* [PATCH bpf-next 3/6] bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
  2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
  2019-02-01  7:03 ` [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper Martin KaFai Lau
  2019-02-01  7:03 ` [PATCH bpf-next 2/6] bpf: Refactor sock_ops_convert_ctx_access Martin KaFai Lau
@ 2019-02-01  7:03 ` Martin KaFai Lau
  2019-02-01  7:04 ` [PATCH bpf-next 4/6] bpf: Sync bpf.h to tools/ Martin KaFai Lau
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:03 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This patch adds a helper function, BPF_FUNC_tcp_sock, which is
currently available to cg_skb and sched_(cls|act):

struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;
	__u32 snd_cwnd;

	sk = skb->sk;
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);
	if (!tp)
		return 1;

	snd_cwnd = tp->snd_cwnd;
	/* ... */

	return 1;
}

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock fields
that have already been exposed by bpf_sock_ops,
i.e. no new tcp_sock fields are exposed in bpf.h.

This helper ensures the sk is a tcp_sock.  If it is not,
it returns NULL.  Hence, the caller needs to check for NULL
before accessing the returned bpf_tcp_sock.

The current use case is to expose members from tcp_sock
to allow a cg_skb_bpf_prog to provide per-cgroup traffic
policing/shaping.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      | 30 ++++++++++++++++
 include/uapi/linux/bpf.h | 51 +++++++++++++++++++++++++-
 kernel/bpf/verifier.c    | 31 ++++++++++++++--
 net/core/filter.c        | 77 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 186 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 72022a0c442d..cc3667f8c003 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -172,6 +172,7 @@ enum bpf_return_type {
 	RET_PTR_TO_MAP_VALUE,		/* returns a pointer to map elem value */
 	RET_PTR_TO_MAP_VALUE_OR_NULL,	/* returns a pointer to map elem value or NULL */
 	RET_PTR_TO_SOCKET_OR_NULL,	/* returns a pointer to a socket or NULL */
+	RET_PTR_TO_TCP_SOCK_OR_NULL,	/* returns a pointer to a tcp_sock or NULL */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -227,6 +228,8 @@ enum bpf_reg_type {
 	PTR_TO_SOCKET_OR_NULL,	 /* reg points to struct bpf_sock or NULL */
 	PTR_TO_SOCK_COMMON,	 /* reg points to sock_common */
 	PTR_TO_SOCK_COMMON_OR_NULL, /* reg points to sock_common or NULL */
+	PTR_TO_TCP_SOCK,	 /* reg points to struct bpf_tcp_sock */
+	PTR_TO_TCP_SOCK_OR_NULL, /* reg points to struct bpf_tcp_sock or NULL */
 };
 
 /* The information passed from prog-specific *_is_valid_access
@@ -923,4 +926,31 @@ static inline u32 bpf_sock_convert_ctx_access(enum bpf_access_type type,
 }
 #endif
 
+#ifdef CONFIG_INET
+bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info);
+
+u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+				    const struct bpf_insn *si,
+				    struct bpf_insn *insn_buf,
+				    struct bpf_prog *prog,
+				    u32 *target_size);
+#else
+static inline bool bpf_tcp_sock_is_valid_access(int off, int size,
+						enum bpf_access_type type,
+						struct bpf_insn_access_aux *info)
+{
+	return false;
+}
+
+static inline u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+						  const struct bpf_insn *si,
+						  struct bpf_insn *insn_buf,
+						  struct bpf_prog *prog,
+						  u32 *target_size)
+{
+	return 0;
+}
+#endif /* CONFIG_INET */
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b7a5312c4ccf..5f6c947abdd8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2336,6 +2336,15 @@ union bpf_attr {
  *	Return
  *		The same pointer if *sk* is a fullsock.  Otherwise,
  *		NULL is returned.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		Obtain a **struct bpf_tcp_sock** pointer from a
+ *              **struct bpf_sock** pointer.
+ *
+ *	Return
+ *		Pointer to **struct bpf_tcp_sock**, or **NULL** in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2431,7 +2440,8 @@ union bpf_attr {
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
 	FN(rc_pointer_rel),		\
-	FN(sk_fullsock),
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2614,6 +2624,45 @@ struct bpf_sock {
 				 */
 };
 
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
+				 */
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
+				 */
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b4500f29dff1..f8ba690dad80 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -333,14 +333,16 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 static bool type_is_sk_pointer(enum bpf_reg_type type)
 {
 	return type == PTR_TO_SOCKET ||
-		type == PTR_TO_SOCK_COMMON;
+		type == PTR_TO_SOCK_COMMON ||
+		type == PTR_TO_TCP_SOCK;
 }
 
 static bool reg_type_may_be_null(enum bpf_reg_type type)
 {
 	return type == PTR_TO_MAP_VALUE_OR_NULL ||
 	       type == PTR_TO_SOCKET_OR_NULL ||
-	       type == PTR_TO_SOCK_COMMON_OR_NULL;
+	       type == PTR_TO_SOCK_COMMON_OR_NULL ||
+	       type == PTR_TO_TCP_SOCK_OR_NULL;
 }
 
 static bool type_is_refcounted(enum bpf_reg_type type)
@@ -400,6 +402,8 @@ static const char * const reg_type_str[] = {
 	[PTR_TO_SOCKET_OR_NULL] = "sock_or_null",
 	[PTR_TO_SOCK_COMMON]	= "sock_common",
 	[PTR_TO_SOCK_COMMON_OR_NULL] = "sock_common_or_null",
+	[PTR_TO_TCP_SOCK]	= "tcp_sock",
+	[PTR_TO_TCP_SOCK_OR_NULL] = "tcp_sock_or_null",
 };
 
 static char slot_type_char[] = {
@@ -1201,6 +1205,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		return true;
 	default:
 		return false;
@@ -1638,6 +1644,9 @@ static int check_sock_access(struct bpf_verifier_env *env, u32 regno, int off,
 	case PTR_TO_SOCKET:
 		valid = bpf_sock_is_valid_access(off, size, t, &info);
 		break;
+	case PTR_TO_TCP_SOCK:
+		valid = bpf_tcp_sock_is_valid_access(off, size, t, &info);
+		break;
 	default:
 		valid = false;
 	}
@@ -1795,6 +1804,9 @@ static int check_ptr_alignment(struct bpf_verifier_env *env,
 	case PTR_TO_SOCK_COMMON:
 		pointer_desc = "sock_common ";
 		break;
+	case PTR_TO_TCP_SOCK:
+		pointer_desc = "tcp_sock ";
+		break;
 	default:
 		break;
 	}
@@ -3021,6 +3033,10 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 			/* For mark_ptr_or_null_reg() */
 			regs[BPF_REG_0].id = ++env->id_gen;
 		}
+	} else if (fn->ret_type == RET_PTR_TO_TCP_SOCK_OR_NULL) {
+		mark_reg_known_zero(env, regs, BPF_REG_0);
+		regs[BPF_REG_0].type = PTR_TO_TCP_SOCK_OR_NULL;
+		regs[BPF_REG_0].id = ++env->id_gen;
 	} else {
 		verbose(env, "unknown return type %d of func %s#%d\n",
 			fn->ret_type, func_id_name(func_id), func_id);
@@ -3282,6 +3298,8 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		verbose(env, "R%d pointer arithmetic on %s prohibited\n",
 			dst, reg_type_str[ptr_reg->type]);
 		return -EACCES;
@@ -4517,6 +4535,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 			reg->type = PTR_TO_SOCKET;
 		} else if (reg->type == PTR_TO_SOCK_COMMON_OR_NULL) {
 			reg->type = PTR_TO_SOCK_COMMON;
+		} else if (reg->type == PTR_TO_TCP_SOCK_OR_NULL) {
+			reg->type = PTR_TO_TCP_SOCK;
 		}
 
 		if (is_null || !reg_is_refcounted(reg)) {
@@ -5704,6 +5724,8 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		/* Only valid matches are exact, which memcmp() above
 		 * would have accepted
 		 */
@@ -6023,6 +6045,8 @@ static bool reg_type_mismatch_ok(enum bpf_reg_type type)
 	case PTR_TO_SOCKET_OR_NULL:
 	case PTR_TO_SOCK_COMMON:
 	case PTR_TO_SOCK_COMMON_OR_NULL:
+	case PTR_TO_TCP_SOCK:
+	case PTR_TO_TCP_SOCK_OR_NULL:
 		return false;
 	default:
 		return true;
@@ -6997,6 +7021,9 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		case PTR_TO_SOCK_COMMON:
 			convert_ctx_access = bpf_sock_convert_ctx_access;
 			break;
+		case PTR_TO_TCP_SOCK:
+			convert_ctx_access = bpf_tcp_sock_convert_ctx_access;
+			break;
 		default:
 			continue;
 		}
diff --git a/net/core/filter.c b/net/core/filter.c
index b3b4b5faede4..bdc39c3081ff 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5305,6 +5305,77 @@ static const struct bpf_func_proto bpf_sock_addr_sk_lookup_udp_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
+bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+				  struct bpf_insn_access_aux *info)
+{
+	if (off < 0 || off >= offsetofend(struct bpf_tcp_sock, bytes_acked))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case offsetof(struct bpf_tcp_sock, bytes_received):
+	case offsetof(struct bpf_tcp_sock, bytes_acked):
+		return size == sizeof(__u64);
+	default:
+		return size == sizeof(__u32);
+	}
+}
+
+u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
+				    const struct bpf_insn *si,
+				    struct bpf_insn *insn_buf,
+				    struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+#define BPF_TCP_SOCK_GET_COMMON(FIELD)					\
+	do {								\
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD) >	\
+			     FIELD_SIZEOF(struct bpf_tcp_sock, FIELD));	\
+		*insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock, FIELD), \
+				      si->dst_reg, si->src_reg,		\
+				      offsetof(struct tcp_sock, FIELD)); \
+	} while (0)
+
+	CONVERT_COMMON_TCP_SOCK_FIELDS(struct bpf_tcp_sock,
+				       BPF_TCP_SOCK_GET_COMMON);
+
+	if (insn > insn_buf)
+		return insn - insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct bpf_tcp_sock, rtt_min):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+			     sizeof(struct minmax));
+		BUILD_BUG_ON(sizeof(struct minmax) <
+			     sizeof(struct minmax_sample));
+
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct tcp_sock, rtt_min) +
+				      offsetof(struct minmax_sample, v));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+BPF_CALL_1(bpf_tcp_sock, const struct sock *, sk)
+{
+	if (sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP)
+		return (unsigned long)sk;
+
+	return (unsigned long)NULL;
+}
+
+static const struct bpf_func_proto bpf_tcp_sock_proto = {
+	.func		= bpf_tcp_sock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_TCP_SOCK_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5450,6 +5521,10 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_local_storage_proto;
 	case BPF_FUNC_sk_fullsock:
 		return &bpf_sk_fullsock_proto;
+#ifdef CONFIG_INET
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
+#endif
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}
@@ -5540,6 +5615,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_lookup_udp_proto;
 	case BPF_FUNC_sk_release:
 		return &bpf_sk_release_proto;
+	case BPF_FUNC_tcp_sock:
+		return &bpf_tcp_sock_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
-- 
2.17.1



* [PATCH bpf-next 4/6] bpf: Sync bpf.h to tools/
  2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2019-02-01  7:03 ` [PATCH bpf-next 3/6] bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock Martin KaFai Lau
@ 2019-02-01  7:04 ` Martin KaFai Lau
  2019-02-01  7:04 ` [PATCH bpf-next 5/6] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifier Martin KaFai Lau
  2019-02-01  7:04 ` [PATCH bpf-next 6/6] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock Martin KaFai Lau
  5 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:04 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This patch syncs the uapi bpf.h to tools/.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/include/uapi/linux/bpf.h | 61 +++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60b99b730a41..5f6c947abdd8 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2328,6 +2328,23 @@ union bpf_attr {
  *		"**y**".
  *	Return
  *		0
+ *
+ * struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)
+ *	Description
+ *		This helper tests if a bpf_sock is a fullsock such that
+ *		all the fields in bpf_sock can be accessed.
+ *	Return
+ *		The same pointer if *sk* is a fullsock.  Otherwise,
+ *		NULL is returned.
+ *
+ * struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk)
+ *	Description
+ *		Obtain a **struct bpf_tcp_sock** pointer from a
+ *              **struct bpf_sock** pointer.
+ *
+ *	Return
+ *		Pointer to **struct bpf_tcp_sock**, or **NULL** in
+ *		case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2422,7 +2439,9 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(sk_fullsock),		\
+	FN(tcp_sock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -2542,6 +2561,7 @@ struct __sk_buff {
 	__u64 tstamp;
 	__u32 wire_len;
 	__u32 gso_segs;
+	__bpf_md_ptr(struct bpf_sock *, sk);
 };
 
 struct bpf_tunnel_key {
@@ -2604,6 +2624,45 @@ struct bpf_sock {
 				 */
 };
 
+struct bpf_tcp_sock {
+	__u32 snd_cwnd;		/* Sending congestion window		*/
+	__u32 srtt_us;		/* smoothed round trip time << 3 in usecs */
+	__u32 rtt_min;
+	__u32 snd_ssthresh;	/* Slow start size threshold		*/
+	__u32 rcv_nxt;		/* What we want to receive next		*/
+	__u32 snd_nxt;		/* Next sequence we send		*/
+	__u32 snd_una;		/* First byte we want an ack for	*/
+	__u32 mss_cache;	/* Cached effective mss, not including SACKS */
+	__u32 ecn_flags;	/* ECN status bits.			*/
+	__u32 rate_delivered;	/* saved rate sample: packets delivered */
+	__u32 rate_interval_us;	/* saved rate sample: time elapsed */
+	__u32 packets_out;	/* Packets which are "in flight"	*/
+	__u32 retrans_out;	/* Retransmitted packets out		*/
+	__u32 total_retrans;	/* Total retransmits for entire connection */
+	__u32 segs_in;		/* RFC4898 tcpEStatsPerfSegsIn
+				 * total number of segments in.
+				 */
+	__u32 data_segs_in;	/* RFC4898 tcpEStatsPerfDataSegsIn
+				 * total number of data segments in.
+				 */
+	__u32 segs_out;		/* RFC4898 tcpEStatsPerfSegsOut
+				 * The total number of segments sent.
+				 */
+	__u32 data_segs_out;	/* RFC4898 tcpEStatsPerfDataSegsOut
+				 * total number of data segments sent.
+				 */
+	__u32 lost_out;		/* Lost packets			*/
+	__u32 sacked_out;	/* SACK'd packets			*/
+	__u64 bytes_received;	/* RFC4898 tcpEStatsAppHCThruOctetsReceived
+				 * sum(delta(rcv_nxt)), or how many bytes
+				 * were acked.
+				 */
+	__u64 bytes_acked;	/* RFC4898 tcpEStatsAppHCThruOctetsAcked
+				 * sum(delta(snd_una)), or how many bytes
+				 * were acked.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
-- 
2.17.1



* [PATCH bpf-next 5/6] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifier
  2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2019-02-01  7:04 ` [PATCH bpf-next 4/6] bpf: Sync bpf.h to tools/ Martin KaFai Lau
@ 2019-02-01  7:04 ` Martin KaFai Lau
  2019-02-01  7:04 ` [PATCH bpf-next 6/6] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock Martin KaFai Lau
  5 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:04 UTC
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This patch tests accessing skb->sk and the new helpers,
bpf_sk_fullsock and bpf_tcp_sock.

The errstr of some existing "reference tracking" tests is changed
with s/bpf_sock/sock/ and s/socket/sock/ where "sock" is from the
verifier's reg_type_str[].

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 .../selftests/bpf/verifier/ref_tracking.c     |   4 +-
 tools/testing/selftests/bpf/verifier/sock.c   | 238 ++++++++++++++++++
 tools/testing/selftests/bpf/verifier/unpriv.c |   2 +-
 3 files changed, 241 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/sock.c

diff --git a/tools/testing/selftests/bpf/verifier/ref_tracking.c b/tools/testing/selftests/bpf/verifier/ref_tracking.c
index dc2cc823df2b..3ed3593bd8b6 100644
--- a/tools/testing/selftests/bpf/verifier/ref_tracking.c
+++ b/tools/testing/selftests/bpf/verifier/ref_tracking.c
@@ -547,7 +547,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.result = REJECT,
 },
 {
@@ -562,7 +562,7 @@
 	BPF_EXIT_INSN(),
 	},
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
-	.errstr = "invalid bpf_sock access off=0 size=8",
+	.errstr = "invalid sock access off=0 size=8",
 	.result = REJECT,
 },
 {
diff --git a/tools/testing/selftests/bpf/verifier/sock.c b/tools/testing/selftests/bpf/verifier/sock.c
new file mode 100644
index 000000000000..a01ea3bc1c54
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -0,0 +1,238 @@
+{
+	"skb->sk: no NULL check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_common_or_null'",
+},
+{
+	"skb->sk: sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, family)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"skb->sk: sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, offsetof(struct bpf_sock, type)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid sock_common access",
+},
+{
+	"bpf_sk_fullsock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"sk_fullsock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'sock_or_null'",
+},
+{
+	"sk_fullsock(skb->sk): sk->type [fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, type)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"sk_fullsock(skb->sk): sk->family [non fullsock field]",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_sock, family)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): no !skb->sk check",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common_or_null expected=sock_common",
+},
+{
+	"bpf_tcp_sock(skb->sk): no NULL check on ret",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "invalid mem access 'tcp_sock_or_null'",
+},
+{
+	"bpf_tcp_sock(skb->sk): sk->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(skb->sk): sk->bytes_acked",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, bytes_acked)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_tcp_sock(bpf_sk_fullsock(skb->sk)): tp->snd_cwnd",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct bpf_tcp_sock, snd_cwnd)),
+	BPF_ALU64_IMM(BPF_AND, BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
+{
+	"bpf_sk_release(skb->sk)",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 1),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=sock_common expected=sock",
+},
+{
+	"bpf_sk_release(bpf_sk_fullsock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_sk_fullsock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "reference has not been acquired before",
+},
+{
+	"bpf_sk_release(bpf_tcp_sock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_sk_release),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	.result = REJECT,
+	.errstr = "type=tcp_sock expected=sock",
+},
diff --git a/tools/testing/selftests/bpf/verifier/unpriv.c b/tools/testing/selftests/bpf/verifier/unpriv.c
index dca58cf1a4ab..0013847993a8 100644
--- a/tools/testing/selftests/bpf/verifier/unpriv.c
+++ b/tools/testing/selftests/bpf/verifier/unpriv.c
@@ -364,7 +364,7 @@
 	},
 	.result = REJECT,
 	//.errstr = "same insn cannot be used with different pointers",
-	.errstr = "cannot write into socket",
+	.errstr = "cannot write into sock",
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 },
 {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf-next 6/6] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock
  2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2019-02-01  7:04 ` [PATCH bpf-next 5/6] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifier Martin KaFai Lau
@ 2019-02-01  7:04 ` Martin KaFai Lau
  5 siblings, 0 replies; 14+ messages in thread
From: Martin KaFai Lau @ 2019-02-01  7:04 UTC (permalink / raw)
  To: netdev; +Cc: Alexei Starovoitov, Daniel Borkmann, kernel-team, Lawrence Brakmo

This patch adds a C program to show the usage of
skb->sk and bpf_tcp_sock.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 tools/testing/selftests/bpf/Makefile          |   5 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   4 +
 .../testing/selftests/bpf/test_sock_fields.c  | 314 ++++++++++++++++++
 .../selftests/bpf/test_sock_fields_kern.c     | 145 ++++++++
 4 files changed, 466 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields.c
 create mode 100644 tools/testing/selftests/bpf/test_sock_fields_kern.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index c566090e5657..6c00f5e39b63 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -23,7 +23,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
-	test_netcnt test_tcpnotify_user
+	test_netcnt test_tcpnotify_user test_sock_fields
 
 BPF_OBJ_FILES = \
 	test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
@@ -35,7 +35,7 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o
+	xdp_dummy.o test_map_in_map.o test_sock_fields_kern.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
@@ -110,6 +110,7 @@ $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
 $(OUTPUT)/test_netcnt: cgroup_helpers.c
+$(OUTPUT)/test_sock_fields: cgroup_helpers.c
 
 .PHONY: force
 
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6c77cf7bedce..a5bb09f5e6fc 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -172,6 +172,10 @@ static int (*bpf_skb_vlan_pop)(void *ctx) =
 	(void *) BPF_FUNC_skb_vlan_pop;
 static int (*bpf_rc_pointer_rel)(void *ctx, int rel_x, int rel_y) =
 	(void *) BPF_FUNC_rc_pointer_rel;
+static struct bpf_sock *(*bpf_sk_fullsock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_sk_fullsock;
+static struct bpf_tcp_sock *(*bpf_tcp_sock)(struct bpf_sock *sk) =
+	(void *) BPF_FUNC_tcp_sock;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_sock_fields.c b/tools/testing/selftests/bpf/test_sock_fields.c
new file mode 100644
index 000000000000..1f8ff8e6caf4
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <sys/socket.h>
+#include <sys/epoll.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#include "cgroup_helpers.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+#define CHECK(condition, tag, format...) ({				\
+	int __ret = !!(condition);					\
+	if (__ret) {							\
+		printf("%s(%d):FAIL:%s ", __func__, __LINE__, tag);	\
+		printf(format);						\
+		printf("\n");						\
+		exit(-1);						\
+	}								\
+})
+
+#define TEST_CGROUP "/test-bpf-sock-fields"
+#define DATA "Hello BPF!"
+#define DATA_LEN sizeof(DATA)
+
+static struct sockaddr_in6 srv_sa6, cli_sa6;
+static int linum_map_fd;
+static int addr_map_fd;
+static int tp_map_fd;
+static int sk_map_fd;
+static __u32 srv_idx = SRV_IDX;
+static __u32 cli_idx = CLI_IDX;
+
+static void init_loopback6(struct sockaddr_in6 *sa6)
+{
+	memset(sa6, 0, sizeof(*sa6));
+	sa6->sin6_family = AF_INET6;
+	sa6->sin6_addr = in6addr_loopback;
+}
+
+static void print_sk(const struct bpf_sock *sk)
+{
+	char ip4[64];
+	char ip6[64];
+
+	inet_ntop(AF_INET, &sk->src_ip4, ip4, sizeof(ip4));
+	inet_ntop(AF_INET6, &sk->src_ip6, ip6, sizeof(ip6));
+
+	printf("bound_dev_if:%u family:%u type:%u protocol:%u mark:%u priority:%u "
+	       "src_ip4:%x(%s) src_ip6:%x:%x:%x:%x(%s) src_port:%u\n",
+	       sk->bound_dev_if, sk->family, sk->type, sk->protocol,
+	       sk->mark, sk->priority, sk->src_ip4, ip4,
+	       sk->src_ip6[0], sk->src_ip6[1], sk->src_ip6[2], sk->src_ip6[3], ip6,
+	       sk->src_port);
+}
+
+static void print_tp(const struct bpf_tcp_sock *tp)
+{
+	printf("snd_cwnd:%u srtt_us:%u rtt_min:%u snd_ssthresh:%u rcv_nxt:%u "
+	       "snd_nxt:%u snd:una:%u mss_cache:%u ecn_flags:%u "
+	       "rate_delivered:%u rate_interval_us:%u packets_out:%u "
+	       "retrans_out:%u total_retrans:%u segs_in:%u data_segs_in:%u "
+	       "segs_out:%u data_segs_out:%u lost_out:%u sacked_out:%u "
+	       "bytes_received:%llu bytes_acked:%llu\n",
+	       tp->snd_cwnd, tp->srtt_us, tp->rtt_min, tp->snd_ssthresh,
+	       tp->rcv_nxt, tp->snd_nxt, tp->snd_una, tp->mss_cache,
+	       tp->ecn_flags, tp->rate_delivered, tp->rate_interval_us,
+	       tp->packets_out, tp->retrans_out, tp->total_retrans,
+	       tp->segs_in, tp->data_segs_in, tp->segs_out,
+	       tp->data_segs_out, tp->lost_out, tp->sacked_out,
+	       tp->bytes_received, tp->bytes_acked);
+}
+
+static void check_result(void)
+{
+	struct bpf_tcp_sock srv_tp, cli_tp;
+	struct bpf_sock srv_sk, cli_sk;
+	__u32 linum, idx0 = 0;
+	int err;
+
+	err = bpf_map_lookup_elem(linum_map_fd, &idx0, &linum);
+	CHECK(err == -1, "bpf_map_lookup_elem(linum_map_fd)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &srv_idx, &srv_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &srv_idx, &srv_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &srv_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	err = bpf_map_lookup_elem(sk_map_fd, &cli_idx, &cli_sk);
+	CHECK(err == -1, "bpf_map_lookup_elem(sk_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+	err = bpf_map_lookup_elem(tp_map_fd, &cli_idx, &cli_tp);
+	CHECK(err == -1, "bpf_map_lookup_elem(tp_map_fd, &cli_idx)",
+	      "err:%d errno:%d", err, errno);
+
+	printf("srv_sk: ");
+	print_sk(&srv_sk);
+	printf("\n");
+
+	printf("cli_sk: ");
+	print_sk(&cli_sk);
+	printf("\n");
+
+	printf("srv_tp: ");
+	print_tp(&srv_tp);
+	printf("\n");
+
+	printf("cli_tp: ");
+	print_tp(&cli_tp);
+	printf("\n");
+
+	CHECK(srv_sk.family != AF_INET6 ||
+	      srv_sk.protocol != IPPROTO_TCP ||
+	      memcmp(srv_sk.src_ip6, &in6addr_loopback,
+		     sizeof(srv_sk.src_ip6)) ||
+	      srv_sk.src_port != ntohs(srv_sa6.sin6_port),
+	      "Unexpected srv_sk", "Check srv_sk output. linum:%u",
+	      linum);
+
+	CHECK(cli_sk.family != AF_INET6 ||
+	      cli_sk.protocol != IPPROTO_TCP ||
+	      memcmp(cli_sk.src_ip6, &in6addr_loopback,
+		     sizeof(cli_sk.src_ip6)) ||
+	      cli_sk.src_port != ntohs(cli_sa6.sin6_port),
+	      "Unexpected cli_sk", "Check cli_sk output. linum:%u",
+	      linum);
+
+	CHECK(srv_tp.data_segs_out != 1 ||
+	      srv_tp.data_segs_in ||
+	      srv_tp.snd_cwnd != 10 ||
+	      srv_tp.total_retrans ||
+	      srv_tp.bytes_acked != DATA_LEN,
+	      "Unexpected srv_tp", "Check srv_tp output. linum:%u",
+	      linum);
+
+	CHECK(cli_tp.data_segs_out ||
+	      cli_tp.data_segs_in != 1 ||
+	      cli_tp.snd_cwnd != 10 ||
+	      cli_tp.total_retrans ||
+	      cli_tp.bytes_received != DATA_LEN,
+	      "Unexpected cli_tp", "Check cli_tp output. linum:%u",
+	      linum);
+}
+
+static void test(void)
+{
+	int listen_fd, cli_fd, accept_fd, epfd, err;
+	struct epoll_event ev;
+	socklen_t addrlen;
+
+	addrlen = sizeof(struct sockaddr_in6);
+	ev.events = EPOLLIN;
+
+	epfd = epoll_create(1);
+	CHECK(epfd == -1, "epoll_create()", "epfd:%d errno:%d", epfd, errno);
+
+	/* Prepare listen_fd */
+	listen_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(listen_fd == -1, "socket()", "listen_fd:%d errno:%d",
+	      listen_fd, errno);
+
+	init_loopback6(&srv_sa6);
+	err = bind(listen_fd, (struct sockaddr *)&srv_sa6, sizeof(srv_sa6));
+	CHECK(err, "bind(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(listen_fd, (struct sockaddr *)&srv_sa6, &addrlen);
+	CHECK(err, "getsockname(listen_fd)", "err:%d errno:%d", err, errno);
+
+	err = listen(listen_fd, 0);
+	CHECK(err, "listen(listen_fd)", "err:%d errno:%d", err, errno);
+
+	/* Prepare cli_fd */
+	cli_fd = socket(AF_INET6, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	CHECK(cli_fd == -1, "socket()", "cli_fd:%d errno:%d", cli_fd, errno);
+
+	init_loopback6(&cli_sa6);
+	err = bind(cli_fd, (struct sockaddr *)&cli_sa6, sizeof(cli_sa6));
+	CHECK(err, "bind(cli_fd)", "err:%d errno:%d", err, errno);
+
+	err = getsockname(cli_fd, (struct sockaddr *)&cli_sa6, &addrlen);
+	CHECK(err, "getsockname(cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Update addr_map with srv_sa6 and cli_sa6 */
+	err = bpf_map_update_elem(addr_map_fd, &srv_idx, &srv_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	err = bpf_map_update_elem(addr_map_fd, &cli_idx, &cli_sa6, 0);
+	CHECK(err, "map_update", "err:%d errno:%d", err, errno);
+
+	/* Connect from cli_sa6 to srv_sa6 */
+	err = connect(cli_fd, (struct sockaddr *)&srv_sa6, addrlen);
+	printf("srv_sa6.sin6_port:%u cli_sa6.sin6_port:%u\n\n",
+	       ntohs(srv_sa6.sin6_port), ntohs(cli_sa6.sin6_port));
+	CHECK(err && errno != EINPROGRESS,
+	      "connect(cli_fd)", "err:%d errno:%d", err, errno);
+
+	ev.data.fd = listen_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, listen_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Accept the connection */
+	/* Have some timeout in accept(listen_fd). Just in case. */
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != listen_fd,
+	      "epoll_wait(listen_fd)",
+	      "err:%d errno:%d ev.data.fd:%d listen_fd:%d",
+	      err, errno, ev.data.fd, listen_fd);
+
+	accept_fd = accept(listen_fd, NULL, NULL);
+	CHECK(accept_fd == -1, "accept(listen_fd)", "accept_fd:%d errno:%d",
+	      accept_fd, errno);
+	close(listen_fd);
+
+	/* Send some data from accept_fd to cli_fd */
+	err = send(accept_fd, DATA, DATA_LEN, 0);
+	CHECK(err != DATA_LEN, "send(accept_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	/* Have some timeout in recv(cli_fd). Just in case. */
+	ev.data.fd = cli_fd;
+	err = epoll_ctl(epfd, EPOLL_CTL_ADD, cli_fd, &ev);
+	CHECK(err, "epoll_ctl(EPOLL_CTL_ADD, cli_fd)", "err:%d errno:%d",
+	      err, errno);
+
+	err = epoll_wait(epfd, &ev, 1, 1000);
+	CHECK(err != 1 || ev.data.fd != cli_fd,
+	      "epoll_wait(cli_fd)", "err:%d errno:%d ev.data.fd:%d cli_fd:%d",
+	      err, errno, ev.data.fd, cli_fd);
+
+	err = recv(cli_fd, NULL, 0, MSG_TRUNC);
+	CHECK(err, "recv(cli_fd)", "err:%d errno:%d", err, errno);
+
+	close(epfd);
+	close(accept_fd);
+	close(cli_fd);
+
+	check_result();
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_prog_load_attr attr = {
+		.file = "test_sock_fields_kern.o",
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
+	};
+	int cgroup_fd, prog_fd, err;
+	struct bpf_object *obj;
+	struct bpf_map *map;
+
+	err = setup_cgroup_environment();
+	CHECK(err, "setup_cgroup_environment()", "err:%d errno:%d",
+	      err, errno);
+
+	atexit(cleanup_cgroup_environment);
+
+	/* Create a cgroup, get fd, and join it */
+	cgroup_fd = create_and_get_cgroup(TEST_CGROUP);
+	CHECK(cgroup_fd == -1, "create_and_get_cgroup()",
+	      "cgroup_fd:%d errno:%d", cgroup_fd, errno);
+
+	err = join_cgroup(TEST_CGROUP);
+	CHECK(err, "join_cgroup", "err:%d errno:%d", err, errno);
+
+	err = bpf_prog_load_xattr(&attr, &obj, &prog_fd);
+	CHECK(err, "bpf_prog_load_xattr()", "err:%d", err);
+
+	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_INET_EGRESS, 0);
+	CHECK(err == -1, "bpf_prog_attach(CPF_CGROUP_INET_EGRESS)",
+	      "err:%d errno%d", err, errno);
+	close(cgroup_fd);
+
+	map = bpf_object__find_map_by_name(obj, "addr_map");
+	CHECK(!map, "cannot find addr_map", "(null)");
+	addr_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "sock_result_map");
+	CHECK(!map, "cannot find sock_result_map", "(null)");
+	sk_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "tcp_sock_result_map");
+	CHECK(!map, "cannot find tcp_sock_result_map", "(null)");
+	tp_map_fd = bpf_map__fd(map);
+
+	map = bpf_object__find_map_by_name(obj, "linum_map");
+	CHECK(!map, "cannot find linum_map", "(null)");
+	linum_map_fd = bpf_map__fd(map);
+
+	test();
+
+	bpf_object__close(obj);
+	cleanup_cgroup_environment();
+
+	printf("PASS\n");
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_sock_fields_kern.c b/tools/testing/selftests/bpf/test_sock_fields_kern.c
new file mode 100644
index 000000000000..09d4d8d34588
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_sock_fields_kern.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook */
+
+#include <linux/bpf.h>
+#include <netinet/in.h>
+#include <stdbool.h>
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+enum bpf_array_idx {
+	SRV_IDX,
+	CLI_IDX,
+	__NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") addr_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct sockaddr_in6),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") tcp_sock_result_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct bpf_tcp_sock),
+	.max_entries = __NR_BPF_ARRAY_IDX,
+};
+
+struct bpf_map_def SEC("maps") linum_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
+static bool is_loopback6(__u32 *a6)
+{
+	return !a6[0] && !a6[1] && !a6[2] && a6[3] == bpf_htonl(1);
+}
+
+static void skcpy(struct bpf_sock *dst,
+		  const struct bpf_sock *src)
+{
+	dst->bound_dev_if = src->bound_dev_if;
+	dst->family = src->family;
+	dst->type = src->type;
+	dst->protocol = src->protocol;
+	dst->mark = src->mark;
+	dst->priority = src->priority;
+	dst->src_ip4 = src->src_ip4;
+	dst->src_ip6[0] = src->src_ip6[0];
+	dst->src_ip6[1] = src->src_ip6[1];
+	dst->src_ip6[2] = src->src_ip6[2];
+	dst->src_ip6[3] = src->src_ip6[3];
+	dst->src_port = src->src_port;
+}
+
+static void tpcpy(struct bpf_tcp_sock *dst,
+		  const struct bpf_tcp_sock *src)
+{
+	dst->snd_cwnd = src->snd_cwnd;
+	dst->srtt_us = src->srtt_us;
+	dst->rtt_min = src->rtt_min;
+	dst->snd_ssthresh = src->snd_ssthresh;
+	dst->rcv_nxt = src->rcv_nxt;
+	dst->snd_nxt = src->snd_nxt;
+	dst->snd_una = src->snd_una;
+	dst->mss_cache = src->mss_cache;
+	dst->ecn_flags = src->ecn_flags;
+	dst->rate_delivered = src->rate_delivered;
+	dst->rate_interval_us = src->rate_interval_us;
+	dst->packets_out = src->packets_out;
+	dst->retrans_out = src->retrans_out;
+	dst->total_retrans = src->total_retrans;
+	dst->segs_in = src->segs_in;
+	dst->data_segs_in = src->data_segs_in;
+	dst->segs_out = src->segs_out;
+	dst->data_segs_out = src->data_segs_out;
+	dst->lost_out = src->lost_out;
+	dst->sacked_out = src->sacked_out;
+	dst->bytes_received = src->bytes_received;
+	dst->bytes_acked = src->bytes_acked;
+}
+
+#define RETURN {						\
+	linum = __LINE__;					\
+	bpf_map_update_elem(&linum_map, &idx0, &linum, 0);	\
+	return 1;						\
+}
+
+SEC("cgroup_skb/egress")
+int read_sock_fields(struct __sk_buff *skb)
+{
+	__u32 srv_idx = SRV_IDX, cli_idx = CLI_IDX, idx;
+	struct sockaddr_in6 *srv_sa6, *cli_sa6;
+	struct bpf_tcp_sock *tp, *tp_ret;
+	struct bpf_sock *sk, *sk_ret;
+	__u32 linum, idx0 = 0;
+
+	sk = skb->sk;
+	if (!sk)
+		RETURN;
+
+	sk = bpf_sk_fullsock(sk);
+	if (!sk || sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP ||
+	    !is_loopback6(sk->src_ip6))
+		RETURN;
+
+	tp = bpf_tcp_sock(sk);
+	if (!tp)
+		RETURN;
+
+	srv_sa6 = bpf_map_lookup_elem(&addr_map, &srv_idx);
+	cli_sa6 = bpf_map_lookup_elem(&addr_map, &cli_idx);
+	if (!srv_sa6 || !cli_sa6)
+		RETURN;
+
+	if (sk->src_port == bpf_ntohs(srv_sa6->sin6_port))
+		idx = srv_idx;
+	else if (sk->src_port == bpf_ntohs(cli_sa6->sin6_port))
+		idx = cli_idx;
+	else
+		RETURN;
+
+	sk_ret = bpf_map_lookup_elem(&sock_result_map, &idx);
+	tp_ret = bpf_map_lookup_elem(&tcp_sock_result_map, &idx);
+	if (!sk_ret || !tp_ret)
+		RETURN;
+
+	skcpy(sk_ret, sk);
+	tpcpy(tp_ret, tp);
+
+	RETURN;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-01  7:03 ` [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper Martin KaFai Lau
@ 2019-02-02  2:38   ` Alexei Starovoitov
  2019-02-04 22:33   ` Daniel Borkmann
  1 sibling, 0 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2019-02-02  2:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, kernel-team,
	Lawrence Brakmo, joe, john.fastabend, willemb

On Thu, Jan 31, 2019 at 11:03:56PM -0800, Martin KaFai Lau wrote:
> In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
> before accessing the fields in sock.  For example, in __netdev_pick_tx:
> 
> static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> 			    struct net_device *sb_dev)
> {
> 	/* ... */
> 
> 	struct sock *sk = skb->sk;
> 
> 		if (queue_index != new_index && sk &&
> 		    sk_fullsock(sk) &&
> 		    rcu_access_pointer(sk->sk_dst_cache))
> 			sk_tx_queue_set(sk, new_index);
> 
> 	/* ... */
> 
> 	return queue_index;
> }
> 
> This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
...
> Some of the fields in "bpf_sock" will not be directly
> accessible through the "__sk_buff->sk" pointer.
...
> The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> can be used to get a sk with all accessible fields in "bpf_sock".
> This helper is added to both cg_skb and sched_(cls|act).
> 
> int cg_skb_foo(struct __sk_buff *skb) {
> 	struct bpf_sock *sk;
> 	__u32 family;
> 
> 	sk = skb->sk;
> 	if (!sk)
> 		return 1;
> 
> 	sk = bpf_sk_fullsock(sk);
> 	if (!sk)
> 		return 1;
> 
> 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> 		return 1;
> 
> 	/* some_traffic_shaping(); */
> 
> 	return 1;
> }
> 
> (1) The sk is read only
> 
> (2) There is no new "struct bpf_sock_common" introduced.
> 
> (3) Future kernel sock's members could be added to bpf_sock only
>     instead of repeatedly adding at multiple places like currently
>     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.

All,

this patchset sets a direction for how access to kernel socket data
structures should be made from bpf programs of the networking types.

It makes bpf program access to sk_common, sk, tcp_sock fields look and feel
like kernel code.
We think it's the most flexible approach and it fixes the copy-paste
issue of the existing api.
I wish we thought of it earlier :)

Please review.

For the patch set:
Acked-by: Alexei Starovoitov <ast@kernel.org>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-01  7:03 ` [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper Martin KaFai Lau
  2019-02-02  2:38   ` Alexei Starovoitov
@ 2019-02-04 22:33   ` Daniel Borkmann
  2019-02-05  0:50     ` Martin Lau
  1 sibling, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2019-02-04 22:33 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev; +Cc: Alexei Starovoitov, kernel-team, Lawrence Brakmo

Hi Martin,

On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
> before accessing the fields in sock.  For example, in __netdev_pick_tx:
> 
> static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> 			    struct net_device *sb_dev)
> {
> 	/* ... */
> 
> 	struct sock *sk = skb->sk;
> 
> 		if (queue_index != new_index && sk &&
> 		    sk_fullsock(sk) &&
> 		    rcu_access_pointer(sk->sk_dst_cache))
> 			sk_tx_queue_set(sk, new_index);
> 
> 	/* ... */
> 
> 	return queue_index;
> }
> 
> This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> where a few of the convert_ctx_access() in filter.c have already been
> accessing the skb->sk sock_common's fields,
> e.g. sock_ops_convert_ctx_access().
> 
> "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> Some of the fields in "bpf_sock" will not be directly
> accessible through the "__sk_buff->sk" pointer.  It is limited
> by the new "bpf_sock_common_is_valid_access()".
> e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
>      are not allowed.
> 
> The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> can be used to get a sk with all accessible fields in "bpf_sock".
> This helper is added to both cg_skb and sched_(cls|act).
> 
> int cg_skb_foo(struct __sk_buff *skb) {
> 	struct bpf_sock *sk;
> 	__u32 family;
> 
> 	sk = skb->sk;
> 	if (!sk)
> 		return 1;
> 
> 	sk = bpf_sk_fullsock(sk);
> 	if (!sk)
> 		return 1;
> 
> 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> 		return 1;
> 
> 	/* some_traffic_shaping(); */
> 
> 	return 1;
> }
> 
> (1) The sk is read only
> 
> (2) There is no new "struct bpf_sock_common" introduced.
> 
> (3) Future kernel sock's members could be added to bpf_sock only
>     instead of repeatedly adding at multiple places like currently
>     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> 
> (4) After "sk = skb->sk", the reg holding sk is in type
>     PTR_TO_SOCK_COMMON_OR_NULL.
> 
> (5) After bpf_sk_fullsock(), the return type will be in type
>     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
>     bpf_sk_lookup_xxx().
> 
>     However, bpf_sk_fullsock() does not take refcnt.  The
>     acquire_reference_state() is only depending on the return type now.
>     To avoid it, a new is_acquire_function() is checked before calling
>     acquire_reference_state().

Bit unfortunate that a helper like bpf_sk_fullsock() would be needed; after
all, this is more of an implementation detail which we would expose here to
the developer.

Is there a specific reason why fetching skb->sk couldn't already be of the
type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
needed and most logic we have today could already be reused (modulo refcnt
avoidance)?

In particular, do you need the skb->sk without the full-sk part somewhere
(e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
parent where you could also access remaining bpf_sock fields?

This could then also be plugged into bpf_tcp_sock() given this needs to be
full sk anyway.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-04 22:33   ` Daniel Borkmann
@ 2019-02-05  0:50     ` Martin Lau
  2019-02-07  7:27       ` Martin Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Lau @ 2019-02-05  0:50 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Alexei Starovoitov, Kernel Team, Lawrence Brakmo

On Mon, Feb 04, 2019 at 11:33:28PM +0100, Daniel Borkmann wrote:
> Hi Martin,
> 
> On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> > In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
> > before accessing the fields in sock.  For example, in __netdev_pick_tx:
> > 
> > static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> > 			    struct net_device *sb_dev)
> > {
> > 	/* ... */
> > 
> > 	struct sock *sk = skb->sk;
> > 
> > 		if (queue_index != new_index && sk &&
> > 		    sk_fullsock(sk) &&
> > 		    rcu_access_pointer(sk->sk_dst_cache))
> > 			sk_tx_queue_set(sk, new_index);
> > 
> > 	/* ... */
> > 
> > 	return queue_index;
> > }
> > 
> > This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> > where a few of the convert_ctx_access() in filter.c have already been
> > accessing the skb->sk sock_common's fields,
> > e.g. sock_ops_convert_ctx_access().
> > 
> > "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> > Some of the fields in "bpf_sock" will not be directly
> > accessible through the "__sk_buff->sk" pointer.  It is limited
> > by the new "bpf_sock_common_is_valid_access()".
> > e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
> >      are not allowed.
> > 
> > The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> > can be used to get a sk with all accessible fields in "bpf_sock".
> > This helper is added to both cg_skb and sched_(cls|act).
> > 
> > int cg_skb_foo(struct __sk_buff *skb) {
> > 	struct bpf_sock *sk;
> > 	__u32 family;
> > 
> > 	sk = skb->sk;
> > 	if (!sk)
> > 		return 1;
> > 
> > 	sk = bpf_sk_fullsock(sk);
> > 	if (!sk)
> > 		return 1;
> > 
> > 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> > 		return 1;
> > 
> > 	/* some_traffic_shaping(); */
> > 
> > 	return 1;
> > }
> > 
> > (1) The sk is read only
> > 
> > (2) There is no new "struct bpf_sock_common" introduced.
> > 
> > (3) Future kernel sock's members could be added to bpf_sock only
> >     instead of repeatedly adding at multiple places like currently
> >     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> > 
> > (4) After "sk = skb->sk", the reg holding sk is in type
> >     PTR_TO_SOCK_COMMON_OR_NULL.
> > 
> > (5) After bpf_sk_fullsock(), the return type will be in type
> >     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
> >     bpf_sk_lookup_xxx().
> > 
> >     However, bpf_sk_fullsock() does not take refcnt.  The
> >     acquire_reference_state() is only depending on the return type now.
> >     To avoid it, a new is_acquire_function() is checked before calling
> >     acquire_reference_state().
> 
> Bit unfortunate that a helper like bpf_sk_fullsock() would be needed; after
> all, this is more of an implementation detail which we would expose here to
> the developer.
> 
> Is there a specific reason why fetching skb->sk couldn't already be of the
> type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
> needed and most logic we have today could already be reused (modulo refcnt
> avoidance)?
Not all running contexts have a fullsock (PTR_TO_SOCKET_OR_NULL).

Based on how sk_to_full_sk() is used (e.g. in bpf_get_socket_uid()),
not sure a sk (e.g. tw sock) can always be traced back to a full sk.

In terms of the patch implementation, there is not much difference.  It is
a bit simpler without bpf_sk_fullsock() and PTR_TO_SOCK_COMMON(_OR_NULL),
but not by a lot.  I have tried both.

The "fullsock" has already been exposed in another form.
e.g. in sock_ops, the tcp_sock fields are not read if it is not a fullsock,
while other sock_common fields are still available.  The bpf_prog
can test sock_ops->is_fullsock to decide what to do.

> 
> In particular, do you need the skb->sk without the full-sk part somewhere
> (e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
> helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
> parent where you could also access remaining bpf_sock fields?
I am thinking more about the case where the bpf_prog only needs the fields
from sock_common (e.g. the src/dst ip/port) and the skb already has the
other needed info (e.g. protocol/mark/priority).
Enforcing that skb->sk must be a fullsock would unnecessarily prevent
those bpf_progs from seeing all skbs.
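
Something like this sketch (illustrative only) should keep seeing
req/tw socks:

int cg_skb_common_foo(struct __sk_buff *skb)
{
	struct bpf_sock *sk = skb->sk;	/* PTR_TO_SOCK_COMMON_OR_NULL */

	if (!sk)
		return 1;

	/* sock_common fields (family, src_ip4/6, src_port) stay
	 * readable without calling bpf_sk_fullsock()
	 */
	if (sk->family != AF_INET6)
		return 1;

	/* protocol/mark/priority can come from the skb itself */
	return 1;
}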

A "struct bpf_common_sock" could be added instead vs a bpf_sk_fullsock()
tester.  I think having one "struct bpf_sock" is better and less confusing.

Later, for a running context that is guaranteed to have a fullsock,
skb->sk can directly be PTR_TO_SOCKET_OR_NULL instead of
PTR_TO_SOCK_COMMON_OR_NULL.

Thanks,
Martin

> 
> This could then also be plugged into bpf_tcp_sock() given this needs to be
> full sk anyway.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-05  0:50     ` Martin Lau
@ 2019-02-07  7:27       ` Martin Lau
  2019-02-07 22:21         ` Daniel Borkmann
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Lau @ 2019-02-07  7:27 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Alexei Starovoitov, Kernel Team, Lawrence Brakmo

On Mon, Feb 04, 2019 at 04:50:32PM -0800, Martin Lau wrote:
> On Mon, Feb 04, 2019 at 11:33:28PM +0100, Daniel Borkmann wrote:
> > Hi Martin,
> > 
> > On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> > > In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
> > > before accessing the fields in sock.  For example, in __netdev_pick_tx:
> > > 
> > > static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> > > 			    struct net_device *sb_dev)
> > > {
> > > 	/* ... */
> > > 
> > > 	struct sock *sk = skb->sk;
> > > 
> > > 		if (queue_index != new_index && sk &&
> > > 		    sk_fullsock(sk) &&
> > > 		    rcu_access_pointer(sk->sk_dst_cache))
> > > 			sk_tx_queue_set(sk, new_index);
> > > 
> > > 	/* ... */
> > > 
> > > 	return queue_index;
> > > }
> > > 
> > > This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> > > where a few of the convert_ctx_access() in filter.c have already been
> > > accessing the skb->sk sock_common's fields,
> > > e.g. sock_ops_convert_ctx_access().
> > > 
> > > "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> > > Some of the fields in "bpf_sock" will not be directly
> > > accessible through the "__sk_buff->sk" pointer.  It is limited
> > > by the new "bpf_sock_common_is_valid_access()".
> > > e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
> > >      are not allowed.
> > > 
> > > The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> > > can be used to get a sk with all accessible fields in "bpf_sock".
> > > This helper is added to both cg_skb and sched_(cls|act).
> > > 
> > > int cg_skb_foo(struct __sk_buff *skb) {
> > > 	struct bpf_sock *sk;
> > > 	__u32 family;
> > > 
> > > 	sk = skb->sk;
> > > 	if (!sk)
> > > 		return 1;
> > > 
> > > 	sk = bpf_sk_fullsock(sk);
> > > 	if (!sk)
> > > 		return 1;
> > > 
> > > 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> > > 		return 1;
> > > 
> > > 	/* some_traffic_shaping(); */
> > > 
> > > 	return 1;
> > > }
> > > 
> > > (1) The sk is read only
> > > 
> > > (2) There is no new "struct bpf_sock_common" introduced.
> > > 
> > > (3) Future kernel sock's members could be added to bpf_sock only
> > >     instead of repeatedly adding at multiple places like currently
> > >     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> > > 
> > > (4) After "sk = skb->sk", the reg holding sk is in type
> > >     PTR_TO_SOCK_COMMON_OR_NULL.
> > > 
> > > (5) After bpf_sk_fullsock(), the return type will be in type
> > >     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
> > >     bpf_sk_lookup_xxx().
> > > 
> > >     However, bpf_sk_fullsock() does not take refcnt.  The
> > >     acquire_reference_state() is only depending on the return type now.
> > >     To avoid it, a new is_acquire_function() is checked before calling
> > >     acquire_reference_state().
> > 
> > Bit unfortunate that a helper like bpf_sk_fullsock() would be needed; after
> > all, this is more of an implementation detail which we would expose here to
> > the developer.
> > 
> > Is there a specific reason why fetching skb->sk couldn't already be of the
> > type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
> > needed and most logic we have today could already be reused (modulo refcnt
> > avoidance)?
> Not all running contexts have a fullsock (PTR_TO_SOCKET_OR_NULL).
> 
> Based on how sk_to_full_sk() is used (e.g. in bpf_get_socket_uid()),
> not sure a sk (e.g. tw sock) can always be traced back to a full sk.
> 
> In terms of the patch implementation, there is not much difference.  It is
> a bit simpler without bpf_sk_fullsock() and PTR_TO_SOCK_COMMON(_OR_NULL),
> but not by a lot.  I have tried both.
> 
> The "fullsock" has already been exposed in another form.
> e.g. in sock_ops, the tcp_sock fields are not read if it is not a fullsock,
> while other sock_common fields are still available.  The bpf_prog
> can test sock_ops->is_fullsock to decide what to do.
> 
> > 
> > In particular, do you need the skb->sk without the full-sk part somewhere
> > (e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
> > helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
> > parent where you could also access remaining bpf_sock fields?
> I am thinking more about the case where the bpf_prog only needs the fields
> from sock_common (e.g. the src/dst ip/port) and the skb already has the
> other needed info (e.g. protocol/mark/priority).
> Enforcing that skb->sk must be a fullsock would unnecessarily prevent
> those bpf_progs from seeing all skbs.
> 
> A "struct bpf_common_sock" could be added instead vs a bpf_sk_fullsock()
> tester.  I think having one "struct bpf_sock" is better and less confusing.
> 
> Later, for a running context that is guaranteed to have a fullsock,
> skb->sk can directly be PTR_TO_SOCKET_OR_NULL instead of
> PTR_TO_SOCK_COMMON_OR_NULL.
> 
Following up on the discussion in the iovisor conf call.

One of the discussions was about:
other than tw, can __sk_buff->sk always return a
fullsock (PTR_TO_SOCKET_OR_NULL)?  In the request_sock case,
it is doable because it can be traced back to the listener sock.

However, that goes back to the sock_common accessing question.
In particular, how to access the sock_common fields of the
request_sock itself?  Those fields in the request_sock are different
from its listener sock's, e.g. the skc_daddr and skc_dport.

Also, if the sock_common fields of tw are needed, it will become weird
because a new "struct bpf_tw_sock" would likely be needed, which is OK,
but all sock_common fields would need to be copied from bpf_sock
to bpf_tw_sock.

I think reading a sk from a ctx should return the
most basic type PTR_TO_SOCK_COMMON_OR_NULL (unless the running
ctx can guarantee that it always has a fullsock).
Currently, it is __sk_buff->sk.  Later, sock_ops->sk...etc.
Keep one single 'struct bpf_sock' and limit fullsock field access
at verification time.  The bpf_prog then moves down the chain
based on what it needs.  It could be fullsock, tcp_sock...etc.

I think that will be the most flexible way to write a bpf_prog
while also avoiding duplicate fields in different
bpf structs in the uapi.
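
In C, moving down the chain would look something like this fragment
(a sketch, not a complete program):

	struct bpf_sock *sk, *fullsk;
	struct bpf_tcp_sock *tp;

	sk = skb->sk;			/* PTR_TO_SOCK_COMMON_OR_NULL */
	if (!sk)
		return 1;

	fullsk = bpf_sk_fullsock(sk);	/* PTR_TO_SOCKET_OR_NULL */
	if (!fullsk)
		return 1;

	tp = bpf_tcp_sock(fullsk);	/* PTR_TO_TCP_SOCK_OR_NULL */
	if (!tp)
		return 1;

	/* e.g. fullsk->type, tp->snd_cwnd ... */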

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-07  7:27       ` Martin Lau
@ 2019-02-07 22:21         ` Daniel Borkmann
  2019-02-08  5:56           ` Martin Lau
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Borkmann @ 2019-02-07 22:21 UTC (permalink / raw)
  To: Martin Lau; +Cc: netdev, Alexei Starovoitov, Kernel Team, Lawrence Brakmo

On 02/07/2019 08:27 AM, Martin Lau wrote:
[...]
> Following up on the discussion in the iovisor conf call.
> 
> One of the discussions was about:
> other than tw, can __sk_buff->sk always return a
> fullsock (PTR_TO_SOCKET_OR_NULL)?  In the request_sock case,
> it is doable because it can be traced back to the listener sock.
> 
> However, that goes back to the sock_common accessing question.
> In particular, how to access the sock_common fields of the
> request_sock itself?  Those fields in the request_sock are different
> from its listener sock's, e.g. the skc_daddr and skc_dport.
> 
> Also, if the sock_common fields of tw are needed, it will become weird
> because a new "struct bpf_tw_sock" would likely be needed, which is OK,
> but all sock_common fields would need to be copied from bpf_sock
> to bpf_tw_sock.
> 
> I think reading a sk from a ctx should return the
> most basic type PTR_TO_SOCK_COMMON_OR_NULL (unless the running
> ctx can guarantee that it always has a fullsock).
> Currently, it is __sk_buff->sk.  Later, sock_ops->sk...etc.
> Keep one single 'struct bpf_sock' and limit fullsock field access
> at verification time.  The bpf_prog then moves down the chain
> based on what it needs.  It could be fullsock, tcp_sock...etc.
> 
> I think that will be the most flexible way to write a bpf_prog
> while also avoiding duplicate fields in different
> bpf structs in the uapi.

Ok, thanks for following up and sorry for the late reply, let's go with
sock_common then. What's the plan for moving forward with accessing the
full sk in case of a req sk? A separate helper, or baked into the newly
added bpf_sk_fullsock() one? Presumably the latter?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-07 22:21         ` Daniel Borkmann
@ 2019-02-08  5:56           ` Martin Lau
  2019-02-08  8:56             ` Daniel Borkmann
  0 siblings, 1 reply; 14+ messages in thread
From: Martin Lau @ 2019-02-08  5:56 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, Alexei Starovoitov, Kernel Team, Lawrence Brakmo

On Thu, Feb 07, 2019 at 11:21:41PM +0100, Daniel Borkmann wrote:
> On 02/07/2019 08:27 AM, Martin Lau wrote:
> [...]
> > Following up on the discussion in the iovisor conf call.
> >
> > One of the discussions was about:
> > other than tw, can __sk_buff->sk always return a
> > fullsock (PTR_TO_SOCKET_OR_NULL)?  In the request_sock case,
> > it is doable because it can be traced back to the listener sock.
> >
> > However, that goes back to the sock_common accessing question.
> > In particular, how to access the sock_common fields of the
> > request_sock itself?  Those fields in the request_sock are different
> > from its listener sock's, e.g. the skc_daddr and skc_dport.
> >
> > Also, if the sock_common fields of tw are needed, it will become weird
> > because a new "struct bpf_tw_sock" would likely be needed, which is OK,
> > but all sock_common fields would need to be copied from bpf_sock
> > to bpf_tw_sock.
> >
> > I think reading a sk from a ctx should return the
> > most basic type PTR_TO_SOCK_COMMON_OR_NULL (unless the running
> > ctx can guarantee that it always has a fullsock).
> > Currently, it is __sk_buff->sk.  Later, sock_ops->sk...etc.
> > Keep one single 'struct bpf_sock' and limit fullsock field access
> > at verification time.  The bpf_prog then moves down the chain
> > based on what it needs.  It could be fullsock, tcp_sock...etc.
> >
> > I think that will be the most flexible way to write a bpf_prog
> > while also avoiding duplicate fields in different
> > bpf structs in the uapi.
> 
> Ok, thanks for following up and sorry for the late reply, let's go with
> sock_common then. What's the plan for moving forward with accessing the
> full sk in case of a req sk? A separate helper, or baked into the newly
> added bpf_sk_fullsock() one? Presumably the latter?
I will add sk_to_full_sk() to bpf_sk_fullsock() and bpf_tcp_sock().
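
Roughly along these lines (a sketch of the helper change, not the
final patch):

BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
{
	/* follow a req_sock to its listener before the fullsock test */
	sk = sk_to_full_sk(sk);

	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
}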

Thanks,
Martin

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
  2019-02-08  5:56           ` Martin Lau
@ 2019-02-08  8:56             ` Daniel Borkmann
  0 siblings, 0 replies; 14+ messages in thread
From: Daniel Borkmann @ 2019-02-08  8:56 UTC (permalink / raw)
  To: Martin Lau; +Cc: netdev, Alexei Starovoitov, Kernel Team, Lawrence Brakmo

On 02/08/2019 06:56 AM, Martin Lau wrote:
> On Thu, Feb 07, 2019 at 11:21:41PM +0100, Daniel Borkmann wrote:
>> On 02/07/2019 08:27 AM, Martin Lau wrote:
>> [...]
>>> Following up on the discussion in the iovisor conf call.
>>>
>>> One of the discussions was about:
>>> other than tw, can __sk_buff->sk always return a
>>> fullsock (PTR_TO_SOCKET_OR_NULL)?  In the request_sock case,
>>> it is doable because it can be traced back to the listener sock.
>>>
>>> However, that goes back to the sock_common accessing question.
>>> In particular, how to access the sock_common fields of the
>>> request_sock itself?  Those fields in the request_sock are different
>>> from its listener sock's, e.g. the skc_daddr and skc_dport.
>>>
>>> Also, if the sock_common fields of tw are needed, it will become weird
>>> because a new "struct bpf_tw_sock" would likely be needed, which is OK,
>>> but all sock_common fields would need to be copied from bpf_sock
>>> to bpf_tw_sock.
>>>
>>> I think reading a sk from a ctx should return the
>>> most basic type PTR_TO_SOCK_COMMON_OR_NULL (unless the running
>>> ctx can guarantee that it always has a fullsock).
>>> Currently, it is __sk_buff->sk.  Later, sock_ops->sk...etc.
>>> Keep one single 'struct bpf_sock' and limit fullsock field access
>>> at verification time.  The bpf_prog then moves down the chain
>>> based on what it needs.  It could be fullsock, tcp_sock...etc.
>>>
>>> I think that will be the most flexible way to write a bpf_prog
>>> while also avoiding duplicate fields in different
>>> bpf structs in the uapi.
>>
>> Ok, thanks for following up and sorry for the late reply, let's go with
>> sock_common then. What's the plan for moving forward with accessing the
>> full sk in case of a req sk? A separate helper, or baked into the newly
>> added bpf_sk_fullsock() one? Presumably the latter?
> I will add sk_to_full_sk() to bpf_sk_fullsock() and bpf_tcp_sock().

Ok, sounds good, thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-02-08  8:56 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-01  7:03 [PATCH bpf-next 0/6] Add __sk_buff->sk, bpf_tcp_sock, BPF_FUNC_tcp_sock and BPF_FUNC_sk_fullsock Martin KaFai Lau
2019-02-01  7:03 ` [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper Martin KaFai Lau
2019-02-02  2:38   ` Alexei Starovoitov
2019-02-04 22:33   ` Daniel Borkmann
2019-02-05  0:50     ` Martin Lau
2019-02-07  7:27       ` Martin Lau
2019-02-07 22:21         ` Daniel Borkmann
2019-02-08  5:56           ` Martin Lau
2019-02-08  8:56             ` Daniel Borkmann
2019-02-01  7:03 ` [PATCH bpf-next 2/6] bpf: Refactor sock_ops_convert_ctx_access Martin KaFai Lau
2019-02-01  7:03 ` [PATCH bpf-next 3/6] bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock Martin KaFai Lau
2019-02-01  7:04 ` [PATCH bpf-next 4/6] bpf: Sync bpf.h to tools/ Martin KaFai Lau
2019-02-01  7:04 ` [PATCH bpf-next 5/6] bpf: Add skb->sk, bpf_sk_fullsock and bpf_tcp_sock tests to test_verifier Martin KaFai Lau
2019-02-01  7:04 ` [PATCH bpf-next 6/6] bpf: Add test_sock_fields for skb->sk and bpf_tcp_sock Martin KaFai Lau
