All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
@ 2019-03-22 19:58 Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 1/8] flow_dissector: allow access only to a subset of __sk_buff fields Stanislav Fomichev
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:58 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
skb. Because we use passed skb to lookup associated networking namespace
to find whether we have a BPF program attached or not, we always use
C-based flow dissector in this case.

The goal of this patch series is to add new networking namespace argument
to the eth_get_headlen and make BPF flow dissector programs be able to
work in the skb-less case.

The series goes like this:
1. restrict access to a limited set of __sk_buff fields from flow
   dissector programs; in eth_get_headlen context, they don't
   make sense (for skb-context, they are also not relevant for
   flow dissector)
2. switch kernel context to new xdp_buff-like structure (struct
   bpf_flow_dissector) for flow dissector program
3. minor fix for nhoff clamping; doesn't affect anything besides tests
4. convert flow dissector BPF_PROG_TEST_RUN to skb-less mode to show that
   it works
5. add new optional network namespace argument to __skb_flow_dissect and
   plumb through the callers
6. handle __skb_flow_dissect with skb == NULL
7. add new net namespace argument to eth_get_headlen and adjust all users
8. add selftest that makes sure bpf_skb_load_bytes is prohibited in
   skb-less context

v3:
* new kernel xdp_buff-like context per Alexey suggestion
* drop skb_net helper
* properly clamp flow_keys->nhoff

v2:
* moved temporary skb from stack into percpu (avoids memset of ~200 bytes
  per packet)
* tightened down access to __sk_buff fields from flow dissector programs to
  avoid touching shinfo (whitelist only relevant fields)
* addressed suggestions from Willem

Stanislav Fomichev (8):
  flow_dissector: allow access only to a subset of __sk_buff fields
  flow_dissector: switch kernel context to struct bpf_flow_dissector
  flow_dissector: fix clamping of BPF flow_keys for non-zero nhoff
  bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
  net: plumb network namespace into __skb_flow_dissect
  flow_dissector: handle no-skb use case
  net: pass net argument to the eth_get_headlen
  selftests/bpf: add flow dissector bpf_skb_load_bytes helper test

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   3 +-
 .../net/ethernet/hisilicon/hns3/hns3_enet.c   |   3 +-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   3 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |   3 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |   3 +-
 drivers/net/ethernet/intel/igc/igc_main.c     |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   2 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |   3 +-
 drivers/net/tun.c                             |   3 +-
 include/linux/etherdevice.h                   |   2 +-
 include/linux/skbuff.h                        |  23 ++-
 include/net/flow_dissector.h                  |  11 ++
 include/net/sch_generic.h                     |  11 +-
 net/bpf/test_run.c                            |  51 ++----
 net/core/filter.c                             | 151 +++++++++++++-----
 net/core/flow_dissector.c                     |  85 +++++-----
 net/ethernet/eth.c                            |   8 +-
 .../prog_tests/flow_dissector_load_bytes.c    |  48 ++++++
 22 files changed, 280 insertions(+), 145 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/flow_dissector_load_bytes.c

-- 
2.21.0.392.gf8f6787159e-goog

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 1/8] flow_dissector: allow access only to a subset of __sk_buff fields
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
@ 2019-03-22 19:58 ` Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 2/8] flow_dissector: switch kernel context to struct bpf_flow_dissector Stanislav Fomichev
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:58 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

Use whitelist instead of a blacklist and allow only a small set of
fields that might be relevant in the context of flow dissector:
* protocol
* vlan_present
* vlan_tci
* vlan_proto

This is required for the eth_get_headlen case where we have only a
chunk of data to dissect (i.e. trying to read the other skb fields
doesn't make sense).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/core/filter.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 647c63a7b25b..62a7a2f50200 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6613,14 +6613,8 @@ static bool flow_dissector_is_valid_access(int off, int size,
 					   const struct bpf_prog *prog,
 					   struct bpf_insn_access_aux *info)
 {
-	if (type == BPF_WRITE) {
-		switch (off) {
-		case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
-			break;
-		default:
-			return false;
-		}
-	}
+	if (type == BPF_WRITE)
+		return false;
 
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, data):
@@ -6632,11 +6626,12 @@ static bool flow_dissector_is_valid_access(int off, int size,
 	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
 		info->reg_type = PTR_TO_FLOW_KEYS;
 		break;
-	case bpf_ctx_range(struct __sk_buff, tc_classid):
-	case bpf_ctx_range(struct __sk_buff, data_meta):
-	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
-	case bpf_ctx_range(struct __sk_buff, tstamp):
-	case bpf_ctx_range(struct __sk_buff, wire_len):
+	case bpf_ctx_range(struct __sk_buff, protocol):
+	case bpf_ctx_range(struct __sk_buff, vlan_present):
+	case bpf_ctx_range(struct __sk_buff, vlan_tci):
+	case bpf_ctx_range(struct __sk_buff, vlan_proto):
+		break;
+	default:
 		return false;
 	}
 
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 2/8] flow_dissector: switch kernel context to struct bpf_flow_dissector
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 1/8] flow_dissector: allow access only to a subset of __sk_buff fields Stanislav Fomichev
@ 2019-03-22 19:58 ` Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 3/8] flow_dissector: fix clamping of BPF flow_keys for non-zero nhoff Stanislav Fomichev
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:58 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

struct bpf_flow_dissector has a small subset of sk_buff fields that
flow dissector BPF program is allowed to access and an optional
pointer to real skb. Real skb is used only in bpf_skb_load_bytes
helper to read non-linear data.

The real motivation for this is to be able to call flow dissector
from eth_get_headlen context where we don't have an skb and need
to dissect raw bytes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h       |   4 ++
 include/net/flow_dissector.h |  11 +++
 include/net/sch_generic.h    |  11 ++-
 net/bpf/test_run.c           |   4 --
 net/core/filter.c            | 132 ++++++++++++++++++++++++++++-------
 net/core/flow_dissector.c    |  46 ++++++------
 6 files changed, 150 insertions(+), 58 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9027a8c4219f..a62149af940b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1275,6 +1275,10 @@ static inline int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
 }
 #endif
 
+struct bpf_flow_dissector;
+bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
+		      int nhoff, int hlen);
+
 struct bpf_flow_keys;
 bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 			    const struct sk_buff *skb,
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 2b26979efb48..41baa8624ea3 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -305,4 +305,15 @@ static inline void *skb_flow_dissector_target(struct flow_dissector *flow_dissec
 	return ((char *)target_container) + flow_dissector->offset[key_id];
 }
 
+struct bpf_flow_dissector {
+	struct bpf_flow_keys	*flow_keys;
+	const struct sk_buff	*skb;
+	void			*data;
+	void			*data_end;
+	__be16			protocol;
+	__u16			vlan_tci;
+	__be16			vlan_proto;
+	__u8			vlan_present;
+};
+
 #endif
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31284c078d06..cc9ffdd44bf1 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -351,13 +351,10 @@ struct tcf_proto {
 };
 
 struct qdisc_skb_cb {
-	union {
-		struct {
-			unsigned int		pkt_len;
-			u16			slave_dev_queue_mapping;
-			u16			tc_classid;
-		};
-		struct bpf_flow_keys *flow_keys;
+	struct {
+		unsigned int		pkt_len;
+		u16			slave_dev_queue_mapping;
+		u16			tc_classid;
 	};
 #define QDISC_CB_PRIV_LEN 20
 	unsigned char		data[QDISC_CB_PRIV_LEN];
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index fab142b796ef..7444ca87abf4 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -252,7 +252,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	u32 repeat = kattr->test.repeat;
 	struct bpf_flow_keys flow_keys;
 	u64 time_start, time_spent = 0;
-	struct bpf_skb_data_end *cb;
 	u32 retval, duration;
 	struct sk_buff *skb;
 	struct sock *sk;
@@ -290,9 +289,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 				       current->nsproxy->net_ns->loopback_dev);
 	skb_reset_network_header(skb);
 
-	cb = (struct bpf_skb_data_end *)skb->cb;
-	cb->qdisc_cb.flow_keys = &flow_keys;
-
 	if (!repeat)
 		repeat = 1;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 62a7a2f50200..906ec0e67111 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1729,6 +1729,40 @@ static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_4(bpf_flow_dissector_load_bytes,
+	   const struct bpf_flow_dissector *, ctx, u32, offset,
+	   void *, to, u32, len)
+{
+	void *ptr;
+
+	if (unlikely(offset > 0xffff))
+		goto err_clear;
+
+	if (unlikely(!ctx->skb))
+		goto err_clear;
+
+	ptr = skb_header_pointer(ctx->skb, offset, len, to);
+	if (unlikely(!ptr))
+		goto err_clear;
+	if (ptr != to)
+		memcpy(to, ptr, len);
+
+	return 0;
+err_clear:
+	memset(to, 0, len);
+	return -EFAULT;
+}
+
+static const struct bpf_func_proto bpf_flow_dissector_load_bytes_proto = {
+	.func		= bpf_flow_dissector_load_bytes,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg4_type	= ARG_CONST_SIZE,
+};
+
 BPF_CALL_5(bpf_skb_load_bytes_relative, const struct sk_buff *, skb,
 	   u32, offset, void *, to, u32, len, u32, start_header)
 {
@@ -5857,7 +5891,7 @@ flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
-		return &bpf_skb_load_bytes_proto;
+		return &bpf_flow_dissector_load_bytes_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -5984,9 +6018,7 @@ static bool bpf_skb_is_valid_access(int off, int size, enum bpf_access_type type
 			return false;
 		break;
 	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
-		if (size != sizeof(__u64))
-			return false;
-		break;
+		return false;
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 		if (size != sizeof(__u64))
 			return false;
@@ -6021,7 +6053,6 @@ static bool sk_filter_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
 	case bpf_ctx_range(struct __sk_buff, data_end):
-	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
@@ -6048,7 +6079,6 @@ static bool cg_skb_is_valid_access(int off, int size,
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
-	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
 		return false;
 	case bpf_ctx_range(struct __sk_buff, data):
@@ -6094,7 +6124,6 @@ static bool lwt_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
-	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
 		return false;
@@ -6337,7 +6366,6 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	case bpf_ctx_range(struct __sk_buff, data_end):
 		info->reg_type = PTR_TO_PACKET_END;
 		break;
-	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
 	case bpf_ctx_range_till(struct __sk_buff, family, local_port):
 		return false;
 	}
@@ -6539,7 +6567,6 @@ static bool sk_skb_is_valid_access(int off, int size,
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, tc_classid):
 	case bpf_ctx_range(struct __sk_buff, data_meta):
-	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
 	case bpf_ctx_range(struct __sk_buff, tstamp):
 	case bpf_ctx_range(struct __sk_buff, wire_len):
 		return false;
@@ -6613,29 +6640,95 @@ static bool flow_dissector_is_valid_access(int off, int size,
 					   const struct bpf_prog *prog,
 					   struct bpf_insn_access_aux *info)
 {
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct __sk_buff))
+		return false;
+
 	if (type == BPF_WRITE)
 		return false;
 
 	switch (off) {
 	case bpf_ctx_range(struct __sk_buff, data):
+		if (size != size_default)
+			return false;
 		info->reg_type = PTR_TO_PACKET;
-		break;
+		return true;
 	case bpf_ctx_range(struct __sk_buff, data_end):
+		if (size != size_default)
+			return false;
 		info->reg_type = PTR_TO_PACKET_END;
-		break;
+		return true;
 	case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
+		if (size != sizeof(__u64))
+			return false;
 		info->reg_type = PTR_TO_FLOW_KEYS;
-		break;
+		return true;
 	case bpf_ctx_range(struct __sk_buff, protocol):
 	case bpf_ctx_range(struct __sk_buff, vlan_present):
 	case bpf_ctx_range(struct __sk_buff, vlan_tci):
 	case bpf_ctx_range(struct __sk_buff, vlan_proto):
-		break;
+		bpf_ctx_record_field_size(info, size_default);
+		return bpf_ctx_narrow_access_ok(off, size, size_default);
 	default:
 		return false;
 	}
+}
 
-	return bpf_skb_is_valid_access(off, size, type, prog, info);
+static u32 flow_dissector_convert_ctx_access(enum bpf_access_type type,
+					     const struct bpf_insn *si,
+					     struct bpf_insn *insn_buf,
+					     struct bpf_prog *prog,
+					     u32 *target_size)
+
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct __sk_buff, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_flow_dissector, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_flow_dissector, data));
+		break;
+
+	case offsetof(struct __sk_buff, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_flow_dissector, data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_flow_dissector, data_end));
+		break;
+
+	case offsetof(struct __sk_buff, flow_keys):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_flow_dissector, flow_keys),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_flow_dissector, flow_keys));
+		break;
+
+	case offsetof(struct __sk_buff, protocol):
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_flow_dissector, protocol, 2,
+						     target_size));
+		break;
+
+	case offsetof(struct __sk_buff, vlan_present):
+		*insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_flow_dissector, vlan_present, 1,
+						     target_size));
+		break;
+
+	case offsetof(struct __sk_buff, vlan_tci):
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_flow_dissector, vlan_tci, 2,
+						     target_size));
+		break;
+
+	case offsetof(struct __sk_buff, vlan_proto):
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->src_reg,
+				      bpf_target_off(struct bpf_flow_dissector, vlan_proto, 2,
+						     target_size));
+		break;
+	}
+
+	return insn - insn_buf;
 }
 
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
@@ -6942,15 +7035,6 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 						     skc_num, 2, target_size));
 		break;
 
-	case offsetof(struct __sk_buff, flow_keys):
-		off  = si->off;
-		off -= offsetof(struct __sk_buff, flow_keys);
-		off += offsetof(struct sk_buff, cb);
-		off += offsetof(struct qdisc_skb_cb, flow_keys);
-		*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
-				      si->src_reg, off);
-		break;
-
 	case offsetof(struct __sk_buff, tstamp):
 		BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tstamp) != 8);
 
@@ -7955,7 +8039,7 @@ const struct bpf_prog_ops sk_msg_prog_ops = {
 const struct bpf_verifier_ops flow_dissector_verifier_ops = {
 	.get_func_proto		= flow_dissector_func_proto,
 	.is_valid_access	= flow_dissector_is_valid_access,
-	.convert_ctx_access	= bpf_convert_ctx_access,
+	.convert_ctx_access	= flow_dissector_convert_ctx_access,
 };
 
 const struct bpf_prog_ops flow_dissector_prog_ops = {
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index bb1a54747d64..0dac3382e841 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -688,37 +688,37 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 			    struct flow_dissector *flow_dissector,
 			    struct bpf_flow_keys *flow_keys)
 {
-	struct bpf_skb_data_end cb_saved;
-	struct bpf_skb_data_end *cb;
-	u32 result;
-
-	/* Note that even though the const qualifier is discarded
-	 * throughout the execution of the BPF program, all changes(the
-	 * control block) are reverted after the BPF program returns.
-	 * Therefore, __skb_flow_dissect does not alter the skb.
-	 */
-
-	cb = (struct bpf_skb_data_end *)skb->cb;
+	struct bpf_flow_dissector ctx = {
+		.flow_keys = flow_keys,
+		.skb = skb,
+		.protocol = skb->protocol,
+		.vlan_tci = skb->vlan_tci,
+		.vlan_proto = skb->vlan_proto,
+		.vlan_present = skb->vlan_present,
+		.data = skb->data,
+		.data_end = skb->data + skb_headlen(skb),
+	};
+
+	return bpf_flow_dissect(prog, &ctx, skb_network_offset(skb),
+				skb_headlen(skb));
+}
 
-	/* Save Control Block */
-	memcpy(&cb_saved, cb, sizeof(cb_saved));
-	memset(cb, 0, sizeof(*cb));
+bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
+		      int nhoff, int hlen)
+{
+	struct bpf_flow_keys *flow_keys = ctx->flow_keys;
+	u32 result;
 
 	/* Pass parameters to the BPF program */
 	memset(flow_keys, 0, sizeof(*flow_keys));
-	cb->qdisc_cb.flow_keys = flow_keys;
-	flow_keys->nhoff = skb_network_offset(skb);
+	flow_keys->nhoff = nhoff;
 	flow_keys->thoff = flow_keys->nhoff;
 
-	bpf_compute_data_pointers((struct sk_buff *)skb);
-	result = BPF_PROG_RUN(prog, skb);
-
-	/* Restore state */
-	memcpy(cb, &cb_saved, sizeof(cb_saved));
+	result = BPF_PROG_RUN(prog, ctx);
 
-	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, 0, skb->len);
+	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, 0, hlen);
 	flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
-				   flow_keys->nhoff, skb->len);
+				   flow_keys->nhoff, hlen);
 
 	return result == BPF_OK;
 }
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 3/8] flow_dissector: fix clamping of BPF flow_keys for non-zero nhoff
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 1/8] flow_dissector: allow access only to a subset of __sk_buff fields Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 2/8] flow_dissector: switch kernel context to struct bpf_flow_dissector Stanislav Fomichev
@ 2019-03-22 19:58 ` Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 4/8] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:58 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

Don't allow BPF program to set flow_keys->nhoff to less than initial
value. We currently don't read the value afterwards in anything but
the tests, but it's still a good practice to return consistent
values to the test programs.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/core/flow_dissector.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0dac3382e841..3b35fc35f583 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -716,7 +716,7 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 
 	result = BPF_PROG_RUN(prog, ctx);
 
-	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, 0, hlen);
+	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen);
 	flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
 				   flow_keys->nhoff, hlen);
 
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 4/8] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2019-03-22 19:58 ` [RFC bpf-next v3 3/8] flow_dissector: fix clamping of BPF flow_keys for non-zero nhoff Stanislav Fomichev
@ 2019-03-22 19:58 ` Stanislav Fomichev
  2019-03-22 19:59 ` [RFC bpf-next v3 5/8] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:58 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

Now that we have bpf_flow_dissect which can work on raw data,
use it when doing BPF_PROG_TEST_RUN for flow dissector.

Simplifies bpf_prog_test_run_flow_dissector and allows us to
test no-skb mode.

Note, that previously, with bpf_flow_dissect_skb we used to call
eth_type_trans which pulled L2 (ETH_HLEN) header and we explicitly called
skb_reset_network_header. That means flow_keys->nhoff would be
initialized to 0 (skb_network_offset) in init_flow_keys.
Now we call bpf_flow_dissect with nhoff set to ETH_HLEN and need
to undo it once the dissection is done to preserve the existing behavior.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/bpf/test_run.c | 47 +++++++++++++++++-----------------------------
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 7444ca87abf4..ce7341c473a2 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -249,12 +249,12 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 				     union bpf_attr __user *uattr)
 {
 	u32 size = kattr->test.data_size_in;
+	struct bpf_flow_dissector ctx = {};
 	u32 repeat = kattr->test.repeat;
 	struct bpf_flow_keys flow_keys;
 	u64 time_start, time_spent = 0;
+	const struct ethhdr *eth;
 	u32 retval, duration;
-	struct sk_buff *skb;
-	struct sock *sk;
 	void *data;
 	int ret;
 	u32 i;
@@ -262,43 +262,31 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
 		return -EINVAL;
 
-	data = bpf_test_init(kattr, size, NET_SKB_PAD + NET_IP_ALIGN,
-			     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
+	if (size < ETH_HLEN)
+		return -EINVAL;
+
+	data = bpf_test_init(kattr, size, 0, 0);
 	if (IS_ERR(data))
 		return PTR_ERR(data);
 
-	sk = kzalloc(sizeof(*sk), GFP_USER);
-	if (!sk) {
-		kfree(data);
-		return -ENOMEM;
-	}
-	sock_net_set(sk, current->nsproxy->net_ns);
-	sock_init_data(NULL, sk);
-
-	skb = build_skb(data, 0);
-	if (!skb) {
-		kfree(data);
-		kfree(sk);
-		return -ENOMEM;
-	}
-	skb->sk = sk;
-
-	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
-	__skb_put(skb, size);
-	skb->protocol = eth_type_trans(skb,
-				       current->nsproxy->net_ns->loopback_dev);
-	skb_reset_network_header(skb);
+	eth = (struct ethhdr *)data;
 
 	if (!repeat)
 		repeat = 1;
 
+	ctx.flow_keys = &flow_keys;
+	ctx.protocol = eth->h_proto;
+	ctx.data = data;
+	ctx.data_end = (__u8 *)data + size;
+
 	rcu_read_lock();
 	preempt_disable();
 	time_start = ktime_get_ns();
 	for (i = 0; i < repeat; i++) {
-		retval = __skb_flow_bpf_dissect(prog, skb,
-						&flow_keys_dissector,
-						&flow_keys);
+		retval = bpf_flow_dissect(prog, &ctx, ETH_HLEN, size);
+
+		flow_keys.nhoff -= ETH_HLEN;
+		flow_keys.thoff -= ETH_HLEN;
 
 		if (signal_pending(current)) {
 			preempt_enable();
@@ -331,7 +319,6 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 			      retval, duration);
 
 out:
-	kfree_skb(skb);
-	kfree(sk);
+	kfree(data);
 	return ret;
 }
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 5/8] net: plumb network namespace into __skb_flow_dissect
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2019-03-22 19:58 ` [RFC bpf-next v3 4/8] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
@ 2019-03-22 19:59 ` Stanislav Fomichev
  2019-03-22 19:59 ` [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case Stanislav Fomichev
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:59 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

This new argument will be used in the next patches for the
eth_get_headlen use case. eth_get_headlen calls flow dissector
with only data (without skb) so there is currently no way to
pull attached BPF flow dissector program. With this new argument,
we can amend the callers to explicitly pass network namespace
so we can use attached BPF program.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h    | 19 +++++++++++--------
 net/core/flow_dissector.c | 27 +++++++++++++++++----------
 net/ethernet/eth.c        |  5 +++--
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a62149af940b..f2daefc5ba20 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1284,7 +1284,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 			    const struct sk_buff *skb,
 			    struct flow_dissector *flow_dissector,
 			    struct bpf_flow_keys *flow_keys);
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(const struct net *net,
+			const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
 			void *data, __be16 proto, int nhoff, int hlen,
@@ -1294,8 +1295,8 @@ static inline bool skb_flow_dissect(const struct sk_buff *skb,
 				    struct flow_dissector *flow_dissector,
 				    void *target_container, unsigned int flags)
 {
-	return __skb_flow_dissect(skb, flow_dissector, target_container,
-				  NULL, 0, 0, 0, flags);
+	return __skb_flow_dissect(NULL, skb, flow_dissector,
+				  target_container, NULL, 0, 0, 0, flags);
 }
 
 static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb,
@@ -1303,18 +1304,19 @@ static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb,
 					      unsigned int flags)
 {
 	memset(flow, 0, sizeof(*flow));
-	return __skb_flow_dissect(skb, &flow_keys_dissector, flow,
-				  NULL, 0, 0, 0, flags);
+	return __skb_flow_dissect(NULL, skb, &flow_keys_dissector,
+				  flow, NULL, 0, 0, 0, flags);
 }
 
 static inline bool
-skb_flow_dissect_flow_keys_basic(const struct sk_buff *skb,
+skb_flow_dissect_flow_keys_basic(const struct net *net,
+				 const struct sk_buff *skb,
 				 struct flow_keys_basic *flow, void *data,
 				 __be16 proto, int nhoff, int hlen,
 				 unsigned int flags)
 {
 	memset(flow, 0, sizeof(*flow));
-	return __skb_flow_dissect(skb, &flow_keys_basic_dissector, flow,
+	return __skb_flow_dissect(net, skb, &flow_keys_basic_dissector, flow,
 				  data, proto, nhoff, hlen, flags);
 }
 
@@ -2494,7 +2496,8 @@ static inline void skb_probe_transport_header(struct sk_buff *skb)
 	if (skb_transport_header_was_set(skb))
 		return;
 
-	if (skb_flow_dissect_flow_keys_basic(skb, &keys, NULL, 0, 0, 0, 0))
+	if (skb_flow_dissect_flow_keys_basic(NULL, skb, &keys,
+					     NULL, 0, 0, 0, 0))
 		skb_set_transport_header(skb, keys.control.thoff);
 }
 
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 3b35fc35f583..1e2e4a9330af 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -725,6 +725,7 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 
 /**
  * __skb_flow_dissect - extract the flow_keys struct and return it
+ * @net: associated network namespace, derived from @skb if NULL
  * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
  * @flow_dissector: list of keys to dissect
  * @target_container: target structure to put dissected values into
@@ -739,7 +740,8 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
  *
  * Caller must take care of zeroing target container memory.
  */
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(const struct net *net,
+			const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
 			void *data, __be16 proto, int nhoff, int hlen,
@@ -798,13 +800,17 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		struct bpf_prog *attached = NULL;
 
 		rcu_read_lock();
+		if (!net) {
+			if (skb->dev)
+				net = dev_net(skb->dev);
+			else if (skb->sk)
+				net = sock_net(skb->sk);
+			else
+				WARN_ON_ONCE(1);
+		}
 
-		if (skb->dev)
-			attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
-		else if (skb->sk)
-			attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
-		else
-			WARN_ON_ONCE(1);
+		if (net)
+			attached = rcu_dereference(net->flow_dissector_prog);
 
 		if (attached) {
 			ret = __skb_flow_bpf_dissect(attached, skb,
@@ -1406,8 +1412,8 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
 	__flow_hash_secret_init();
 
 	memset(&keys, 0, sizeof(keys));
-	__skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys,
-			   NULL, 0, 0, 0,
+	__skb_flow_dissect(NULL, skb, &flow_keys_dissector_symmetric,
+			   &keys, NULL, 0, 0, 0,
 			   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
 
 	return __flow_hash_from_keys(&keys, hashrnd);
@@ -1508,7 +1514,8 @@ u32 skb_get_poff(const struct sk_buff *skb)
 {
 	struct flow_keys_basic keys;
 
-	if (!skb_flow_dissect_flow_keys_basic(skb, &keys, NULL, 0, 0, 0, 0))
+	if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys,
+					      NULL, 0, 0, 0, 0))
 		return 0;
 
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index f7a3d7a171c7..1e439549c419 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -136,8 +136,9 @@ u32 eth_get_headlen(void *data, unsigned int len)
 		return len;
 
 	/* parse any remaining L2/L3 headers, check for L4 */
-	if (!skb_flow_dissect_flow_keys_basic(NULL, &keys, data, eth->h_proto,
-					      sizeof(*eth), len, flags))
+	if (!skb_flow_dissect_flow_keys_basic(NULL, NULL, &keys, data,
+					      eth->h_proto, sizeof(*eth),
+					      len, flags))
 		return max_t(u32, keys.control.thoff, sizeof(*eth));
 
 	/* parse for any L4 headers */
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2019-03-22 19:59 ` [RFC bpf-next v3 5/8] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
@ 2019-03-22 19:59 ` Stanislav Fomichev
  2019-03-23  1:00   ` Alexei Starovoitov
  2019-03-22 19:59   ` [Intel-wired-lan] " Stanislav Fomichev
  2019-03-22 19:59 ` [RFC bpf-next v3 8/8] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test Stanislav Fomichev
  7 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:59 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

When called without skb, gather all required data from the
__skb_flow_dissect's arguments and use recently introduces
no-skb mode of bpf flow dissector.

Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/core/flow_dissector.c | 54 +++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 31 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 1e2e4a9330af..ec8ebafe333f 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -683,26 +683,6 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 	}
 }
 
-bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
-			    const struct sk_buff *skb,
-			    struct flow_dissector *flow_dissector,
-			    struct bpf_flow_keys *flow_keys)
-{
-	struct bpf_flow_dissector ctx = {
-		.flow_keys = flow_keys,
-		.skb = skb,
-		.protocol = skb->protocol,
-		.vlan_tci = skb->vlan_tci,
-		.vlan_proto = skb->vlan_proto,
-		.vlan_present = skb->vlan_present,
-		.data = skb->data,
-		.data_end = skb->data + skb_headlen(skb),
-	};
-
-	return bpf_flow_dissect(prog, &ctx, skb_network_offset(skb),
-				skb_headlen(skb));
-}
-
 bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
 		      int nhoff, int hlen)
 {
@@ -754,6 +734,7 @@ bool __skb_flow_dissect(const struct net *net,
 	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
+	struct bpf_prog *attached = NULL;
 	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
 	int num_hdrs = 0;
@@ -796,26 +777,37 @@ bool __skb_flow_dissect(const struct net *net,
 					      target_container);
 
 	if (skb) {
-		struct bpf_flow_keys flow_keys;
-		struct bpf_prog *attached = NULL;
-
-		rcu_read_lock();
 		if (!net) {
 			if (skb->dev)
 				net = dev_net(skb->dev);
 			else if (skb->sk)
 				net = sock_net(skb->sk);
-			else
-				WARN_ON_ONCE(1);
 		}
+	}
 
-		if (net)
-			attached = rcu_dereference(net->flow_dissector_prog);
+	WARN_ON_ONCE(!net);
+	if (net) {
+		rcu_read_lock();
+		attached = rcu_dereference(net->flow_dissector_prog);
 
 		if (attached) {
-			ret = __skb_flow_bpf_dissect(attached, skb,
-						     flow_dissector,
-						     &flow_keys);
+			struct bpf_flow_keys flow_keys;
+			struct bpf_flow_dissector ctx = {
+				.flow_keys = &flow_keys,
+				.data = data,
+				.data_end = data + hlen,
+				.protocol = proto,
+			};
+
+			if (skb) {
+				ctx.skb = skb;
+				ctx.protocol = skb->protocol;
+				ctx.vlan_tci = skb->vlan_tci;
+				ctx.vlan_proto = skb->vlan_proto;
+				ctx.vlan_present = skb->vlan_present;
+			}
+
+			ret = bpf_flow_dissect(attached, &ctx, nhoff, hlen);
 			__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
 						 target_container);
 			rcu_read_unlock();
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 7/8] net: pass net argument to the eth_get_headlen
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
@ 2019-03-22 19:59   ` Stanislav Fomichev
  2019-03-22 19:58 ` [RFC bpf-next v3 2/8] flow_dissector: switch kernel context to struct bpf_flow_dissector Stanislav Fomichev
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:59 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev, Maxim Krasnyansky, Saeed Mahameed,
	Jeff Kirsher, intel-wired-lan, Yisen Zhuang, Salil Mehta,
	Michael Chan

Update all users eth_get_headlen to pass network namespace
and pass it down to the flow dissector. This commit is a noop
until administrator inserts BPF flow dissector program.

Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c     | 3 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c   | 3 ++-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c     | 2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       | 3 ++-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c       | 3 ++-
 drivers/net/ethernet/intel/ice/ice_txrx.c         | 2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         | 3 ++-
 drivers/net/ethernet/intel/igc/igc_main.c         | 3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   | 3 ++-
 drivers/net/tun.c                                 | 3 ++-
 include/linux/etherdevice.h                       | 2 +-
 net/ethernet/eth.c                                | 5 +++--
 15 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 0bb9d7b3a2b6..664e6392f71d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -899,7 +899,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 			     DMA_ATTR_WEAK_ORDERING);
 
 	if (unlikely(!payload))
-		payload = eth_get_headlen(data_ptr, len);
+		payload = eth_get_headlen(dev_net(bp->dev), data_ptr, len);
 
 	skb = napi_alloc_skb(&rxr->bnapi->napi, payload);
 	if (!skb) {
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index e37a0ca0db89..9d85742a7794 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -603,7 +603,8 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	} else {
 		ring->stats.seg_pkt_cnt++;
 
-		pull_len = eth_get_headlen(va, HNS_RX_HEAD_SIZE);
+		pull_len = eth_get_headlen(dev_net(ndev), va,
+					   HNS_RX_HEAD_SIZE);
 		memcpy(__skb_put(skb, pull_len), va,
 		       ALIGN(pull_len, sizeof(long)));
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 1c1f17ec6be2..265bb8e47d48 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2438,7 +2438,8 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, int length,
 	ring->stats.seg_pkt_cnt++;
 	u64_stats_update_end(&ring->syncp);
 
-	ring->pull_len = eth_get_headlen(va, HNS3_RX_HEAD_SIZE);
+	ring->pull_len = eth_get_headlen(dev_net(netdev), va,
+					 HNS3_RX_HEAD_SIZE);
 	__skb_put(skb, ring->pull_len);
 	hns3_nic_reuse_page(skb, ring->frag_num++, ring, ring->pull_len,
 			    desc_cb);
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 5a0419421511..790787a26954 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -278,7 +278,7 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer *rx_buffer,
 	/* we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, FM10K_RX_HDR_LEN);
+	pull_len = eth_get_headlen(dev_net(skb->dev), va, FM10K_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6c97667d20ef..9cd0d18a93d1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2035,7 +2035,8 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > I40E_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, I40E_RX_HDR_SIZE);
+		headlen = eth_get_headlen(dev_net(skb->dev), xdp->data,
+					  I40E_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), xdp->data,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 9b4d7cec2e18..377fd46deacc 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1315,7 +1315,8 @@ static struct sk_buff *iavf_construct_skb(struct iavf_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IAVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(va, IAVF_RX_HDR_SIZE);
+		headlen = eth_get_headlen(dev_net(skb->dev), va,
+					  IAVF_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index fad308c936b2..0854df62646d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -704,7 +704,7 @@ static void ice_pull_tail(struct sk_buff *skb)
 	/* we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, ICE_RX_HDR_SIZE);
+	pull_len = eth_get_headlen(dev_net(skb->dev), va, ICE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index bea7175d171b..acaccbc86a84 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8051,7 +8051,8 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IGB_RX_HDR_LEN)
-		headlen = eth_get_headlen(va, IGB_RX_HDR_LEN);
+		headlen = eth_get_headlen(dev_net(skb->dev), va,
+					  IGB_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index a883b3f357e7..5fe2f95f16ab 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -1199,7 +1199,8 @@ static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IGC_RX_HDR_LEN)
-		headlen = eth_get_headlen(va, IGC_RX_HDR_LEN);
+		headlen = eth_get_headlen(dev_net(skb->dev), va,
+					  IGC_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 16c728984164..b9cfc9241cac 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1800,7 +1800,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
 	 * we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);
+	pull_len = eth_get_headlen(dev_net(skb->dev), va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 49e23afa05a2..252fe0de6b56 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -895,7 +895,8 @@ struct sk_buff *ixgbevf_construct_skb(struct ixgbevf_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IXGBEVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, IXGBEVF_RX_HDR_SIZE);
+		headlen = eth_get_headlen(dev_net(skb->dev), xdp->data,
+					  IXGBEVF_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), xdp->data,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index ce1406bb1512..6c538b7c0e25 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -162,7 +162,8 @@ static inline u16 mlx5e_calc_min_inline(enum mlx5_inline_modes mode,
 	case MLX5_INLINE_MODE_NONE:
 		return 0;
 	case MLX5_INLINE_MODE_TCP_UDP:
-		hlen = eth_get_headlen(skb->data, skb_headlen(skb));
+		hlen = eth_get_headlen(dev_net(skb->dev), skb->data,
+				       skb_headlen(skb));
 		if (hlen == ETH_HLEN && !skb_vlan_tag_present(skb))
 			hlen += VLAN_HLEN;
 		break;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ef3b65514976..b020464faf5e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1965,7 +1965,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 
 	if (frags) {
 		/* Exercise flow dissector code path. */
-		u32 headlen = eth_get_headlen(skb->data, skb_headlen(skb));
+		u32 headlen = eth_get_headlen(dev_net(tun->dev), skb->data,
+					      skb_headlen(skb));
 
 		if (unlikely(headlen > skb_headlen(skb))) {
 			this_cpu_inc(tun->pcpu_stats->rx_dropped);
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index e2f3b21cd72a..71a441ffab3f 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -33,7 +33,7 @@ struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
 unsigned char *arch_get_platform_mac_address(void);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf);
-u32 eth_get_headlen(void *data, unsigned int max_len);
+u32 eth_get_headlen(const struct net *net, void *data, unsigned int max_len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;
 
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 1e439549c419..0202e72e20a4 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -119,13 +119,14 @@ EXPORT_SYMBOL(eth_header);
 
 /**
  * eth_get_headlen - determine the length of header for an ethernet frame
+ * @net: pointer to device network namespace
  * @data: pointer to start of frame
  * @len: total length of frame
  *
  * Make a best effort attempt to pull the length for all of the headers for
  * a given frame in a linear buffer.
  */
-u32 eth_get_headlen(void *data, unsigned int len)
+u32 eth_get_headlen(const struct net *net, void *data, unsigned int len)
 {
 	const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
 	const struct ethhdr *eth = (const struct ethhdr *)data;
@@ -136,7 +137,7 @@ u32 eth_get_headlen(void *data, unsigned int len)
 		return len;
 
 	/* parse any remaining L2/L3 headers, check for L4 */
-	if (!skb_flow_dissect_flow_keys_basic(NULL, NULL, &keys, data,
+	if (!skb_flow_dissect_flow_keys_basic(net, NULL, &keys, data,
 					      eth->h_proto, sizeof(*eth),
 					      len, flags))
 		return max_t(u32, keys.control.thoff, sizeof(*eth));
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Intel-wired-lan] [RFC bpf-next v3 7/8] net: pass net argument to the eth_get_headlen
@ 2019-03-22 19:59   ` Stanislav Fomichev
  0 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:59 UTC (permalink / raw)
  To: intel-wired-lan

Update all users eth_get_headlen to pass network namespace
and pass it down to the flow dissector. This commit is a noop
until administrator inserts BPF flow dissector program.

Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan at lists.osuosl.org
Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
Cc: Salil Mehta <salil.mehta@huawei.com>
Cc: Michael Chan <michael.chan@broadcom.com>

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c     | 3 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c   | 3 ++-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c     | 2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       | 3 ++-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c       | 3 ++-
 drivers/net/ethernet/intel/ice/ice_txrx.c         | 2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         | 3 ++-
 drivers/net/ethernet/intel/igc/igc_main.c         | 3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   | 3 ++-
 drivers/net/tun.c                                 | 3 ++-
 include/linux/etherdevice.h                       | 2 +-
 net/ethernet/eth.c                                | 5 +++--
 15 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 0bb9d7b3a2b6..664e6392f71d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -899,7 +899,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 			     DMA_ATTR_WEAK_ORDERING);
 
 	if (unlikely(!payload))
-		payload = eth_get_headlen(data_ptr, len);
+		payload = eth_get_headlen(dev_net(bp->dev), data_ptr, len);
 
 	skb = napi_alloc_skb(&rxr->bnapi->napi, payload);
 	if (!skb) {
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index e37a0ca0db89..9d85742a7794 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -603,7 +603,8 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	} else {
 		ring->stats.seg_pkt_cnt++;
 
-		pull_len = eth_get_headlen(va, HNS_RX_HEAD_SIZE);
+		pull_len = eth_get_headlen(dev_net(ndev), va,
+					   HNS_RX_HEAD_SIZE);
 		memcpy(__skb_put(skb, pull_len), va,
 		       ALIGN(pull_len, sizeof(long)));
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 1c1f17ec6be2..265bb8e47d48 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2438,7 +2438,8 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, int length,
 	ring->stats.seg_pkt_cnt++;
 	u64_stats_update_end(&ring->syncp);
 
-	ring->pull_len = eth_get_headlen(va, HNS3_RX_HEAD_SIZE);
+	ring->pull_len = eth_get_headlen(dev_net(netdev), va,
+					 HNS3_RX_HEAD_SIZE);
 	__skb_put(skb, ring->pull_len);
 	hns3_nic_reuse_page(skb, ring->frag_num++, ring, ring->pull_len,
 			    desc_cb);
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 5a0419421511..790787a26954 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -278,7 +278,7 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer *rx_buffer,
 	/* we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, FM10K_RX_HDR_LEN);
+	pull_len = eth_get_headlen(dev_net(skb->dev), va, FM10K_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 6c97667d20ef..9cd0d18a93d1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2035,7 +2035,8 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > I40E_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, I40E_RX_HDR_SIZE);
+		headlen = eth_get_headlen(dev_net(skb->dev), xdp->data,
+					  I40E_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), xdp->data,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 9b4d7cec2e18..377fd46deacc 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1315,7 +1315,8 @@ static struct sk_buff *iavf_construct_skb(struct iavf_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IAVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(va, IAVF_RX_HDR_SIZE);
+		headlen = eth_get_headlen(dev_net(skb->dev), va,
+					  IAVF_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index fad308c936b2..0854df62646d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -704,7 +704,7 @@ static void ice_pull_tail(struct sk_buff *skb)
 	/* we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, ICE_RX_HDR_SIZE);
+	pull_len = eth_get_headlen(dev_net(skb->dev), va, ICE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index bea7175d171b..acaccbc86a84 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8051,7 +8051,8 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IGB_RX_HDR_LEN)
-		headlen = eth_get_headlen(va, IGB_RX_HDR_LEN);
+		headlen = eth_get_headlen(dev_net(skb->dev), va,
+					  IGB_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index a883b3f357e7..5fe2f95f16ab 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -1199,7 +1199,8 @@ static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IGC_RX_HDR_LEN)
-		headlen = eth_get_headlen(va, IGC_RX_HDR_LEN);
+		headlen = eth_get_headlen(dev_net(skb->dev), va,
+					  IGC_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 16c728984164..b9cfc9241cac 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1800,7 +1800,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
 	 * we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);
+	pull_len = eth_get_headlen(dev_net(skb->dev), va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 49e23afa05a2..252fe0de6b56 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -895,7 +895,8 @@ struct sk_buff *ixgbevf_construct_skb(struct ixgbevf_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IXGBEVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, IXGBEVF_RX_HDR_SIZE);
+		headlen = eth_get_headlen(dev_net(skb->dev), xdp->data,
+					  IXGBEVF_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), xdp->data,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index ce1406bb1512..6c538b7c0e25 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -162,7 +162,8 @@ static inline u16 mlx5e_calc_min_inline(enum mlx5_inline_modes mode,
 	case MLX5_INLINE_MODE_NONE:
 		return 0;
 	case MLX5_INLINE_MODE_TCP_UDP:
-		hlen = eth_get_headlen(skb->data, skb_headlen(skb));
+		hlen = eth_get_headlen(dev_net(skb->dev), skb->data,
+				       skb_headlen(skb));
 		if (hlen == ETH_HLEN && !skb_vlan_tag_present(skb))
 			hlen += VLAN_HLEN;
 		break;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ef3b65514976..b020464faf5e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1965,7 +1965,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 
 	if (frags) {
 		/* Exercise flow dissector code path. */
-		u32 headlen = eth_get_headlen(skb->data, skb_headlen(skb));
+		u32 headlen = eth_get_headlen(dev_net(tun->dev), skb->data,
+					      skb_headlen(skb));
 
 		if (unlikely(headlen > skb_headlen(skb))) {
 			this_cpu_inc(tun->pcpu_stats->rx_dropped);
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index e2f3b21cd72a..71a441ffab3f 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -33,7 +33,7 @@ struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
 unsigned char *arch_get_platform_mac_address(void);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf);
-u32 eth_get_headlen(void *data, unsigned int max_len);
+u32 eth_get_headlen(const struct net *net, void *data, unsigned int max_len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;
 
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 1e439549c419..0202e72e20a4 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -119,13 +119,14 @@ EXPORT_SYMBOL(eth_header);
 
 /**
  * eth_get_headlen - determine the length of header for an ethernet frame
+ * @net: pointer to device network namespace
  * @data: pointer to start of frame
  * @len: total length of frame
  *
  * Make a best effort attempt to pull the length for all of the headers for
  * a given frame in a linear buffer.
  */
-u32 eth_get_headlen(void *data, unsigned int len)
+u32 eth_get_headlen(const struct net *net, void *data, unsigned int len)
 {
 	const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
 	const struct ethhdr *eth = (const struct ethhdr *)data;
@@ -136,7 +137,7 @@ u32 eth_get_headlen(void *data, unsigned int len)
 		return len;
 
 	/* parse any remaining L2/L3 headers, check for L4 */
-	if (!skb_flow_dissect_flow_keys_basic(NULL, NULL, &keys, data,
+	if (!skb_flow_dissect_flow_keys_basic(net, NULL, &keys, data,
 					      eth->h_proto, sizeof(*eth),
 					      len, flags))
 		return max_t(u32, keys.control.thoff, sizeof(*eth));
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC bpf-next v3 8/8] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test
  2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2019-03-22 19:59   ` [Intel-wired-lan] " Stanislav Fomichev
@ 2019-03-22 19:59 ` Stanislav Fomichev
  7 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-22 19:59 UTC (permalink / raw)
  To: netdev, bpf
  Cc: davem, ast, daniel, simon.horman, willemb, peterpenkov96,
	Stanislav Fomichev

When flow dissector is called without skb, we want to make sure
bpf_skb_load_bytes invocations return error. Add small test which tries
to read single byte from a packet.

bpf_skb_load_bytes should always fail under BPF_PROG_TEST_RUN because
it was converted to the skb-less mode.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../prog_tests/flow_dissector_load_bytes.c    | 48 +++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/flow_dissector_load_bytes.c

diff --git a/tools/testing/selftests/bpf/prog_tests/flow_dissector_load_bytes.c b/tools/testing/selftests/bpf/prog_tests/flow_dissector_load_bytes.c
new file mode 100644
index 000000000000..a73065bcf9e9
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/flow_dissector_load_bytes.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+
+void test_flow_dissector_load_bytes(void)
+{
+	struct bpf_flow_keys flow_keys;
+	__u32 duration = 0, retval, size;
+	struct bpf_insn prog[] = {
+		// BPF_REG_1 - 1st argument: context
+		// BPF_REG_2 - 2nd argument: offset, start at first byte
+		BPF_MOV64_IMM(BPF_REG_2, 0),
+		// BPF_REG_3 - 3rd argument: destination, reserve byte on stack
+		BPF_ALU64_REG(BPF_MOV, BPF_REG_3, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -1),
+		// BPF_REG_4 - 4th argument: copy one byte
+		BPF_MOV64_IMM(BPF_REG_4, 1),
+		// bpf_skb_load_bytes(ctx, sizeof(pkt_v4), ptr, 1)
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+			     BPF_FUNC_skb_load_bytes),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+		// if (ret == 0) return BPF_DROP (2)
+		BPF_MOV64_IMM(BPF_REG_0, BPF_DROP),
+		BPF_EXIT_INSN(),
+		// if (ret != 0) return BPF_OK (0)
+		BPF_MOV64_IMM(BPF_REG_0, BPF_OK),
+		BPF_EXIT_INSN(),
+	};
+	int fd, err;
+
+	/* make sure bpf_skb_load_bytes is not allowed from skb-less context
+	 */
+	fd = bpf_load_program(BPF_PROG_TYPE_FLOW_DISSECTOR, prog,
+			      ARRAY_SIZE(prog), "GPL", 0, NULL, 0);
+	CHECK(fd < 0,
+	      "flow_dissector-bpf_skb_load_bytes-load",
+	      "fd %d errno %d\n",
+	      fd, errno);
+
+	err = bpf_prog_test_run(fd, 1, &pkt_v4, sizeof(pkt_v4),
+				&flow_keys, &size, &retval, &duration);
+	CHECK(size != sizeof(flow_keys) || err || retval != 1,
+	      "flow_dissector-bpf_skb_load_bytes",
+	      "err %d errno %d retval %d duration %d size %u/%zu\n",
+	      err, errno, retval, duration, size, sizeof(flow_keys));
+
+	if (fd >= -1)
+		close(fd);
+}
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-22 19:59 ` [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case Stanislav Fomichev
@ 2019-03-23  1:00   ` Alexei Starovoitov
  2019-03-23  1:19     ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-23  1:00 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, bpf, davem, ast, daniel, simon.horman, willemb, peterpenkov96

On Fri, Mar 22, 2019 at 12:59:01PM -0700, Stanislav Fomichev wrote:
> When called without skb, gather all required data from the
> __skb_flow_dissect's arguments and use recently introduces
> no-skb mode of bpf flow dissector.
> 
> Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> +			struct bpf_flow_keys flow_keys;
> +			struct bpf_flow_dissector ctx = {
> +				.flow_keys = &flow_keys,
> +				.data = data,
> +				.data_end = data + hlen,
> +				.protocol = proto,
> +			};
> +
> +			if (skb) {
> +				ctx.skb = skb;
> +				ctx.protocol = skb->protocol;
> +				ctx.vlan_tci = skb->vlan_tci;
> +				ctx.vlan_proto = skb->vlan_proto;
> +				ctx.vlan_present = skb->vlan_present;
> +			}

are you suggesting to have vlan* fields that only work properly for skb case?
It means that progs/bpf_flow.c would not work as-is for eth_get_headlen.
And to have unified program that works in both cases the program would need to rely
on above internal implementation detail, like checking that ctx->protocol == 0 ?
imo that is worse than having two different flow dissector program types.

May be remove protocol and vlan* from ctx ?
Then the only thing program can do is look at the packet data which is
eth_get_headlen use case. For skb case the existence of vlan can be retrofitted into
dissector results by the kernel after the program finished.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-23  1:00   ` Alexei Starovoitov
@ 2019-03-23  1:19     ` Stanislav Fomichev
  2019-03-23  1:41       ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-23  1:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel,
	simon.horman, willemb, peterpenkov96

On 03/22, Alexei Starovoitov wrote:
> On Fri, Mar 22, 2019 at 12:59:01PM -0700, Stanislav Fomichev wrote:
> > When called without skb, gather all required data from the
> > __skb_flow_dissect's arguments and use recently introduces
> > no-skb mode of bpf flow dissector.
> > 
> > Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > +			struct bpf_flow_keys flow_keys;
> > +			struct bpf_flow_dissector ctx = {
> > +				.flow_keys = &flow_keys,
> > +				.data = data,
> > +				.data_end = data + hlen,
> > +				.protocol = proto,
> > +			};
> > +
> > +			if (skb) {
> > +				ctx.skb = skb;
> > +				ctx.protocol = skb->protocol;
> > +				ctx.vlan_tci = skb->vlan_tci;
> > +				ctx.vlan_proto = skb->vlan_proto;
> > +				ctx.vlan_present = skb->vlan_present;
> > +			}
> 
> are you suggesting to have vlan* fields that only work properly for skb case?
> It means that progs/bpf_flow.c would not work as-is for eth_get_headlen.
> And to have unified program that works in both cases the program would need to rely
> on above internal implementation detail, like checking that ctx->protocol == 0 ?
> imo that is worse than having two different flow dissector program types.
Yeah. The reference implementation (bpf_flow.c) should handle it though.
In parse_eth_proto we handle ETH_P_8021Q/ETH_P_8021AD, so I'm not sure
why we even had that skb->vlan_present check in the first place.
[But, again, who knows what's out there besides bpf_flow.c]

In general, this RFC's approach is to have a "best effort" vlan
detection, bpf program would still have to handle it.
Agree, that it's confusing.

> May be remove protocol and vlan* from ctx ?
> Then the only thing program can do is look at the packet data which is
> eth_get_headlen use case. For skb case the existence of vlan can be retrofitted into
> dissector results by the kernel after the program finished.
Are we ok with breaking api in this case? I'm all in on removing this
extra information. We can always put it back if somebody complains (and
manually parse in eth_get_headlen case).

Regarding dissector results: that's currently not implemented, we don't
export vlan information.

We can still have protocol, because in both skb/skb-less cases we have
it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-23  1:19     ` Stanislav Fomichev
@ 2019-03-23  1:41       ` Alexei Starovoitov
  2019-03-23 16:05         ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-23  1:41 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel,
	simon.horman, willemb, peterpenkov96

On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> Are we ok with breaking api in this case? I'm all in on removing this
> extra information. We can always put it back if somebody complains (and
> manually parse in eth_get_headlen case).

Fine. That seems to be the only way forward to clean it all up.
Could you submit patch 1 to bpf tree disallowing vlan fields?
Patch 3 looks like candidate as well?

> We can still have protocol, because in both skb/skb-less cases we have
> it.

proto can work in both cases, but is it needed ? Does program benefit from it?
The kernel side burns extra bytes by copying it and extra branches to handle it.
May be drop it as well?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-23  1:41       ` Alexei Starovoitov
@ 2019-03-23 16:05         ` Stanislav Fomichev
  2019-03-26  0:35           ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-23 16:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel,
	simon.horman, willemb, peterpenkov96

On 03/22, Alexei Starovoitov wrote:
> On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > Are we ok with breaking api in this case? I'm all in on removing this
> > extra information. We can always put it back if somebody complains (and
> > manually parse in eth_get_headlen case).
> 
> Fine. That seems to be the only way forward to clean it all up.
> Could you submit patch 1 to bpf tree disallowing vlan fields?
> Patch 3 looks like candidate as well?
SGTM, will do. Let me also spend some time and do a simple test for
the vlan case, to make sure I didn't miss something important.
One question here though: would I need to wait for bpf and bpf-next
to re-merge to continues the series? Or we can cherry-pick those
patches to bpf-next as well (and git will work it out during the
merge)?

> > We can still have protocol, because in both skb/skb-less cases we have
> > it.
> 
> proto can work in both cases, but is it needed ? Does program benefit from it?
> The kernel side burns extra bytes by copying it and extra branches to handle it.
> May be drop it as well?
I feel like the program benefits from it, there is no need to go back and
re-parse that (and in the skb case, this data is already pulled). I was
also thinking about re-purposing flow_keys->n_proto for that (instead
of skb->protocol), so it functions as input and output, maybe that's a
more clear way to do it.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-23 16:05         ` Stanislav Fomichev
@ 2019-03-26  0:35           ` Alexei Starovoitov
  2019-03-26 16:45             ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-26  0:35 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel,
	simon.horman, willemb, peterpenkov96

On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> On 03/22, Alexei Starovoitov wrote:
> > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > Are we ok with breaking api in this case? I'm all in on removing this
> > > extra information. We can always put it back if somebody complains (and
> > > manually parse in eth_get_headlen case).
> > 
> > Fine. That seems to be the only way forward to clean it all up.
> > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > Patch 3 looks like candidate as well?
> SGTM, will do. Let me also spend some time and do a simple test for
> the vlan case, to make sure I didn't miss something important.
> One question here though: would I need to wait for bpf and bpf-next
> to re-merge to continues the series? Or we can cherry-pick those
> patches to bpf-next as well (and git will work it out during the
> merge)?
> 
> > > We can still have protocol, because in both skb/skb-less cases we have
> > > it.
> > 
> > proto can work in both cases, but is it needed ? Does program benefit from it?
> > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > May be drop it as well?
> I feel like the program benefits from it, there is no need to go back and
> re-parse that (and in the skb case, this data is already pulled). I was
> also thinking about re-purposing flow_keys->n_proto for that (instead
> of skb->protocol), so it functions as input and output, maybe that's a
> more clear way to do it.

Are you saying that skb-less and skb flow dissector progs are looking
at different positions into the packet ?
In case of with-skb it's already after eth header was pulled?
In such case skb-less should be different program type or
both should point at the same point.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26  0:35           ` Alexei Starovoitov
@ 2019-03-26 16:45             ` Stanislav Fomichev
  2019-03-26 17:48               ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-26 16:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel,
	simon.horman, willemb, peterpenkov96

On 03/25, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> > On 03/22, Alexei Starovoitov wrote:
> > > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > > Are we ok with breaking api in this case? I'm all in on removing this
> > > > extra information. We can always put it back if somebody complains (and
> > > > manually parse in eth_get_headlen case).
> > > 
> > > Fine. That seems to be the only way forward to clean it all up.
> > > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > > Patch 3 looks like candidate as well?
> > SGTM, will do. Let me also spend some time and do a simple test for
> > the vlan case, to make sure I didn't miss something important.
> > One question here though: would I need to wait for bpf and bpf-next
> > to re-merge to continues the series? Or we can cherry-pick those
> > patches to bpf-next as well (and git will work it out during the
> > merge)?
> > 
> > > > We can still have protocol, because in both skb/skb-less cases we have
> > > > it.
> > > 
> > > proto can work in both cases, but is it needed ? Does program benefit from it?
> > > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > > May be drop it as well?
> > I feel like the program benefits from it, there is no need to go back and
> > re-parse that (and in the skb case, this data is already pulled). I was
> > also thinking about re-purposing flow_keys->n_proto for that (instead
> > of skb->protocol), so it functions as input and output, maybe that's a
> > more clear way to do it.
> 
> Are you saying that skb-less and skb flow dissector progs are looking
> at different positions into the packet ?
No, sorry for confusion, they are both called to parse (optional) L2-vlan
and L3+ headers. However, with-skb case can be called with l2-vlan
parsed (post RFS) or with l2-vlan unparsed (RFS). The vlan is pulled in
__netif_receive_skb_core, but we can still invoke flow dissector prior to
that when doing RFS (get_rps_cpu).

That's why have this 'if skb->vlan_present' check in the bpf_flow.c program
(and then also manually test for ETH_P_8021Q/ETH_P_8021AD).

Let me try to post the patches to bpf tree somewhere this week, we
can discuss the API changes there.

> In case of with-skb it's already after eth header was pulled?
> In such case skb-less should be different program type or
> both should point at the same point.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 16:45             ` Stanislav Fomichev
@ 2019-03-26 17:48               ` Alexei Starovoitov
  2019-03-26 17:51                 ` Willem de Bruijn
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-26 17:48 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, bpf, davem, ast, daniel,
	simon.horman, willemb, peterpenkov96

On Tue, Mar 26, 2019 at 09:45:29AM -0700, Stanislav Fomichev wrote:
> On 03/25, Alexei Starovoitov wrote:
> > On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> > > On 03/22, Alexei Starovoitov wrote:
> > > > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > > > Are we ok with breaking api in this case? I'm all in on removing this
> > > > > extra information. We can always put it back if somebody complains (and
> > > > > manually parse in eth_get_headlen case).
> > > > 
> > > > Fine. That seems to be the only way forward to clean it all up.
> > > > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > > > Patch 3 looks like candidate as well?
> > > SGTM, will do. Let me also spend some time and do a simple test for
> > > the vlan case, to make sure I didn't miss something important.
> > > One question here though: would I need to wait for bpf and bpf-next
> > > to re-merge to continues the series? Or we can cherry-pick those
> > > patches to bpf-next as well (and git will work it out during the
> > > merge)?
> > > 
> > > > > We can still have protocol, because in both skb/skb-less cases we have
> > > > > it.
> > > > 
> > > > proto can work in both cases, but is it needed ? Does program benefit from it?
> > > > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > > > May be drop it as well?
> > > I feel like the program benefits from it, there is no need to go back and
> > > re-parse that (and in the skb case, this data is already pulled). I was
> > > also thinking about re-purposing flow_keys->n_proto for that (instead
> > > of skb->protocol), so it functions as input and output, maybe that's a
> > > more clear way to do it.
> > 
> > Are you saying that skb-less and skb flow dissector progs are looking
> > at different positions into the packet ?
> No, sorry for confusion, they are both called to parse (optional) L2-vlan
> and L3+ headers. However, with-skb case can be called with l2-vlan
> parsed (post RFS) or with l2-vlan unparsed (RFS). The vlan is pulled in
> __netif_receive_skb_core, but we can still invoke flow dissector prior to
> that when doing RFS (get_rps_cpu).
> 
> That's why have this 'if skb->vlan_present' check in the bpf_flow.c program
> (and then also manually test for ETH_P_8021Q/ETH_P_8021AD).
> 
> Let me try to post the patches to bpf tree somewhere this week, we
> can discuss the API changes there.

let's figure out what we disable for bpf/stable first.
Sound like skb->protocol isn't helping.
prog has to read it from eth header anyway. then let's drop it from ctx?
skb->vlan_present also seems to be misleading.
For normal processing skb->vlan_present means that there is a vlan hdr
in the packet, but for flow dissector is some sort of weird hint
whether skb-based dissector was pre- or post- rfs.
I think we need make vlan semantics unambiguous first.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 17:48               ` Alexei Starovoitov
@ 2019-03-26 17:51                 ` Willem de Bruijn
  2019-03-26 18:08                   ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Willem de Bruijn @ 2019-03-26 17:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn, Petar Penkov

On Tue, Mar 26, 2019 at 1:48 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 26, 2019 at 09:45:29AM -0700, Stanislav Fomichev wrote:
> > On 03/25, Alexei Starovoitov wrote:
> > > On Sat, Mar 23, 2019 at 09:05:31AM -0700, Stanislav Fomichev wrote:
> > > > On 03/22, Alexei Starovoitov wrote:
> > > > > On Fri, Mar 22, 2019 at 06:19:57PM -0700, Stanislav Fomichev wrote:
> > > > > > Are we ok with breaking api in this case? I'm all in on removing this
> > > > > > extra information. We can always put it back if somebody complains (and
> > > > > > manually parse in eth_get_headlen case).
> > > > >
> > > > > Fine. That seems to be the only way forward to clean it all up.
> > > > > Could you submit patch 1 to bpf tree disallowing vlan fields?
> > > > > Patch 3 looks like candidate as well?
> > > > SGTM, will do. Let me also spend some time and do a simple test for
> > > > the vlan case, to make sure I didn't miss something important.
> > > > One question here though: would I need to wait for bpf and bpf-next
> > > > to re-merge to continues the series? Or we can cherry-pick those
> > > > patches to bpf-next as well (and git will work it out during the
> > > > merge)?
> > > >
> > > > > > We can still have protocol, because in both skb/skb-less cases we have
> > > > > > it.
> > > > >
> > > > > proto can work in both cases, but is it needed ? Does program benefit from it?
> > > > > The kernel side burns extra bytes by copying it and extra branches to handle it.
> > > > > May be drop it as well?
> > > > I feel like the program benefits from it, there is no need to go back and
> > > > re-parse that (and in the skb case, this data is already pulled). I was
> > > > also thinking about re-purposing flow_keys->n_proto for that (instead
> > > > of skb->protocol), so it functions as input and output, maybe that's a
> > > > more clear way to do it.
> > >
> > > Are you saying that skb-less and skb flow dissector progs are looking
> > > at different positions into the packet ?
> > No, sorry for confusion, they are both called to parse (optional) L2-vlan
> > and L3+ headers. However, with-skb case can be called with l2-vlan
> > parsed (post RFS) or with l2-vlan unparsed (RFS). The vlan is pulled in
> > __netif_receive_skb_core, but we can still invoke flow dissector prior to
> > that when doing RFS (get_rps_cpu).
> >
> > That's why have this 'if skb->vlan_present' check in the bpf_flow.c program
> > (and then also manually test for ETH_P_8021Q/ETH_P_8021AD).
> >
> > Let me try to post the patches to bpf tree somewhere this week, we
> > can discuss the API changes there.
>
> let's figure out what we disable for bpf/stable first.
> Sound like skb->protocol isn't helping.
> prog has to read it from eth header anyway. then let's drop it from ctx?

The C flow dissector does not parse link layer headers It starts with
proto either given as argument or derived from skb->protocol (or
skb->vlan_proto).

The BPF flow dissector should work the same. It is fine to pass the
data including ethernet header, but parsing can start at nhoff with
proto explicitly passed.

We should not assume Ethernet link layer.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 17:51                 ` Willem de Bruijn
@ 2019-03-26 18:08                   ` Alexei Starovoitov
  2019-03-26 18:17                     ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-26 18:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> The BPF flow dissector should work the same. It is fine to pass the
> data including ethernet header, but parsing can start at nhoff with
> proto explicitly passed.
>
> We should not assume Ethernet link layer.

then skb-less dissector has to be different program type
because semantics are different.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 18:08                   ` Alexei Starovoitov
@ 2019-03-26 18:17                     ` Stanislav Fomichev
  2019-03-26 18:30                       ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-26 18:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On 03/26, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> > The BPF flow dissector should work the same. It is fine to pass the
> > data including ethernet header, but parsing can start at nhoff with
> > proto explicitly passed.
> >
> > We should not assume Ethernet link layer.
> 
> then skb-less dissector has to be different program type
> because semantics are different.
The semantics are the same as for c-based __skb_flow_dissect.
We just need to pass nhoff and proto that has been passed to
__skb_flow_dissect to the bpf program. In case of with-skb,
take this initial data from skb, like __skb_flow_dissect does (and don't
ask BPF program to do it essentially):

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763

I was thinking of passing proto as flow_keys->n_proto and we already
pass flow_keys->nhoff, so no need to do anything for it. With that,
BPF program doesn't need to look into skb and can parse optional vlan
and L3+ headers. The same way __skb_flow_dissect does that.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 18:17                     ` Stanislav Fomichev
@ 2019-03-26 18:30                       ` Alexei Starovoitov
  2019-03-26 18:54                         ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-26 18:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> On 03/26, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > > The BPF flow dissector should work the same. It is fine to pass the
> > > data including ethernet header, but parsing can start at nhoff with
> > > proto explicitly passed.
> > >
> > > We should not assume Ethernet link layer.
> > 
> > then skb-less dissector has to be different program type
> > because semantics are different.
> The semantics are the same as for c-based __skb_flow_dissect.
> We just need to pass nhoff and proto that has been passed to
> __skb_flow_dissect to the bpf program. In case of with-skb,
> take this initial data from skb, like __skb_flow_dissect does (and don't
> ask BPF program to do it essentially):
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> 
> I was thinking of passing proto as flow_keys->n_proto and we already
> pass flow_keys->nhoff, so no need to do anything for it. With that,
> BPF program doesn't need to look into skb and can parse optional vlan
> and L3+ headers. The same way __skb_flow_dissect does that.

makes sense. then I'd also prefer for proto to be in flow_keys to
high light this difference.
may be add vlan_proto/present/tci there as well?
At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 18:30                       ` Alexei Starovoitov
@ 2019-03-26 18:54                         ` Stanislav Fomichev
  2019-03-27  1:41                           ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-26 18:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On 03/26, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > On 03/26, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > data including ethernet header, but parsing can start at nhoff with
> > > > proto explicitly passed.
> > > >
> > > > We should not assume Ethernet link layer.
> > > 
> > > then skb-less dissector has to be different program type
> > > because semantics are different.
> > The semantics are the same as for c-based __skb_flow_dissect.
> > We just need to pass nhoff and proto that has been passed to
> > __skb_flow_dissect to the bpf program. In case of with-skb,
> > take this initial data from skb, like __skb_flow_dissect does (and don't
> > ask BPF program to do it essentially):
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > 
> > I was thinking of passing proto as flow_keys->n_proto and we already
> > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > BPF program doesn't need to look into skb and can parse optional vlan
> > and L3+ headers. The same way __skb_flow_dissect does that.
> 
> makes sense. then I'd also prefer for proto to be in flow_keys to
> high light this difference.
Maybe rename existing flow_keys->n_proto to flow_keys->proto?
That would match __skb_flow_dissect and remove ambiguity with both proto
and n_proto in flow_keys.

> may be add vlan_proto/present/tci there as well?
> At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
Why do you think we need them? My understanding was that when
skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
that vlan info has been already parsed out of the packet and stored in
the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
points to proper L3 header.

If that's correct, BPF flow dissector should not care about that. For
example, look at how C-based flow dissector does that:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944

If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
and move on.

But, we would need vlan_proto/present/tci in the flow_keys in the future.
We don't currently return parsed vlan data from the BPF flow dissector.
But it feels like it's getting into bpf-next territory :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-26 18:54                         ` Stanislav Fomichev
@ 2019-03-27  1:41                           ` Alexei Starovoitov
  2019-03-27  2:44                             ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-27  1:41 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> On 03/26, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > On 03/26, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > proto explicitly passed.
> > > > >
> > > > > We should not assume Ethernet link layer.
> > > > 
> > > > then skb-less dissector has to be different program type
> > > > because semantics are different.
> > > The semantics are the same as for c-based __skb_flow_dissect.
> > > We just need to pass nhoff and proto that has been passed to
> > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > ask BPF program to do it essentially):
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > 
> > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > BPF program doesn't need to look into skb and can parse optional vlan
> > > and L3+ headers. The same way __skb_flow_dissect does that.
> > 
> > makes sense. then I'd also prefer for proto to be in flow_keys to
> > high light this difference.
> Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> That would match __skb_flow_dissect and remove ambiguity with both proto
> and n_proto in flow_keys.

disabling useless fields in ctx is one thing, since probability of breaking users
is low, but renaming n_proto is imo too much.

> > may be add vlan_proto/present/tci there as well?
> > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> Why do you think we need them? My understanding was that when
> skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> that vlan info has been already parsed out of the packet and stored in
> the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> points to proper L3 header.
> 
> If that's correct, BPF flow dissector should not care about that. For
> example, look at how C-based flow dissector does that:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> 
> If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> and move on.
> 
> But, we would need vlan_proto/present/tci in the flow_keys in the future.
> We don't currently return parsed vlan data from the BPF flow dissector.
> But it feels like it's getting into bpf-next territory :-)

Whether ctx->data points to L2 or L3 is uapi regardless whether
progs/bpf_flow.c is relying on that or not.
So far I think you're saying that in all three cases:
no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
This has to be preserved.
Only now after reading bpf_flow.c for Nth time I realized what semantics
you gave to skb->vlan* and skb->protocol fields. All of them have
to be kept as-is.
For no-skb cases all of them should be available with the same logic
and it has to documented, since it's different from other bpf progs
that access these fields.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-27  1:41                           ` Alexei Starovoitov
@ 2019-03-27  2:44                             ` Stanislav Fomichev
  2019-03-27 17:55                               ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-27  2:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On 03/26, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > On 03/26, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > On 03/26, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > proto explicitly passed.
> > > > > >
> > > > > > We should not assume Ethernet link layer.
> > > > > 
> > > > > then skb-less dissector has to be different program type
> > > > > because semantics are different.
> > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > We just need to pass nhoff and proto that has been passed to
> > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > ask BPF program to do it essentially):
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > 
> > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > 
> > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > high light this difference.
> > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > That would match __skb_flow_dissect and remove ambiguity with both proto
> > and n_proto in flow_keys.
> 
> disabling useless fields in ctx is one thing, since probability of breaking users
> is low, but renaming n_proto is imo too much.
> 
> > > may be add vlan_proto/present/tci there as well?
> > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > Why do you think we need them? My understanding was that when
> > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > that vlan info has been already parsed out of the packet and stored in
> > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > points to proper L3 header.
> > 
> > If that's correct, BPF flow dissector should not care about that. For
> > example, look at how C-based flow dissector does that:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > 
> > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > and move on.
> > 
> > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > We don't currently return parsed vlan data from the BPF flow dissector.
> > But it feels like it's getting into bpf-next territory :-)
> 
> Whether ctx->data points to L2 or L3 is uapi regardless whether
> progs/bpf_flow.c is relying on that or not.
> So far I think you're saying that in all three cases:
> no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> This has to be preserved.
It points to L3 (or vlan). And this will be preserved, I have no
intention to change that.

Just to make sure, we are on the same page, here is what
__skb_flow_dissect (and BPF prog) is seeing in nhoff.

NO-VLAN is always the same for both with-skb/no-skb:
+----+----+-----+--+
|DMAC|SMAC|PROTO|L3|
+----+----+-----+--+
                 ^
                 +-- nhoff
                     proto = PROTO

VLAN no-skb (eth_get_headlen):
+----+----+----+---+-----+--+
|DMAC|SMAC|TPID|TCI|PROTO|L3|
+----+----+----+---+-----+--+
                ^
                +-- nhoff
                    proto = TPID

VLAN with-skb, RFS (pre __netif_receive_skb_core):
+----+----+----+---+-----+--+
|DMAC|SMAC|TPID|TCI|PROTO|L3|
+----+----+----+---+-----+--+
                ^
                +-- nhoff
                    proto = TPID

VLAN with-skb, post RFS (post __netif_receive_skb_core / skb_vlan_untag):
+----+----+----+---+-----+--+
|DMAC|SMAC|TPID|TCI|PROTO|L3|
+----+----+----+---+-----+--+
                          ^
                          +-- nhoff
			      proto = PROTO

And in the last case, networking stack sets:
 * skb->vlan_present to true
 * skb->vlan_proto to TPID
 * skb->vlan_tci to TCI
 * skb->protocol to PROTO
 * pulls vlan header, so skb->data points to L3 header

> Only now after reading bpf_flow.c for Nth time I realized what semantics
> you gave to skb->vlan* and skb->protocol fields. All of them have
> to be kept as-is.
Don't read too much into current bpf_flow.c, I don't think it really
works with vlans in all the cases :-/

It always looks back, assuming post RFS situation; that needs to be
changed by dropping that "if (!skb->vlan_present)" and just looking
into input 'proto' (and optionally parsing vlan hdr if proto ==
802.1q/ad, which we already, sort of, do).

I'm gonna add a small testcase for BPF_PROG_TEST_RUN.

> For no-skb cases all of them should be available with the same logic
> and it has to documented, since it's different from other bpf progs
> that access these fields.
I feel like dropping those vlan_{present,proto,tci} from bpf flow dissector.
It should not care what's in the skb and should just rely on the input 'proto'
to optionally parse vlan header.

+1 on documenting all of that

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-27  2:44                             ` Stanislav Fomichev
@ 2019-03-27 17:55                               ` Alexei Starovoitov
  2019-03-27 19:58                                 ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-27 17:55 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> On 03/26, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > On 03/26, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > proto explicitly passed.
> > > > > > >
> > > > > > > We should not assume Ethernet link layer.
> > > > > > 
> > > > > > then skb-less dissector has to be different program type
> > > > > > because semantics are different.
> > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > We just need to pass nhoff and proto that has been passed to
> > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > ask BPF program to do it essentially):
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > 
> > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > 
> > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > high light this difference.
> > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > and n_proto in flow_keys.
> > 
> > disabling useless fields in ctx is one thing, since probability of breaking users
> > is low, but renaming n_proto is imo too much.
> > 
> > > > may be add vlan_proto/present/tci there as well?
> > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > Why do you think we need them? My understanding was that when
> > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > that vlan info has been already parsed out of the packet and stored in
> > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > points to proper L3 header.
> > > 
> > > If that's correct, BPF flow dissector should not care about that. For
> > > example, look at how C-based flow dissector does that:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > 
> > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > and move on.
> > > 
> > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > But it feels like it's getting into bpf-next territory :-)
> > 
> > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > progs/bpf_flow.c is relying on that or not.
> > So far I think you're saying that in all three cases:
> > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > This has to be preserved.
> It points to L3 (or vlan). And this will be preserved, I have no
> intention to change that.
> 
> Just to make sure, we are on the same page, here is what
> __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> 
> NO-VLAN is always the same for both with-skb/no-skb:
> +----+----+-----+--+
> |DMAC|SMAC|PROTO|L3|
> +----+----+-----+--+
>                  ^
>                  +-- nhoff
>                      proto = PROTO
> 
> VLAN no-skb (eth_get_headlen):
> +----+----+----+---+-----+--+
> |DMAC|SMAC|TPID|TCI|PROTO|L3|
> +----+----+----+---+-----+--+
>                 ^
>                 +-- nhoff
>                     proto = TPID

where ctx->data will point to ?
These nhoff differences are fine.
I want to make sure that ctx->data is the same for all.

> And in the last case, networking stack sets:
>  * skb->vlan_present to true
>  * skb->vlan_proto to TPID
>  * skb->vlan_tci to TCI
>  * skb->protocol to PROTO
>  * pulls vlan header, so skb->data points to L3 header

then skb-less should point after L2 as well
or point to L2 for all.

I'm arguing about it because early on during tc-bpf days
we had this discrepancy and it caused issues.
Thankfully we addressed it before it became long term headache.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-27 17:55                               ` Alexei Starovoitov
@ 2019-03-27 19:58                                 ` Stanislav Fomichev
  2019-03-28  1:26                                   ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-27 19:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On 03/27, Alexei Starovoitov wrote:
> On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > On 03/26, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > On 03/26, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > proto explicitly passed.
> > > > > > > >
> > > > > > > > We should not assume Ethernet link layer.
> > > > > > > 
> > > > > > > then skb-less dissector has to be different program type
> > > > > > > because semantics are different.
> > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > ask BPF program to do it essentially):
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > 
> > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > 
> > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > high light this difference.
> > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > and n_proto in flow_keys.
> > > 
> > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > is low, but renaming n_proto is imo too much.
> > > 
> > > > > may be add vlan_proto/present/tci there as well?
> > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > Why do you think we need them? My understanding was that when
> > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > that vlan info has been already parsed out of the packet and stored in
> > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > points to proper L3 header.
> > > > 
> > > > If that's correct, BPF flow dissector should not care about that. For
> > > > example, look at how C-based flow dissector does that:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > 
> > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > and move on.
> > > > 
> > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > But it feels like it's getting into bpf-next territory :-)
> > > 
> > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > progs/bpf_flow.c is relying on that or not.
> > > So far I think you're saying that in all three cases:
> > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > This has to be preserved.
> > It points to L3 (or vlan). And this will be preserved, I have no
> > intention to change that.
> > 
> > Just to make sure, we are on the same page, here is what
> > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > 
> > NO-VLAN is always the same for both with-skb/no-skb:
> > +----+----+-----+--+
> > |DMAC|SMAC|PROTO|L3|
> > +----+----+-----+--+
> >                  ^
> >                  +-- nhoff
> >                      proto = PROTO
> > 
> > VLAN no-skb (eth_get_headlen):
> > +----+----+----+---+-----+--+
> > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > +----+----+----+---+-----+--+
> >                 ^
> >                 +-- nhoff
> >                     proto = TPID
> 
> where ctx->data will point to ?
> These nhoff differences are fine.
> I want to make sure that ctx->data is the same for all.
For with-skb, nhoff would be zero, and ctx->data would point to
TCI/L3.
For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
non-zero (TCI/L3 offset).

If you want, for skb-less case, when calling BPF program we can do the math
ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
But I'm not sure whether we need to do that; flow dissector is supposed
to look at ctx->data + nhoff, it should not matter what each individual
value is, they only make sense together.

> > And in the last case, networking stack sets:
> >  * skb->vlan_present to true
> >  * skb->vlan_proto to TPID
> >  * skb->vlan_tci to TCI
> >  * skb->protocol to PROTO
> >  * pulls vlan header, so skb->data points to L3 header
> 
> then skb-less should point after L2 as well
> or point to L2 for all.
Sure, we can do that, see above, we can manually move ctx->data to nhoff
and set nhoff to zero.

> I'm arguing about it because early on during tc-bpf days
> we had this discrepancy and it caused issues.
> Thankfully we addressed it before it became long term headache.
Ack, that's what we are trying to do here as well :-) Address all
existing shortcomings before it's too late.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-27 19:58                                 ` Stanislav Fomichev
@ 2019-03-28  1:26                                   ` Alexei Starovoitov
  2019-03-28  3:14                                     ` Willem de Bruijn
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-28  1:26 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> On 03/27, Alexei Starovoitov wrote:
> > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > On 03/26, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > proto explicitly passed.
> > > > > > > > >
> > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > 
> > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > because semantics are different.
> > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > ask BPF program to do it essentially):
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > 
> > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > 
> > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > high light this difference.
> > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > and n_proto in flow_keys.
> > > > 
> > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > is low, but renaming n_proto is imo too much.
> > > > 
> > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > Why do you think we need them? My understanding was that when
> > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > points to proper L3 header.
> > > > > 
> > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > example, look at how C-based flow dissector does that:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > 
> > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > and move on.
> > > > > 
> > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > But it feels like it's getting into bpf-next territory :-)
> > > > 
> > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > progs/bpf_flow.c is relying on that or not.
> > > > So far I think you're saying that in all three cases:
> > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > This has to be preserved.
> > > It points to L3 (or vlan). And this will be preserved, I have no
> > > intention to change that.
> > > 
> > > Just to make sure, we are on the same page, here is what
> > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > 
> > > NO-VLAN is always the same for both with-skb/no-skb:
> > > +----+----+-----+--+
> > > |DMAC|SMAC|PROTO|L3|
> > > +----+----+-----+--+
> > >                  ^
> > >                  +-- nhoff
> > >                      proto = PROTO
> > > 
> > > VLAN no-skb (eth_get_headlen):
> > > +----+----+----+---+-----+--+
> > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > +----+----+----+---+-----+--+
> > >                 ^
> > >                 +-- nhoff
> > >                     proto = TPID
> > 
> > where ctx->data will point to ?
> > These nhoff differences are fine.
> > I want to make sure that ctx->data is the same for all.
> For with-skb, nhoff would be zero, and ctx->data would point to
> TCI/L3.
> For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> non-zero (TCI/L3 offset).
> 
> If you want, for skb-less case, when calling BPF program we can do the math
> ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> But I'm not sure whether we need to do that; flow dissector is supposed
> to look at ctx->data + nhoff, it should not matter what each individual
> value is, they only make sense together.

My strong preference is to have data to point to L2 in all cases.
Semantics of requiring bpf prog to start processing from a tuple
(data + nhoff) where both point to random places is very confusing.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-28  1:26                                   ` Alexei Starovoitov
@ 2019-03-28  3:14                                     ` Willem de Bruijn
  2019-03-28  3:32                                       ` Alexei Starovoitov
  0 siblings, 1 reply; 32+ messages in thread
From: Willem de Bruijn @ 2019-03-28  3:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Wed, Mar 27, 2019 at 9:26 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> > On 03/27, Alexei Starovoitov wrote:
> > > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > > On 03/26, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > > proto explicitly passed.
> > > > > > > > > >
> > > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > >
> > > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > > because semantics are different.
> > > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > > ask BPF program to do it essentially):
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > >
> > > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > >
> > > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > > high light this difference.
> > > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > > and n_proto in flow_keys.
> > > > >
> > > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > > is low, but renaming n_proto is imo too much.
> > > > >
> > > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > > Why do you think we need them? My understanding was that when
> > > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > > points to proper L3 header.
> > > > > >
> > > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > > example, look at how C-based flow dissector does that:
> > > > > >
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > >
> > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > and move on.
> > > > > >
> > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > >
> > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > progs/bpf_flow.c is relying on that or not.
> > > > > So far I think you're saying that in all three cases:
> > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > This has to be preserved.
> > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > intention to change that.
> > > >
> > > > Just to make sure, we are on the same page, here is what
> > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > >
> > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > +----+----+-----+--+
> > > > |DMAC|SMAC|PROTO|L3|
> > > > +----+----+-----+--+
> > > >                  ^
> > > >                  +-- nhoff
> > > >                      proto = PROTO
> > > >
> > > > VLAN no-skb (eth_get_headlen):
> > > > +----+----+----+---+-----+--+
> > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > +----+----+----+---+-----+--+
> > > >                 ^
> > > >                 +-- nhoff
> > > >                     proto = TPID
> > >
> > > where ctx->data will point to ?
> > > These nhoff differences are fine.
> > > I want to make sure that ctx->data is the same for all.
> > For with-skb, nhoff would be zero, and ctx->data would point to
> > TCI/L3.
> > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > non-zero (TCI/L3 offset).
> >
> > If you want, for skb-less case, when calling BPF program we can do the math
> > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > But I'm not sure whether we need to do that; flow dissector is supposed
> > to look at ctx->data + nhoff, it should not matter what each individual
> > value is, they only make sense together.
>
> My strong preference is to have data to point to L2 in all cases.
> Semantics of requiring bpf prog to start processing from a tuple
> (data + nhoff) where both point to random places is very confusing.

Since flow dissection starts at the network layer, I would then
suggest data always at L3 and nhoff 0.

This can be derived in the same manner as __skb_flow_dissect
already does if !data, using only skb_network_offset.

From a quick scan, skb_mac_offset should also be valid in all cases
where the flow dissector is called today, so the other can be computed, too.

But this is less obvious. For instance, tun_get_user calls into the flow
dissector up to three times (wow) and IFF_TUN has no link layer
(ARPHRD_NONE). And then there are also fun variable length link layer
protocols to deal with..

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-28  3:14                                     ` Willem de Bruijn
@ 2019-03-28  3:32                                       ` Alexei Starovoitov
  2019-03-28  4:17                                         ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Alexei Starovoitov @ 2019-03-28  3:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On Wed, Mar 27, 2019 at 11:14:46PM -0400, Willem de Bruijn wrote:
> On Wed, Mar 27, 2019 at 9:26 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> > > On 03/27, Alexei Starovoitov wrote:
> > > > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > > > proto explicitly passed.
> > > > > > > > > > >
> > > > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > > >
> > > > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > > > because semantics are different.
> > > > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > > > ask BPF program to do it essentially):
> > > > > > > > >
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > > >
> > > > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > > >
> > > > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > > > high light this difference.
> > > > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > > > and n_proto in flow_keys.
> > > > > >
> > > > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > > > is low, but renaming n_proto is imo too much.
> > > > > >
> > > > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > > > Why do you think we need them? My understanding was that when
> > > > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > > > points to proper L3 header.
> > > > > > >
> > > > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > > > example, look at how C-based flow dissector does that:
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > > >
> > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > and move on.
> > > > > > >
> > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > >
> > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > So far I think you're saying that in all three cases:
> > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > This has to be preserved.
> > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > intention to change that.
> > > > >
> > > > > Just to make sure, we are on the same page, here is what
> > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > >
> > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > +----+----+-----+--+
> > > > > |DMAC|SMAC|PROTO|L3|
> > > > > +----+----+-----+--+
> > > > >                  ^
> > > > >                  +-- nhoff
> > > > >                      proto = PROTO
> > > > >
> > > > > VLAN no-skb (eth_get_headlen):
> > > > > +----+----+----+---+-----+--+
> > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > +----+----+----+---+-----+--+
> > > > >                 ^
> > > > >                 +-- nhoff
> > > > >                     proto = TPID
> > > >
> > > > where ctx->data will point to ?
> > > > These nhoff differences are fine.
> > > > I want to make sure that ctx->data is the same for all.
> > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > TCI/L3.
> > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > non-zero (TCI/L3 offset).
> > >
> > > If you want, for skb-less case, when calling BPF program we can do the math
> > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > to look at ctx->data + nhoff, it should not matter what each individual
> > > value is, they only make sense together.
> >
> > My strong preference is to have data to point to L2 in all cases.
> > Semantics of requiring bpf prog to start processing from a tuple
> > (data + nhoff) where both point to random places is very confusing.
> 
> Since flow dissection starts at the network layer, I would then
> suggest data always at L3 and nhoff 0.
> 
> This can be derived in the same manner as __skb_flow_dissect
> already does if !data, using only skb_network_offset.
> 
> From a quick scan, skb_mac_offset should also be valid in all cases
> where the flow dissector is called today, so the other can be computed, too.
> 
> But this is less obvious. For instance, tun_get_user calls into the flow
> dissector up to three times (wow) and IFF_TUN has no link layer
> (ARPHRD_NONE). And then there are also fun variable length link layer
> protocols to deal with..

ahh. ok. Can we guarantee some stable position?
Current bpf_flow_dissect_get_header assumes that
ctx->data + ctx->flow_keys->thoff point to IP, right?
Based on what Stanislav saying above even that is not a guarantee?
I'm struggling to see how users can wrap their heads around this.
It seems bpf_flow.c will become the only prog that can deal with
this range of possible inputs.

I propose to start with the doc that describes all cases, where
things point to and how prog suppose to parse that.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-28  3:32                                       ` Alexei Starovoitov
@ 2019-03-28  4:17                                         ` Stanislav Fomichev
  2019-03-28 12:58                                           ` Willem de Bruijn
  0 siblings, 1 reply; 32+ messages in thread
From: Stanislav Fomichev @ 2019-03-28  4:17 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On 03/27, Alexei Starovoitov wrote:
> On Wed, Mar 27, 2019 at 11:14:46PM -0400, Willem de Bruijn wrote:
> > On Wed, Mar 27, 2019 at 9:26 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Mar 27, 2019 at 12:58:20PM -0700, Stanislav Fomichev wrote:
> > > > On 03/27, Alexei Starovoitov wrote:
> > > > > On Tue, Mar 26, 2019 at 07:44:21PM -0700, Stanislav Fomichev wrote:
> > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > On Tue, Mar 26, 2019 at 11:54:56AM -0700, Stanislav Fomichev wrote:
> > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > On Tue, Mar 26, 2019 at 11:17:19AM -0700, Stanislav Fomichev wrote:
> > > > > > > > > > On 03/26, Alexei Starovoitov wrote:
> > > > > > > > > > > On Tue, Mar 26, 2019 at 10:52 AM Willem de Bruijn
> > > > > > > > > > > <willemdebruijn.kernel@gmail.com> wrote:
> > > > > > > > > > > > The BPF flow dissector should work the same. It is fine to pass the
> > > > > > > > > > > > data including ethernet header, but parsing can start at nhoff with
> > > > > > > > > > > > proto explicitly passed.
> > > > > > > > > > > >
> > > > > > > > > > > > We should not assume Ethernet link layer.
> > > > > > > > > > >
> > > > > > > > > > > then skb-less dissector has to be different program type
> > > > > > > > > > > because semantics are different.
> > > > > > > > > > The semantics are the same as for c-based __skb_flow_dissect.
> > > > > > > > > > We just need to pass nhoff and proto that has been passed to
> > > > > > > > > > __skb_flow_dissect to the bpf program. In case of with-skb,
> > > > > > > > > > take this initial data from skb, like __skb_flow_dissect does (and don't
> > > > > > > > > > ask BPF program to do it essentially):
> > > > > > > > > >
> > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n763
> > > > > > > > > >
> > > > > > > > > > I was thinking of passing proto as flow_keys->n_proto and we already
> > > > > > > > > > pass flow_keys->nhoff, so no need to do anything for it. With that,
> > > > > > > > > > BPF program doesn't need to look into skb and can parse optional vlan
> > > > > > > > > > and L3+ headers. The same way __skb_flow_dissect does that.
> > > > > > > > >
> > > > > > > > > makes sense. then I'd also prefer for proto to be in flow_keys to
> > > > > > > > > high light this difference.
> > > > > > > > Maybe rename existing flow_keys->n_proto to flow_keys->proto?
> > > > > > > > That would match __skb_flow_dissect and remove ambiguity with both proto
> > > > > > > > and n_proto in flow_keys.
> > > > > > >
> > > > > > > disabling useless fields in ctx is one thing, since probability of breaking users
> > > > > > > is low, but renaming n_proto is imo too much.
> > > > > > >
> > > > > > > > > may be add vlan_proto/present/tci there as well?
> > > > > > > > > At least on the kernel side ctx rewriter will be the same for w/ & w/o skb cases.
> > > > > > > > Why do you think we need them? My understanding was that when
> > > > > > > > skb_vlan_tag_present(skb) (or skb->vlan_present) returns true, that means
> > > > > > > > that vlan info has been already parsed out of the packet and stored in
> > > > > > > > the vlan_tci/vlan_proto (where vlan_proto is 8021Q/8021AD); skb data
> > > > > > > > points to proper L3 header.
> > > > > > > >
> > > > > > > > If that's correct, BPF flow dissector should not care about that. For
> > > > > > > > example, look at how C-based flow dissector does that:
> > > > > > > >
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/tree/net/core/flow_dissector.c#n944
> > > > > > > >
> > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > and move on.
> > > > > > > >
> > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > >
> > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > So far I think you're saying that in all three cases:
> > > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > This has to be preserved.
> > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > intention to change that.
> > > > > >
> > > > > > Just to make sure, we are on the same page, here is what
> > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > >
> > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > +----+----+-----+--+
> > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > +----+----+-----+--+
> > > > > >                  ^
> > > > > >                  +-- nhoff
> > > > > >                      proto = PROTO
> > > > > >
> > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > +----+----+----+---+-----+--+
> > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > +----+----+----+---+-----+--+
> > > > > >                 ^
> > > > > >                 +-- nhoff
> > > > > >                     proto = TPID
> > > > >
> > > > > where ctx->data will point to ?
> > > > > These nhoff differences are fine.
> > > > > I want to make sure that ctx->data is the same for all.
> > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > TCI/L3.
> > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > non-zero (TCI/L3 offset).
> > > >
> > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > value is, they only make sense together.
> > >
> > > My strong preference is to have data to point to L2 in all cases.
> > > Semantics of requiring bpf prog to start processing from a tuple
> > > (data + nhoff) where both point to random places is very confusing.
> > 
> > Since flow dissection starts at the network layer, I would then
> > suggest data always at L3 and nhoff 0.
For eth_get_headlen we need to manually parse 802.1q header. And for RFS
case as well (unless I'm missing something).

> > This can be derived in the same manner as __skb_flow_dissect
> > already does if !data, using only skb_network_offset.
> > 
> > From a quick scan, skb_mac_offset should also be valid in all cases
> > where the flow dissector is called today, so the other can be computed, too.
> > 
> > But this is less obvious. For instance, tun_get_user calls into the flow
> > dissector up to three times (wow) and IFF_TUN has no link layer
> > (ARPHRD_NONE). And then there are also fun variable length link layer
> > protocols to deal with..
> 
> ahh. ok. Can we guarantee some stable position?
I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
post RFS it will point to L3. The only thing we can do is to have
nhoff=0 (and adjust ctx->data accordingly) when the main bpf
flow dissector procedure is called. But that would require bringing
this new kernel context (bpf_flow_dissector) into bpf/stable.
(And it's not clear what's the benefit, since tail calls would still
have to look at that offset).

> Current bpf_flow_dissect_get_header assumes that
> ctx->data + ctx->flow_keys->thoff point to IP, right?
Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
And it's only for the "main" call; bpf program adjusts this thoff
to make sure that tail calls preserve some sense of progress (so it
eventually points to L4 and that's what we export back).

> Based on what Stanislav saying above even that is not a guarantee?
> I'm struggling to see how users can wrap their heads around this.
> It seems bpf_flow.c will become the only prog that can deal with
> this range of possible inputs.
> 
> I propose to start with the doc that describes all cases, where
> things point to and how prog suppose to parse that.
Yeah, that is what I was going to propose - add a doc along with the
patch series. I don't see how we can make it simple(r) at this point :-(
I can try to document everything so users don't have to read the
kernel code to understand how to write the bpf flow dissector programs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-28  4:17                                         ` Stanislav Fomichev
@ 2019-03-28 12:58                                           ` Willem de Bruijn
  2019-04-01 16:30                                             ` Stanislav Fomichev
  0 siblings, 1 reply; 32+ messages in thread
From: Willem de Bruijn @ 2019-03-28 12:58 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

> > > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > > and move on.
> > > > > > > > >
> > > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > > >
> > > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > > So far I think you're saying that in all three cases:
> > > > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > > This has to be preserved.
> > > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > > intention to change that.
> > > > > > >
> > > > > > > Just to make sure, we are on the same page, here is what
> > > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > > >
> > > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > > +----+----+-----+--+
> > > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > > +----+----+-----+--+
> > > > > > >                  ^
> > > > > > >                  +-- nhoff
> > > > > > >                      proto = PROTO
> > > > > > >
> > > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > > +----+----+----+---+-----+--+
> > > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > > +----+----+----+---+-----+--+
> > > > > > >                 ^
> > > > > > >                 +-- nhoff
> > > > > > >                     proto = TPID
> > > > > >
> > > > > > where ctx->data will point to ?
> > > > > > These nhoff differences are fine.
> > > > > > I want to make sure that ctx->data is the same for all.
> > > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > > TCI/L3.
> > > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > > non-zero (TCI/L3 offset).
> > > > >
> > > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > > value is, they only make sense together.
> > > >
> > > > My strong preference is to have data to point to L2 in all cases.
> > > > Semantics of requiring bpf prog to start processing from a tuple
> > > > (data + nhoff) where both point to random places is very confusing.
> > >
> > > Since flow dissection starts at the network layer, I would then
> > > suggest data always at L3 and nhoff 0.
> For eth_get_headlen we need to manually parse 802.1q header. And for RFS
> case as well (unless I'm missing something).
>
> > > This can be derived in the same manner as __skb_flow_dissect
> > > already does if !data, using only skb_network_offset.
> > >
> > > From a quick scan, skb_mac_offset should also be valid in all cases
> > > where the flow dissector is called today, so the other can be computed, too.
> > >
> > > But this is less obvious. For instance, tun_get_user calls into the flow
> > > dissector up to three times (wow) and IFF_TUN has no link layer
> > > (ARPHRD_NONE). And then there are also fun variable length link layer
> > > protocols to deal with..
> >
> > ahh. ok. Can we guarantee some stable position?
> I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
> post RFS it will point to L3. The only thing we can do is to have
> nhoff=0 (and adjust ctx->data accordingly) when the main bpf
> flow dissector procedure is called. But that would require bringing
> this new kernel context (bpf_flow_dissector) into bpf/stable.
> (And it's not clear what's the benefit, since tail calls would still
> have to look at that offset).

The flow dissector can be called also before and after tunneling, in
which case skb_network_offset points to an inner header. Or after
MPLS, which stumps a flow dissector called earlier as that has no
information about the encapsulated protocol.

I don't think that there should be a goal that flow dissection starts
at the same point in the packet for all callsites along the datapath.
As long as it always starts at a known ETH_P_.. type protocol header
the program should be able to parse that. That is how the non-BPF
flow dissector works.

> > Current bpf_flow_dissect_get_header assumes that
> > ctx->data + ctx->flow_keys->thoff point to IP, right?
> Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
> And it's only for the "main" call; bpf program adjusts this thoff
> to make sure that tail calls preserve some sense of progress (so it
> eventually points to L4 and that's what we export back).
>
> > Based on what Stanislav saying above even that is not a guarantee?
> > I'm struggling to see how users can wrap their heads around this.
> > It seems bpf_flow.c will become the only prog that can deal with
> > this range of possible inputs.
> >
> > I propose to start with the doc that describes all cases, where
> > things point to and how prog suppose to parse that.
> Yeah, that is what I was going to propose - add a doc along with the
> patch series. I don't see how we can make it simple(r) at this point :-(

Does it have to be simpler? A flow dissector should be ready to
dissect VLAN tags. That's the only complication here?



> I can try to document everything so users don't have to read the
> kernel code to understand how to write the bpf flow dissector programs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case
  2019-03-28 12:58                                           ` Willem de Bruijn
@ 2019-04-01 16:30                                             ` Stanislav Fomichev
  0 siblings, 0 replies; 32+ messages in thread
From: Stanislav Fomichev @ 2019-04-01 16:30 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Alexei Starovoitov, Stanislav Fomichev, Network Development, bpf,
	David Miller, Alexei Starovoitov, Daniel Borkmann, Simon Horman,
	Willem de Bruijn, Petar Penkov

On 03/28, Willem de Bruijn wrote:
> > > > > > > > > > If skb_vlan_tag_present(skb) returns true, we set proto to skb->protocol
> > > > > > > > > > and move on.
> > > > > > > > > >
> > > > > > > > > > But, we would need vlan_proto/present/tci in the flow_keys in the future.
> > > > > > > > > > We don't currently return parsed vlan data from the BPF flow dissector.
> > > > > > > > > > But it feels like it's getting into bpf-next territory :-)
> > > > > > > > >
> > > > > > > > > Whether ctx->data points to L2 or L3 is uapi regardless whether
> > > > > > > > > progs/bpf_flow.c is relying on that or not.
> > > > > > > > > So far I think you're saying that in all three cases:
> > > > > > > > > no-skb, skb befor rfs, skb after rfs ctx->data points to L2, right?
> > > > > > > > > This has to be preserved.
> > > > > > > > It points to L3 (or vlan). And this will be preserved, I have no
> > > > > > > > intention to change that.
> > > > > > > >
> > > > > > > > Just to make sure, we are on the same page, here is what
> > > > > > > > __skb_flow_dissect (and BPF prog) is seeing in nhoff.
> > > > > > > >
> > > > > > > > NO-VLAN is always the same for both with-skb/no-skb:
> > > > > > > > +----+----+-----+--+
> > > > > > > > |DMAC|SMAC|PROTO|L3|
> > > > > > > > +----+----+-----+--+
> > > > > > > >                  ^
> > > > > > > >                  +-- nhoff
> > > > > > > >                      proto = PROTO
> > > > > > > >
> > > > > > > > VLAN no-skb (eth_get_headlen):
> > > > > > > > +----+----+----+---+-----+--+
> > > > > > > > |DMAC|SMAC|TPID|TCI|PROTO|L3|
> > > > > > > > +----+----+----+---+-----+--+
> > > > > > > >                 ^
> > > > > > > >                 +-- nhoff
> > > > > > > >                     proto = TPID
> > > > > > >
> > > > > > > where ctx->data will point to ?
> > > > > > > These nhoff differences are fine.
> > > > > > > I want to make sure that ctx->data is the same for all.
> > > > > > For with-skb, nhoff would be zero, and ctx->data would point to
> > > > > > TCI/L3.
> > > > > > For skb-less, ctx->data would point to L2 (DMAC), and nhoff would be
> > > > > > non-zero (TCI/L3 offset).
> > > > > >
> > > > > > If you want, for skb-less case, when calling BPF program we can do the math
> > > > > > ourselves and set ctx->data to data + nhoff, and pass nhoff = 0.
> > > > > > But I'm not sure whether we need to do that; flow dissector is supposed
> > > > > > to look at ctx->data + nhoff, it should not matter what each individual
> > > > > > value is, they only make sense together.
> > > > >
> > > > > My strong preference is to have data to point to L2 in all cases.
> > > > > Semantics of requiring bpf prog to start processing from a tuple
> > > > > (data + nhoff) where both point to random places is very confusing.
> > > >
> > > > Since flow dissection starts at the network layer, I would then
> > > > suggest data always at L3 and nhoff 0.
> > For eth_get_headlen we need to manually parse 802.1q header. And for RFS
> > case as well (unless I'm missing something).
> >
> > > > This can be derived in the same manner as __skb_flow_dissect
> > > > already does if !data, using only skb_network_offset.
> > > >
> > > > From a quick scan, skb_mac_offset should also be valid in all cases
> > > > where the flow dissector is called today, so the other can be computed, too.
> > > >
> > > > But this is less obvious. For instance, tun_get_user calls into the flow
> > > > dissector up to three times (wow) and IFF_TUN has no link layer
> > > > (ARPHRD_NONE). And then there are also fun variable length link layer
> > > > protocols to deal with..
> > >
> > > ahh. ok. Can we guarantee some stable position?
> > I don't think so. Pre RFS ctx->data+nhoff can point to 802.1q header,
> > post RFS it will point to L3. The only thing we can do is to have
> > nhoff=0 (and adjust ctx->data accordingly) when the main bpf
> > flow dissector procedure is called. But that would require bringing
> > this new kernel context (bpf_flow_dissector) into bpf/stable.
> > (And it's not clear what's the benefit, since tail calls would still
> > have to look at that offset).
> 
> The flow dissector can be called also before and after tunneling, in
> which case skb_network_offset points to an inner header. Or after
> MPLS, which stumps a flow dissector called earlier as that has no
> information about the encapsulated protocol.
> 
> I don't think that there should be a goal that flow dissection starts
> at the same point in the packet for all callsites along the datapath.
> As long as it always starts at a known ETH_P_.. type protocol header
> the program should be able to parse that. That is how the non-BPF
> flow dissector works.
> 
> > > Current bpf_flow_dissect_get_header assumes that
> > > ctx->data + ctx->flow_keys->thoff point to IP, right?
> > Yes, mostly, except that if skb->protocol is 802.1q/ad, it's 802.1q header.
> > And it's only for the "main" call; bpf program adjusts this thoff
> > to make sure that tail calls preserve some sense of progress (so it
> > eventually points to L4 and that's what we export back).
> >
> > > Based on what Stanislav saying above even that is not a guarantee?
> > > I'm struggling to see how users can wrap their heads around this.
> > > It seems bpf_flow.c will become the only prog that can deal with
> > > this range of possible inputs.
> > >
> > > I propose to start with the doc that describes all cases, where
> > > things point to and how prog suppose to parse that.
> > Yeah, that is what I was going to propose - add a doc along with the
> > patch series. I don't see how we can make it simple(r) at this point :-(
> 
> Does it have to be simpler? A flow dissector should be ready to
> dissect VLAN tags. That's the only complication here?
I don't see how it can be made simpler. That's the context from which
existing __skb_flow_dissect is called and that's what we have to dissect
from the BPF as well. We can try to make nhoff to be 0 when the
dissector is called, that's probably the only simplification we can
attempt to do (but, as I said previously, it requires bringing
new kernel context to bpf/stable and seems more complicated than
necessary).

Let me prepare a series for bpf/stable with the small doc describing
BPF flow dissector environment. We can continue the discussion
from there :-)

> > I can try to document everything so users don't have to read the
> > kernel code to understand how to write the bpf flow dissector programs.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2019-04-01 16:30 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-22 19:58 [RFC bpf-next v3 0/8] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
2019-03-22 19:58 ` [RFC bpf-next v3 1/8] flow_dissector: allow access only to a subset of __sk_buff fields Stanislav Fomichev
2019-03-22 19:58 ` [RFC bpf-next v3 2/8] flow_dissector: switch kernel context to struct bpf_flow_dissector Stanislav Fomichev
2019-03-22 19:58 ` [RFC bpf-next v3 3/8] flow_dissector: fix clamping of BPF flow_keys for non-zero nhoff Stanislav Fomichev
2019-03-22 19:58 ` [RFC bpf-next v3 4/8] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
2019-03-22 19:59 ` [RFC bpf-next v3 5/8] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
2019-03-22 19:59 ` [RFC bpf-next v3 6/8] flow_dissector: handle no-skb use case Stanislav Fomichev
2019-03-23  1:00   ` Alexei Starovoitov
2019-03-23  1:19     ` Stanislav Fomichev
2019-03-23  1:41       ` Alexei Starovoitov
2019-03-23 16:05         ` Stanislav Fomichev
2019-03-26  0:35           ` Alexei Starovoitov
2019-03-26 16:45             ` Stanislav Fomichev
2019-03-26 17:48               ` Alexei Starovoitov
2019-03-26 17:51                 ` Willem de Bruijn
2019-03-26 18:08                   ` Alexei Starovoitov
2019-03-26 18:17                     ` Stanislav Fomichev
2019-03-26 18:30                       ` Alexei Starovoitov
2019-03-26 18:54                         ` Stanislav Fomichev
2019-03-27  1:41                           ` Alexei Starovoitov
2019-03-27  2:44                             ` Stanislav Fomichev
2019-03-27 17:55                               ` Alexei Starovoitov
2019-03-27 19:58                                 ` Stanislav Fomichev
2019-03-28  1:26                                   ` Alexei Starovoitov
2019-03-28  3:14                                     ` Willem de Bruijn
2019-03-28  3:32                                       ` Alexei Starovoitov
2019-03-28  4:17                                         ` Stanislav Fomichev
2019-03-28 12:58                                           ` Willem de Bruijn
2019-04-01 16:30                                             ` Stanislav Fomichev
2019-03-22 19:59 ` [RFC bpf-next v3 7/8] net: pass net argument to the eth_get_headlen Stanislav Fomichev
2019-03-22 19:59   ` [Intel-wired-lan] " Stanislav Fomichev
2019-03-22 19:59 ` [RFC bpf-next v3 8/8] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test Stanislav Fomichev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.