* [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
@ 2019-02-05 17:36 Stanislav Fomichev
  2019-02-05 17:36 ` [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers Stanislav Fomichev
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

Currently, when eth_get_headlen calls the flow dissector, it doesn't pass
any skb. Because we use the passed skb to look up the associated network
namespace (which tells us whether a BPF program is attached or not), we
always fall back to the C-based flow dissector in this case.

The goal of this patch series is to add a new network namespace argument
to eth_get_headlen and let BPF flow dissector programs work in the
skb-less case as well.

The series goes like this:
1. introduce __init_skb and __init_skb_shinfo; these will be used to
   initialize a temporary skb
2. introduce skb_net, which can be used to get the network namespace
   associated with an skb
3. add a new optional network namespace argument to __skb_flow_dissect and
   plumb it through the callers
4. add a new __flow_bpf_dissect which constructs a temporary on-stack skb
   (using __init_skb) and calls the BPF flow dissector program
5. convert the flow dissector BPF_PROG_TEST_RUN to skb-less mode to show
   that it works
6. add a selftest that makes sure that going past the packet bounds in
   bpf_skb_load_bytes with an on-stack skb doesn't cause any problems
7. add a new net namespace argument to eth_get_headlen and convert the
   callers

Stanislav Fomichev (7):
  net: introduce __init_skb and __init_skb_shinfo helpers
  net: introduce skb_net helper
  net: plumb network namespace into __skb_flow_dissect
  net: flow_dissector: handle no-skb use case
  bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
  selftests/bpf: add flow dissector bpf_skb_load_bytes helper test
  net: flow_dissector: pass net argument to the eth_get_headlen

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |   2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |   3 +-
 .../net/ethernet/hisilicon/hns3/hns3_enet.c   |   3 +-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   3 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c   |   2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   2 +-
 drivers/net/ethernet/intel/igb/igb_main.c     |   2 +-
 drivers/net/ethernet/intel/igc/igc_main.c     |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   2 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   3 +-
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |   3 +-
 drivers/net/tun.c                             |   3 +-
 include/linux/etherdevice.h                   |   2 +-
 include/linux/skbuff.h                        |  23 +++-
 net/bpf/test_run.c                            |  52 +++------
 net/core/flow_dissector.c                     | 105 +++++++++++++-----
 net/core/skbuff.c                             |  78 +++++++------
 net/ethernet/eth.c                            |   8 +-
 tools/testing/selftests/bpf/test_progs.c      |  49 ++++++++
 20 files changed, 227 insertions(+), 122 deletions(-)

-- 
2.20.1.611.gfbb209baf1-goog

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 20:18   ` Willem de Bruijn
  2019-02-05 17:36 ` [RFC bpf-next 2/7] net: introduce skb_net helper Stanislav Fomichev
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

__init_skb is essentially a version of __build_skb which accepts an skb as
an argument (instead of allocating one with kmem_cache_alloc).

__init_skb_shinfo initializes the shinfo.

No functional changes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 68 ++++++++++++++++++++----------------------
 2 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 831846617d07..ad883ab2762c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1001,6 +1001,7 @@ void kfree_skb_partial(struct sk_buff *skb, bool head_stolen);
 bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		      bool *fragstolen, int *delta_truesize);
 
+void __init_skb(struct sk_buff *skb, u8 *data, unsigned int size);
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
 			    int node);
 struct sk_buff *__build_skb(void *data, unsigned int frag_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 26d848484912..23c9cf100bd4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -160,6 +160,34 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
  *
  */
 
+void __init_skb(struct sk_buff *skb, u8 *data, unsigned int size)
+{
+	/* Only clear those fields we need to clear, not those that we will
+	 * actually initialise below. Hence, don't put any more fields after
+	 * the tail pointer in struct sk_buff!
+	 */
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	/* Account for allocated memory : skb + skb->head */
+	skb->truesize = SKB_TRUESIZE(size);
+	refcount_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+	skb->mac_header = (typeof(skb->mac_header))~0U;
+	skb->transport_header = (typeof(skb->transport_header))~0U;
+}
+
+static inline void __init_skb_shinfo(struct sk_buff *skb)
+{
+	struct skb_shared_info *shinfo;
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+}
+
 /**
  *	__alloc_skb	-	allocate a network buffer
  *	@size: size to allocate
@@ -181,7 +209,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 			    int flags, int node)
 {
 	struct kmem_cache *cache;
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	u8 *data;
 	bool pfmemalloc;
@@ -215,27 +242,9 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	size = SKB_WITH_OVERHEAD(ksize(data));
 	prefetchw(data + size);
 
-	/*
-	 * Only clear those fields we need to clear, not those that we will
-	 * actually initialise below. Hence, don't put any more fields after
-	 * the tail pointer in struct sk_buff!
-	 */
-	memset(skb, 0, offsetof(struct sk_buff, tail));
-	/* Account for allocated memory : skb + skb->head */
-	skb->truesize = SKB_TRUESIZE(size);
+	__init_skb(skb, data, size);
+	__init_skb_shinfo(skb);
 	skb->pfmemalloc = pfmemalloc;
-	refcount_set(&skb->users, 1);
-	skb->head = data;
-	skb->data = data;
-	skb_reset_tail_pointer(skb);
-	skb->end = skb->tail + size;
-	skb->mac_header = (typeof(skb->mac_header))~0U;
-	skb->transport_header = (typeof(skb->transport_header))~0U;
-
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
 
 	if (flags & SKB_ALLOC_FCLONE) {
 		struct sk_buff_fclones *fclones;
@@ -277,7 +286,6 @@ EXPORT_SYMBOL(__alloc_skb);
  */
 struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 {
-	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
@@ -287,20 +295,8 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 
 	size -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	memset(skb, 0, offsetof(struct sk_buff, tail));
-	skb->truesize = SKB_TRUESIZE(size);
-	refcount_set(&skb->users, 1);
-	skb->head = data;
-	skb->data = data;
-	skb_reset_tail_pointer(skb);
-	skb->end = skb->tail + size;
-	skb->mac_header = (typeof(skb->mac_header))~0U;
-	skb->transport_header = (typeof(skb->transport_header))~0U;
-
-	/* make sure we initialize shinfo sequentially */
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
+	__init_skb(skb, data, size);
+	__init_skb_shinfo(skb);
 
 	return skb;
 }
-- 
2.20.1.611.gfbb209baf1-goog



* [RFC bpf-next 2/7] net: introduce skb_net helper
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
  2019-02-05 17:36 ` [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 17:36 ` [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

skb_net returns the network namespace from the associated device or socket.

This will be used in the next commit.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ad883ab2762c..28723a86efdf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4343,5 +4343,7 @@ static inline __wsum lco_csum(struct sk_buff *skb)
 	return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
 }
 
+struct net *skb_net(const struct sk_buff *skb);
+
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 23c9cf100bd4..016db13fa2b6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5585,6 +5585,16 @@ void skb_condense(struct sk_buff *skb)
 	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 }
 
+struct net *skb_net(const struct sk_buff *skb)
+{
+	if (skb->dev)
+		return dev_net(skb->dev);
+	else if (skb->sk)
+		return sock_net(skb->sk);
+	return NULL;
+}
+EXPORT_SYMBOL(skb_net);
+
 #ifdef CONFIG_SKB_EXTENSIONS
 static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
 {
-- 
2.20.1.611.gfbb209baf1-goog



* [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
  2019-02-05 17:36 ` [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers Stanislav Fomichev
  2019-02-05 17:36 ` [RFC bpf-next 2/7] net: introduce skb_net helper Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 17:36 ` [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case Stanislav Fomichev
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

This new argument will be used in the next patches for the
eth_get_headlen use case. eth_get_headlen calls the flow dissector
with only the data (without an skb), so there is currently no way to
look up an attached BPF flow dissector program. With this new argument,
we can amend the callers to explicitly pass the network namespace, so
the attached BPF program can be used.

Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h    | 15 +++++++++------
 net/core/flow_dissector.c | 20 +++++++++++---------
 net/ethernet/eth.c        |  5 +++--
 3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 28723a86efdf..aa9a9983de80 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1227,7 +1227,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 			    const struct sk_buff *skb,
 			    struct flow_dissector *flow_dissector,
 			    struct bpf_flow_keys *flow_keys);
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(struct net *net,
+			const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
 			void *data, __be16 proto, int nhoff, int hlen,
@@ -1237,7 +1238,7 @@ static inline bool skb_flow_dissect(const struct sk_buff *skb,
 				    struct flow_dissector *flow_dissector,
 				    void *target_container, unsigned int flags)
 {
-	return __skb_flow_dissect(skb, flow_dissector, target_container,
+	return __skb_flow_dissect(NULL, skb, flow_dissector, target_container,
 				  NULL, 0, 0, 0, flags);
 }
 
@@ -1246,18 +1247,19 @@ static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb,
 					      unsigned int flags)
 {
 	memset(flow, 0, sizeof(*flow));
-	return __skb_flow_dissect(skb, &flow_keys_dissector, flow,
+	return __skb_flow_dissect(NULL, skb, &flow_keys_dissector, flow,
 				  NULL, 0, 0, 0, flags);
 }
 
 static inline bool
-skb_flow_dissect_flow_keys_basic(const struct sk_buff *skb,
+skb_flow_dissect_flow_keys_basic(struct net *net,
+				 const struct sk_buff *skb,
 				 struct flow_keys_basic *flow, void *data,
 				 __be16 proto, int nhoff, int hlen,
 				 unsigned int flags)
 {
 	memset(flow, 0, sizeof(*flow));
-	return __skb_flow_dissect(skb, &flow_keys_basic_dissector, flow,
+	return __skb_flow_dissect(net, skb, &flow_keys_basic_dissector, flow,
 				  data, proto, nhoff, hlen, flags);
 }
 
@@ -2438,7 +2440,8 @@ static inline void skb_probe_transport_header(struct sk_buff *skb,
 	if (skb_transport_header_was_set(skb))
 		return;
 
-	if (skb_flow_dissect_flow_keys_basic(skb, &keys, NULL, 0, 0, 0, 0))
+	if (skb_flow_dissect_flow_keys_basic(NULL, skb, &keys,
+					     NULL, 0, 0, 0, 0))
 		skb_set_transport_header(skb, keys.control.thoff);
 	else
 		skb_set_transport_header(skb, offset_hint);
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index bb1a54747d64..dddcc37c0462 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -725,6 +725,7 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 
 /**
  * __skb_flow_dissect - extract the flow_keys struct and return it
+ * @net: associated network namespace, if NULL pulled from skb
  * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
  * @flow_dissector: list of keys to dissect
  * @target_container: target structure to put dissected values into
@@ -739,7 +740,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
  *
  * Caller must take care of zeroing target container memory.
  */
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(struct net *net,
+			const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
 			void *target_container,
 			void *data, __be16 proto, int nhoff, int hlen,
@@ -799,12 +801,11 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 
 		rcu_read_lock();
 
-		if (skb->dev)
-			attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
-		else if (skb->sk)
-			attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
-		else
-			WARN_ON_ONCE(1);
+		if (!net && skb)
+			net = skb_net(skb);
+		if (net)
+			attached = rcu_dereference(net->flow_dissector_prog);
+		WARN_ON_ONCE(!net);
 
 		if (attached) {
 			ret = __skb_flow_bpf_dissect(attached, skb,
@@ -1406,7 +1407,7 @@ u32 __skb_get_hash_symmetric(const struct sk_buff *skb)
 	__flow_hash_secret_init();
 
 	memset(&keys, 0, sizeof(keys));
-	__skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys,
+	__skb_flow_dissect(NULL, skb, &flow_keys_dissector_symmetric, &keys,
 			   NULL, 0, 0, 0,
 			   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
 
@@ -1508,7 +1509,8 @@ u32 skb_get_poff(const struct sk_buff *skb)
 {
 	struct flow_keys_basic keys;
 
-	if (!skb_flow_dissect_flow_keys_basic(skb, &keys, NULL, 0, 0, 0, 0))
+	if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys,
+					      NULL, 0, 0, 0, 0))
 		return 0;
 
 	return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 4c520110b04f..155d55025bfc 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -136,8 +136,9 @@ u32 eth_get_headlen(void *data, unsigned int len)
 		return len;
 
 	/* parse any remaining L2/L3 headers, check for L4 */
-	if (!skb_flow_dissect_flow_keys_basic(NULL, &keys, data, eth->h_proto,
-					      sizeof(*eth), len, flags))
+	if (!skb_flow_dissect_flow_keys_basic(NULL, NULL, &keys, data,
+					      eth->h_proto, sizeof(*eth),
+					      len, flags))
 		return max_t(u32, keys.control.thoff, sizeof(*eth));
 
 	/* parse for any L4 headers */
-- 
2.20.1.611.gfbb209baf1-goog



* [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2019-02-05 17:36 ` [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 17:36 ` [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

When the flow dissector is called without an skb (with only data and
hlen), construct an on-stack skb (which has a linear chunk of the data
passed to the flow dissector). This lets us handle the eth_get_headlen
case, where only data is provided and we don't want to (yet) allocate
an skb.

Since this on-stack skb doesn't allocate its own data, we can't
add shinfo and need to be careful to avoid any code paths that use
it. Flow dissector BPF programs can only call the bpf_skb_load_bytes
helper, which doesn't touch shinfo in our case (skb->len is the length
of the linear header, so it exits early).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/skbuff.h    |  5 +++
 net/core/flow_dissector.c | 95 +++++++++++++++++++++++++++++----------
 2 files changed, 76 insertions(+), 24 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index aa9a9983de80..5f1c085cb34c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1227,6 +1227,11 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 			    const struct sk_buff *skb,
 			    struct flow_dissector *flow_dissector,
 			    struct bpf_flow_keys *flow_keys);
+bool __flow_bpf_dissect(struct bpf_prog *prog,
+			void *data, __be16 proto,
+			int nhoff, int hlen,
+			struct flow_dissector *flow_dissector,
+			struct bpf_flow_keys *flow_keys);
 bool __skb_flow_dissect(struct net *net,
 			const struct sk_buff *skb,
 			struct flow_dissector *flow_dissector,
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index dddcc37c0462..87167b74f59a 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -683,6 +683,28 @@ static void __skb_flow_bpf_to_target(const struct bpf_flow_keys *flow_keys,
 	}
 }
 
+static inline void init_flow_keys(struct bpf_flow_keys *flow_keys,
+				  struct sk_buff *skb, int nhoff)
+{
+	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;
+
+	memset(cb, 0, sizeof(*cb));
+	memset(flow_keys, 0, sizeof(*flow_keys));
+
+	flow_keys->nhoff = nhoff;
+	flow_keys->thoff = nhoff;
+
+	cb->qdisc_cb.flow_keys = flow_keys;
+}
+
+static inline void clamp_flow_keys(struct bpf_flow_keys *flow_keys,
+				   int hlen)
+{
+	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, 0, hlen);
+	flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
+				   flow_keys->nhoff, hlen);
+}
+
 bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 			    const struct sk_buff *skb,
 			    struct flow_dissector *flow_dissector,
@@ -702,13 +724,9 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 
 	/* Save Control Block */
 	memcpy(&cb_saved, cb, sizeof(cb_saved));
-	memset(cb, 0, sizeof(*cb));
 
 	/* Pass parameters to the BPF program */
-	memset(flow_keys, 0, sizeof(*flow_keys));
-	cb->qdisc_cb.flow_keys = flow_keys;
-	flow_keys->nhoff = skb_network_offset(skb);
-	flow_keys->thoff = flow_keys->nhoff;
+	init_flow_keys(flow_keys, skb, skb_network_offset(skb));
 
 	bpf_compute_data_pointers((struct sk_buff *)skb);
 	result = BPF_PROG_RUN(prog, skb);
@@ -716,9 +734,34 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
 	/* Restore state */
 	memcpy(cb, &cb_saved, sizeof(cb_saved));
 
-	flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, 0, skb->len);
-	flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
-				   flow_keys->nhoff, skb->len);
+	clamp_flow_keys(flow_keys, skb->len);
+
+	return result == BPF_OK;
+}
+
+bool __flow_bpf_dissect(struct bpf_prog *prog,
+			void *data, __be16 proto,
+			int nhoff, int hlen,
+			struct flow_dissector *flow_dissector,
+			struct bpf_flow_keys *flow_keys)
+{
+	struct bpf_skb_data_end *cb;
+	struct sk_buff skb;
+	u32 result;
+
+	__init_skb(&skb, data, hlen);
+	skb_put(&skb, hlen);
+	skb.protocol = proto;
+
+	init_flow_keys(flow_keys, &skb, nhoff);
+
+	cb = (struct bpf_skb_data_end *)skb.cb;
+	cb->data_meta = skb.data;
+	cb->data_end  = skb.data + skb_headlen(&skb);
+
+	result = BPF_PROG_RUN(prog, &skb);
+
+	clamp_flow_keys(flow_keys, hlen);
 
 	return result == BPF_OK;
 }
@@ -754,8 +797,10 @@ bool __skb_flow_dissect(struct net *net,
 	struct flow_dissector_key_icmp *key_icmp;
 	struct flow_dissector_key_tags *key_tags;
 	struct flow_dissector_key_vlan *key_vlan;
-	enum flow_dissect_ret fdret;
 	enum flow_dissector_key_id dissector_vlan = FLOW_DISSECTOR_KEY_MAX;
+	struct bpf_prog *attached = NULL;
+	struct bpf_flow_keys flow_keys;
+	enum flow_dissect_ret fdret;
 	int num_hdrs = 0;
 	u8 ip_proto = 0;
 	bool ret;
@@ -795,30 +840,32 @@ bool __skb_flow_dissect(struct net *net,
 					      FLOW_DISSECTOR_KEY_BASIC,
 					      target_container);
 
-	if (skb) {
-		struct bpf_flow_keys flow_keys;
-		struct bpf_prog *attached = NULL;
+	rcu_read_lock();
 
-		rcu_read_lock();
+	if (!net && skb)
+		net = skb_net(skb);
+	if (net)
+		attached = rcu_dereference(net->flow_dissector_prog);
 
-		if (!net && skb)
-			net = skb_net(skb);
-		if (net)
-			attached = rcu_dereference(net->flow_dissector_prog);
-		WARN_ON_ONCE(!net);
+	WARN_ON_ONCE(!net);
 
-		if (attached) {
+	if (attached) {
+		if (skb)
 			ret = __skb_flow_bpf_dissect(attached, skb,
 						     flow_dissector,
 						     &flow_keys);
-			__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
-						 target_container);
-			rcu_read_unlock();
-			return ret;
-		}
+		else
+			ret = __flow_bpf_dissect(attached, data, proto, nhoff,
+						 hlen, flow_dissector,
+						 &flow_keys);
+		__skb_flow_bpf_to_target(&flow_keys, flow_dissector,
+					 target_container);
 		rcu_read_unlock();
+		return ret;
 	}
 
+	rcu_read_unlock();
+
 	if (dissector_uses_key(flow_dissector,
 			       FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
 		struct ethhdr *eth = eth_hdr(skb);
-- 
2.20.1.611.gfbb209baf1-goog



* [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2019-02-05 17:36 ` [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 17:36 ` [RFC bpf-next 6/7] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test Stanislav Fomichev
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

Now that we have __flow_bpf_dissect, which works on raw data (by
constructing a temporary on-stack skb), use it when doing
BPF_PROG_TEST_RUN for the flow dissector.

This should help us catch any possible bugs due to the missing shinfo
on the on-stack skb.

Note that the existing __skb_flow_bpf_dissect swallows the L2 headers
and returns nhoff=0; we need to preserve that behavior.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 net/bpf/test_run.c | 52 +++++++++++++++-------------------------------
 1 file changed, 17 insertions(+), 35 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2c5172b33209..502ae0e866d3 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -249,10 +249,8 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	u32 repeat = kattr->test.repeat;
 	struct bpf_flow_keys flow_keys;
 	u64 time_start, time_spent = 0;
-	struct bpf_skb_data_end *cb;
+	const struct ethhdr *eth;
 	u32 retval, duration;
-	struct sk_buff *skb;
-	struct sock *sk;
 	void *data;
 	int ret;
 	u32 i;
@@ -260,35 +258,14 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (prog->type != BPF_PROG_TYPE_FLOW_DISSECTOR)
 		return -EINVAL;
 
-	data = bpf_test_init(kattr, size, NET_SKB_PAD + NET_IP_ALIGN,
-			     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
+	if (size < ETH_HLEN)
+		return -EINVAL;
+
+	data = bpf_test_init(kattr, size, 0, 0);
 	if (IS_ERR(data))
 		return PTR_ERR(data);
 
-	sk = kzalloc(sizeof(*sk), GFP_USER);
-	if (!sk) {
-		kfree(data);
-		return -ENOMEM;
-	}
-	sock_net_set(sk, current->nsproxy->net_ns);
-	sock_init_data(NULL, sk);
-
-	skb = build_skb(data, 0);
-	if (!skb) {
-		kfree(data);
-		kfree(sk);
-		return -ENOMEM;
-	}
-	skb->sk = sk;
-
-	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
-	__skb_put(skb, size);
-	skb->protocol = eth_type_trans(skb,
-				       current->nsproxy->net_ns->loopback_dev);
-	skb_reset_network_header(skb);
-
-	cb = (struct bpf_skb_data_end *)skb->cb;
-	cb->qdisc_cb.flow_keys = &flow_keys;
+	eth = (struct ethhdr *)data;
 
 	if (!repeat)
 		repeat = 1;
@@ -297,9 +274,15 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	for (i = 0; i < repeat; i++) {
 		preempt_disable();
 		rcu_read_lock();
-		retval = __skb_flow_bpf_dissect(prog, skb,
-						&flow_keys_dissector,
-						&flow_keys);
+		retval = __flow_bpf_dissect(prog, data,
+					    eth->h_proto, ETH_HLEN,
+					    size,
+					    &flow_keys_dissector,
+					    &flow_keys);
+		if (flow_keys.nhoff >= ETH_HLEN)
+			flow_keys.nhoff -= ETH_HLEN;
+		if (flow_keys.thoff >= ETH_HLEN)
+			flow_keys.thoff -= ETH_HLEN;
 		rcu_read_unlock();
 		preempt_enable();
 
@@ -317,8 +300,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 
 	ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
 			      retval, duration);
-
-	kfree_skb(skb);
-	kfree(sk);
+	kfree(data);
 	return ret;
+
 }
-- 
2.20.1.611.gfbb209baf1-goog



* [RFC bpf-next 6/7] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2019-02-05 17:36 ` [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 17:36 ` [RFC bpf-next 7/7] net: flow_dissector: pass net argument to the eth_get_headlen Stanislav Fomichev
  2019-02-05 20:18 ` [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Willem de Bruijn
  7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

With the on-stack skb, we want to make sure we don't trigger any
shinfo access. Add a small test which tries to read data past
the packet boundary.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_progs.c | 49 ++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index c52bd90fbb34..c12f61efc427 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1995,6 +1995,54 @@ static void test_flow_dissector(void)
 	bpf_object__close(obj);
 }
 
+static void test_flow_dissector_load_bytes(void)
+{
+	struct bpf_flow_keys flow_keys;
+	__u32 duration, retval, size;
+	struct bpf_insn prog[] = {
+		// BPF_REG_1 - 1st argument: context
+		// BPF_REG_2 - 2nd argument: offset, start at last byte + 1
+		BPF_MOV64_IMM(BPF_REG_2, sizeof(pkt_v4)),
+		// BPF_REG_3 - 3rd argument: destination, reserve byte on stack
+		BPF_ALU64_REG(BPF_MOV, BPF_REG_3, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -1),
+		// BPF_REG_4 - 4th argument: copy one byte
+		BPF_MOV64_IMM(BPF_REG_4, 1),
+		// bpf_skb_load_bytes(ctx, sizeof(pkt_v4), ptr, 1)
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+			     BPF_FUNC_skb_load_bytes),
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+		// if (ret == 0) return BPF_DROP (2)
+		BPF_MOV64_IMM(BPF_REG_0, BPF_DROP),
+		BPF_EXIT_INSN(),
+		// if (ret != 0) return BPF_OK (0)
+		BPF_MOV64_IMM(BPF_REG_0, BPF_OK),
+		BPF_EXIT_INSN(),
+	};
+	int fd, err;
+
+	/* make sure bpf_skb_load_bytes helper doesn't cause any
+	 * problems when used with the fake skb in the flow
+	 * dissector (try to read past the last byte)
+	 */
+	fd = bpf_load_program(BPF_PROG_TYPE_FLOW_DISSECTOR, prog,
+			      ARRAY_SIZE(prog), "GPL", 0, NULL, 0);
+	CHECK(fd < 0,
+	      "flow_dissector-bpf_skb_load_bytes-load",
+	      "fd %d errno %d\n",
+	      fd, errno);
+
+	err = bpf_prog_test_run(fd, 1, &pkt_v4, sizeof(pkt_v4),
+				&flow_keys, &size, &retval, &duration);
+	CHECK(size != sizeof(flow_keys) || err || retval != 1,
+	      "flow_dissector-bpf_skb_load_bytes",
+	      "err %d errno %d retval %d duration %d size %u/%lu\n",
+	      err, errno, retval, duration, size, sizeof(flow_keys));
+
+	if (fd >= 0)
+		close(fd);
+}
+
 static void *test_spin_lock(void *arg)
 {
 	__u32 duration, retval;
@@ -2136,6 +2184,7 @@ int main(void)
 	test_queue_stack_map(QUEUE);
 	test_queue_stack_map(STACK);
 	test_flow_dissector();
+	test_flow_dissector_load_bytes();
 	test_spinlock();
 	test_map_lock();
 
-- 
2.20.1.611.gfbb209baf1-goog



* [RFC bpf-next 7/7] net: flow_dissector: pass net argument to the eth_get_headlen
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2019-02-05 17:36 ` [RFC bpf-next 6/7] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test Stanislav Fomichev
@ 2019-02-05 17:36 ` Stanislav Fomichev
  2019-02-05 20:18 ` [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Willem de Bruijn
  7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 17:36 UTC (permalink / raw)
  To: netdev; +Cc: davem, ast, daniel, simon.horman, willemb, Stanislav Fomichev

This is an example for RFC purposes only, just to show what the end
result would look like. In a proper submission, we need to introduce a
new helper and convert each driver one by one to make sure we don't
break 'git bisect'.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c     | 3 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c   | 3 ++-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c     | 2 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c       | 3 ++-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c       | 2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c         | 2 +-
 drivers/net/ethernet/intel/igb/igb_main.c         | 2 +-
 drivers/net/ethernet/intel/igc/igc_main.c         | 2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 2 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   | 3 ++-
 drivers/net/tun.c                                 | 3 ++-
 include/linux/etherdevice.h                       | 2 +-
 net/ethernet/eth.c                                | 5 +++--
 15 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6a512871176b..efc3a0455967 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -889,7 +889,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 			     DMA_ATTR_WEAK_ORDERING);
 
 	if (unlikely(!payload))
-		payload = eth_get_headlen(data_ptr, len);
+		payload = eth_get_headlen(bp->dev, data_ptr, len);
 
 	skb = napi_alloc_skb(&rxr->bnapi->napi, payload);
 	if (!skb) {
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index 60e7d7ae3787..caf075a7449c 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -603,7 +603,8 @@ static int hns_nic_poll_rx_skb(struct hns_nic_ring_data *ring_data,
 	} else {
 		ring->stats.seg_pkt_cnt++;
 
-		pull_len = eth_get_headlen(va, HNS_RX_HEAD_SIZE);
+		pull_len = eth_get_headlen(dev_net(ndev), va,
+					   HNS_RX_HEAD_SIZE);
 		memcpy(__skb_put(skb, pull_len), va,
 		       ALIGN(pull_len, sizeof(long)));
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 40b17223ee41..81308c7c7d1e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -2433,7 +2433,8 @@ static int hns3_alloc_skb(struct hns3_enet_ring *ring, int length,
 	ring->stats.seg_pkt_cnt++;
 	u64_stats_update_end(&ring->syncp);
 
-	ring->pull_len = eth_get_headlen(va, HNS3_RX_HEAD_SIZE);
+	ring->pull_len = eth_get_headlen(dev_net(netdev), va,
+					 HNS3_RX_HEAD_SIZE);
 	__skb_put(skb, ring->pull_len);
 	hns3_nic_reuse_page(skb, ring->frag_num++, ring, ring->pull_len,
 			    desc_cb);
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 6fd15a734324..b52dfad58b71 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -278,7 +278,7 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer *rx_buffer,
 	/* we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, FM10K_RX_HDR_LEN);
+	pull_len = eth_get_headlen(skb_net(skb), va, FM10K_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, pull_len), va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index a7e14e98889f..7990f80736e2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2035,7 +2035,8 @@ static struct sk_buff *i40e_construct_skb(struct i40e_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > I40E_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, I40E_RX_HDR_SIZE);
+		headlen = eth_get_headlen(skb_net(skb), xdp->data,
+					  I40E_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), xdp->data,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 9b4d7cec2e18..d3c7fd521fd7 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1315,7 +1315,7 @@ static struct sk_buff *iavf_construct_skb(struct iavf_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IAVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(va, IAVF_RX_HDR_SIZE);
+		headlen = eth_get_headlen(skb_net(skb), va, IAVF_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 2357fcac996b..0876ddbc495c 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -708,7 +708,7 @@ static void ice_pull_tail(struct sk_buff *skb)
 	/* we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, ICE_RX_HDR_SIZE);
+	pull_len = eth_get_headlen(skb_net(skb), va, ICE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index dfa357b1a9d6..924d91696a01 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -8064,7 +8064,7 @@ static struct sk_buff *igb_construct_skb(struct igb_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IGB_RX_HDR_LEN)
-		headlen = eth_get_headlen(va, IGB_RX_HDR_LEN);
+		headlen = eth_get_headlen(skb_net(skb), va, IGB_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index f20183037fb2..71e14f309189 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -1148,7 +1148,7 @@ static struct sk_buff *igc_construct_skb(struct igc_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IGC_RX_HDR_LEN)
-		headlen = eth_get_headlen(va, IGC_RX_HDR_LEN);
+		headlen = eth_get_headlen(skb_net(skb), va, IGC_RX_HDR_LEN);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), va, ALIGN(headlen, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b53087a980ef..9e9e6e5e679a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1799,7 +1799,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
 	 * we need the header to contain the greater of either ETH_HLEN or
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
-	pull_len = eth_get_headlen(va, IXGBE_RX_HDR_SIZE);
+	pull_len = eth_get_headlen(skb_net(skb), va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 49e23afa05a2..c45be7487d57 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -895,7 +895,8 @@ struct sk_buff *ixgbevf_construct_skb(struct ixgbevf_ring *rx_ring,
 	/* Determine available headroom for copy */
 	headlen = size;
 	if (headlen > IXGBEVF_RX_HDR_SIZE)
-		headlen = eth_get_headlen(xdp->data, IXGBEVF_RX_HDR_SIZE);
+		headlen = eth_get_headlen(skb_net(skb), xdp->data,
+					  IXGBEVF_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	memcpy(__skb_put(skb, headlen), xdp->data,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 598ad7e4d5c9..b9fc28787f44 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -167,7 +167,8 @@ static inline u16 mlx5e_calc_min_inline(enum mlx5_inline_modes mode,
 	case MLX5_INLINE_MODE_NONE:
 		return 0;
 	case MLX5_INLINE_MODE_TCP_UDP:
-		hlen = eth_get_headlen(skb->data, skb_headlen(skb));
+		hlen = eth_get_headlen(skb_net(skb), skb->data,
+				       skb_headlen(skb));
 		if (hlen == ETH_HLEN && !skb_vlan_tag_present(skb))
 			hlen += VLAN_HLEN;
 		break;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 18656c4094b3..26063a8daf75 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1961,7 +1961,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 
 	if (frags) {
 		/* Exercise flow dissector code path. */
-		u32 headlen = eth_get_headlen(skb->data, skb_headlen(skb));
+		u32 headlen = eth_get_headlen(skb_net(skb), skb->data,
+					      skb_headlen(skb));
 
 		if (unlikely(headlen > skb_headlen(skb))) {
 			this_cpu_inc(tun->pcpu_stats->rx_dropped);
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 2c0af7b00715..2e2b5ed30bfc 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -33,7 +33,7 @@ struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
 unsigned char *arch_get_platform_mac_address(void);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf);
-u32 eth_get_headlen(void *data, unsigned int max_len);
+u32 eth_get_headlen(struct net *net, void *data, unsigned int max_len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;
 
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 155d55025bfc..6fbfd6e41548 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -119,13 +119,14 @@ EXPORT_SYMBOL(eth_header);
 
 /**
  * eth_get_headlen - determine the length of header for an ethernet frame
+ * @net: pointer to device network namespace
  * @data: pointer to start of frame
  * @len: total length of frame
  *
  * Make a best effort attempt to pull the length for all of the headers for
  * a given frame in a linear buffer.
  */
-u32 eth_get_headlen(void *data, unsigned int len)
+u32 eth_get_headlen(struct net *net, void *data, unsigned int len)
 {
 	const unsigned int flags = FLOW_DISSECTOR_F_PARSE_1ST_FRAG;
 	const struct ethhdr *eth = (const struct ethhdr *)data;
@@ -136,7 +137,7 @@ u32 eth_get_headlen(void *data, unsigned int len)
 		return len;
 
 	/* parse any remaining L2/L3 headers, check for L4 */
-	if (!skb_flow_dissect_flow_keys_basic(NULL, NULL, &keys, data,
+	if (!skb_flow_dissect_flow_keys_basic(net, NULL, &keys, data,
 					      eth->h_proto, sizeof(*eth),
 					      len, flags))
 		return max_t(u32, keys.control.thoff, sizeof(*eth));
-- 
2.20.1.611.gfbb209baf1-goog



* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2019-02-05 17:36 ` [RFC bpf-next 7/7] net: flow_dissector: pass net argument to the eth_get_headlen Stanislav Fomichev
@ 2019-02-05 20:18 ` Willem de Bruijn
  2019-02-05 20:40   ` Stanislav Fomichev
  7 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> skb. Because we use passed skb to lookup associated networking namespace
> to find whether we have a BPF program attached or not, we always use
> C-based flow dissector in this case.
>
> The goal of this patch series is to add new networking namespace argument
> to the eth_get_headlen and make BPF flow dissector programs be able to
> work in the skb-less case.
>
> The series goes like this:
> 1. introduce __init_skb and __init_skb_shinfo; those will be used to
>    initialize temporary skb
> 2. introduce skb_net which can be used to get networking namespace
>    associated with an skb
> 3. add new optional network namespace argument to __skb_flow_dissect and
>    plumb through the callers
> 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
>    (using __init_skb) and calls BPF flow dissector program

The main concern I see with this series is this cost of skb zeroing
for every packet in the device driver receive routine, *independent*
from the real skb allocation and zeroing which will likely happen
later.


* Re: [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers
  2019-02-05 17:36 ` [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers Stanislav Fomichev
@ 2019-02-05 20:18   ` Willem de Bruijn
  0 siblings, 0 replies; 30+ messages in thread
From: Willem de Bruijn @ 2019-02-05 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> __init_skb is essentially a version of __build_skb which accepts skb as
> an argument (instead of doing kmem_cache_alloc to allocate it).
>
> __init_skb_shinfo initializes shinfo.
>
> No functional changes.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

Nice code deduplication :-)


* Re: [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect
  2019-02-05 17:36 ` [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
@ 2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 20:40     ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> This new argument will be used in the next patches for the
> eth_get_headlen use case. eth_get_headlen calls flow dissector
> with only data (without skb) so there is currently no way to
> pull attached BPF flow dissector program. With this new argument,
> we can amend the callers to explicitly pass network namespace
> so we can use attached BPF program.
>
> Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

>  /**
>   * __skb_flow_dissect - extract the flow_keys struct and return it
> + * @net: associated network namespace, if NULL pulled from skb
>   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
>   * @flow_dissector: list of keys to dissect
>   * @target_container: target structure to put dissected values into
> @@ -739,7 +740,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
>   *
>   * Caller must take care of zeroing target container memory.
>   */
> -bool __skb_flow_dissect(const struct sk_buff *skb,
> +bool __skb_flow_dissect(struct net *net,
> +                       const struct sk_buff *skb,
>                         struct flow_dissector *flow_dissector,
>                         void *target_container,
>                         void *data, __be16 proto, int nhoff, int hlen,
> @@ -799,12 +801,11 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
>
>                 rcu_read_lock();
>
> -               if (skb->dev)
> -                       attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
> -               else if (skb->sk)
> -                       attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
> -               else
> -                       WARN_ON_ONCE(1);
> +               if (!net && skb)
> +                       net = skb_net(skb);
> +               if (net)
> +                       attached = rcu_dereference(net->flow_dissector_prog);
> +               WARN_ON_ONCE(!net);

Instead of this just call skb_net(skb) in all callers of
__skb_flow_dissect that are called with an skb argument directly?

It may have to be able to handle skb == NULL args.


* Re: [RFC bpf-next 2/7] net: introduce skb_net helper
  2019-02-05 17:36 ` [RFC bpf-next 2/7] net: introduce skb_net helper Stanislav Fomichev
@ 2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 20:42     ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> skb_net returns network namespace from the associated device or socket.
>
> This will be used in the next commit.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h |  2 ++
>  net/core/skbuff.c      | 10 ++++++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ad883ab2762c..28723a86efdf 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4343,5 +4343,7 @@ static inline __wsum lco_csum(struct sk_buff *skb)
>         return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
>  }
>
> +struct net *skb_net(const struct sk_buff *skb);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_SKBUFF_H */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 23c9cf100bd4..016db13fa2b6 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5585,6 +5585,16 @@ void skb_condense(struct sk_buff *skb)
>         skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
>  }
>
> +struct net *skb_net(const struct sk_buff *skb)
> +{
> +       if (skb->dev)
> +               return dev_net(skb->dev);
> +       else if (skb->sk)
> +               return sock_net(skb->sk);
> +       return NULL;
> +}
> +EXPORT_SYMBOL(skb_net);

If this needs a helper it is probably better static inline in the header.


* Re: [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case
  2019-02-05 17:36 ` [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case Stanislav Fomichev
@ 2019-02-05 20:19   ` Willem de Bruijn
  2019-02-05 20:45     ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> When flow_dissector is called without skb (with only data and hlen),
> construct on-stack skb (which has a linear chunk of data passed
> to the flow dissector). This should let us handle eth_get_headlen
> case where only data is provided and we don't want to (yet) allocate
> an skb.
>
> Since this on-stack skb doesn't allocate its own data, we can't
> add shinfo and need to be careful to avoid any code paths that use
> it. Flow dissector BPF programs can only call bpf_skb_load_bytes helper,
> which doesn't touch shinfo in our case (skb->len is the length of the
> linear header so it exits early).
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/skbuff.h    |  5 +++
>  net/core/flow_dissector.c | 95 +++++++++++++++++++++++++++++----------
>  2 files changed, 76 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index aa9a9983de80..5f1c085cb34c 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1227,6 +1227,11 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
>                             const struct sk_buff *skb,
>                             struct flow_dissector *flow_dissector,
>                             struct bpf_flow_keys *flow_keys);
> +bool __flow_bpf_dissect(struct bpf_prog *prog,
> +                       void *data, __be16 proto,
> +                       int nhoff, int hlen,
> +                       struct flow_dissector *flow_dissector,
> +                       struct bpf_flow_keys *flow_keys);

nit: please use more descriptive name. Perhaps bpf_flow_dissect_raw
and rename __skb_flow_bpf_dissect to bpf_flow_dissect_skb.

> +bool __flow_bpf_dissect(struct bpf_prog *prog,
> +                       void *data, __be16 proto,
> +                       int nhoff, int hlen,
> +                       struct flow_dissector *flow_dissector,
> +                       struct bpf_flow_keys *flow_keys)
> +{
> +       struct bpf_skb_data_end *cb;
> +       struct sk_buff skb;
> +       u32 result;
> +
> +       __init_skb(&skb, data, hlen);
> +       skb_put(&skb, hlen);
> +       skb.protocol = proto;
> +
> +       init_flow_keys(flow_keys, &skb, nhoff);
> +
> +       cb = (struct bpf_skb_data_end *)skb.cb;
> +       cb->data_meta = skb.data;
> +       cb->data_end  = skb.data + skb_headlen(&skb);
> +
> +       result = BPF_PROG_RUN(prog, &skb);
> +
> +       clamp_flow_keys(flow_keys, hlen);
>
>         return result == BPF_OK;
>  }

Can __flow_bpf_dissect just construct an skb and then call
__skb_flow_bpf_dissect?

It will unnecessarily save and restore the control block, but that is
a relatively small cost (compared to, say, zeroing the entire skb).


* Re: [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode
  2019-02-05 17:36 ` [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
@ 2019-02-05 20:19   ` Willem de Bruijn
  0 siblings, 0 replies; 30+ messages in thread
From: Willem de Bruijn @ 2019-02-05 20:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Network Development, David Miller, Alexei Starovoitov,
	Daniel Borkmann, simon.horman, Willem de Bruijn

On Tue, Feb 5, 2019 at 12:58 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> Now that we have __flow_bpf_dissect which works on raw data (by
> constructing temporary on-stack skb), use it when doing
> BPF_PROG_TEST_RUN for flow dissector.
>
> This should help us catch any possible bugs due to missing shinfo on
> the on-stack skb.
>
> Note that existing __skb_flow_bpf_dissect swallows L2 headers and returns
> nhoff=0, we need to preserve the existing behavior.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  net/bpf/test_run.c | 52 +++++++++++++++-------------------------------
>  1 file changed, 17 insertions(+), 35 deletions(-)
>

>         ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
>                               retval, duration);
> -
> -       kfree_skb(skb);
> -       kfree(sk);
> +       kfree(data);
>         return ret;
> +

nit: unnecessary whitespace line

>  }


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-05 20:18 ` [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Willem de Bruijn
@ 2019-02-05 20:40   ` Stanislav Fomichev
  2019-02-06  0:47     ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 20:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> > skb. Because we use passed skb to lookup associated networking namespace
> > to find whether we have a BPF program attached or not, we always use
> > C-based flow dissector in this case.
> >
> > The goal of this patch series is to add new networking namespace argument
> > to the eth_get_headlen and make BPF flow dissector programs be able to
> > work in the skb-less case.
> >
> > The series goes like this:
> > 1. introduce __init_skb and __init_skb_shinfo; those will be used to
> >    initialize temporary skb
> > 2. introduce skb_net which can be used to get networking namespace
> >    associated with an skb
> > 3. add new optional network namespace argument to __skb_flow_dissect and
> >    plumb through the callers
> > 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
> >    (using __init_skb) and calls BPF flow dissector program
> 
> The main concern I see with this series is this cost of skb zeroing
> for every packet in the device driver receive routine, *independent*
> from the real skb allocation and zeroing which will likely happen
> later.
Yes, plus ~200 bytes on the stack for the callers.

Not sure how visible this zeroing is, though. I can probably try to get
some numbers from BPF_PROG_TEST_RUN (running the current version vs
running with an on-stack skb).


* Re: [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect
  2019-02-05 20:19   ` Willem de Bruijn
@ 2019-02-05 20:40     ` Stanislav Fomichev
  0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 20:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > This new argument will be used in the next patches for the
> > eth_get_headlen use case. eth_get_headlen calls flow dissector
> > with only data (without skb) so there is currently no way to
> > pull attached BPF flow dissector program. With this new argument,
> > we can amend the callers to explicitly pass network namespace
> > so we can use attached BPF program.
> >
> > Note: WARN_ON_ONCE(!net) will now trigger for eth_get_headlen users.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> 
> >  /**
> >   * __skb_flow_dissect - extract the flow_keys struct and return it
> > + * @net: associated network namespace, if NULL pulled from skb
> >   * @skb: sk_buff to extract the flow from, can be NULL if the rest are specified
> >   * @flow_dissector: list of keys to dissect
> >   * @target_container: target structure to put dissected values into
> > @@ -739,7 +740,8 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
> >   *
> >   * Caller must take care of zeroing target container memory.
> >   */
> > -bool __skb_flow_dissect(const struct sk_buff *skb,
> > +bool __skb_flow_dissect(struct net *net,
> > +                       const struct sk_buff *skb,
> >                         struct flow_dissector *flow_dissector,
> >                         void *target_container,
> >                         void *data, __be16 proto, int nhoff, int hlen,
> > @@ -799,12 +801,11 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >
> >                 rcu_read_lock();
> >
> > -               if (skb->dev)
> > -                       attached = rcu_dereference(dev_net(skb->dev)->flow_dissector_prog);
> > -               else if (skb->sk)
> > -                       attached = rcu_dereference(sock_net(skb->sk)->flow_dissector_prog);
> > -               else
> > -                       WARN_ON_ONCE(1);
> > +               if (!net && skb)
> > +                       net = skb_net(skb);
> > +               if (net)
> > +                       attached = rcu_dereference(net->flow_dissector_prog);
> > +               WARN_ON_ONCE(!net);
> 
> Instead of this just call skb_net(skb) in all callers of
> __skb_flow_dissect that are called with an skb argument directly?
> 
> It may have to be able to handle skb == NULL args.
Ack, will look into it.


* Re: [RFC bpf-next 2/7] net: introduce skb_net helper
  2019-02-05 20:19   ` Willem de Bruijn
@ 2019-02-05 20:42     ` Stanislav Fomichev
  0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 20:42 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > skb_net returns network namespace from the associated device or socket.
> >
> > This will be used in the next commit.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/skbuff.h |  2 ++
> >  net/core/skbuff.c      | 10 ++++++++++
> >  2 files changed, 12 insertions(+)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index ad883ab2762c..28723a86efdf 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -4343,5 +4343,7 @@ static inline __wsum lco_csum(struct sk_buff *skb)
> >         return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
> >  }
> >
> > +struct net *skb_net(const struct sk_buff *skb);
> > +
> >  #endif /* __KERNEL__ */
> >  #endif /* _LINUX_SKBUFF_H */
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 23c9cf100bd4..016db13fa2b6 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -5585,6 +5585,16 @@ void skb_condense(struct sk_buff *skb)
> >         skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> >  }
> >
> > +struct net *skb_net(const struct sk_buff *skb)
> > +{
> > +       if (skb->dev)
> > +               return dev_net(skb->dev);
> > +       else if (skb->sk)
> > +               return sock_net(skb->sk);
> > +       return NULL;
> > +}
> > +EXPORT_SYMBOL(skb_net);
> 
> If this needs a helper it is probably better static inline in the header.
I did that initially in skbuff.h as an inline, but it was missing some
struct definitions, so I decided to go with this for the RFC. Will look
into inlining.


* Re: [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case
  2019-02-05 20:19   ` Willem de Bruijn
@ 2019-02-05 20:45     ` Stanislav Fomichev
  0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-05 20:45 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, Network Development, David Miller,
	Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Willem de Bruijn wrote:
> On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > When flow_dissector is called without skb (with only data and hlen),
> > construct on-stack skb (which has a linear chunk of data passed
> > to the flow dissector). This should let us handle eth_get_headlen
> > case where only data is provided and we don't want to (yet) allocate
> > an skb.
> >
> > Since this on-stack skb doesn't allocate its own data, we can't
> > add shinfo and need to be careful to avoid any code paths that use
> > it. Flow dissector BPF programs can only call bpf_skb_load_bytes helper,
> > which doesn't touch shinfo in our case (skb->len is the length of the
> > linear header so it exits early).
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/skbuff.h    |  5 +++
> >  net/core/flow_dissector.c | 95 +++++++++++++++++++++++++++++----------
> >  2 files changed, 76 insertions(+), 24 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index aa9a9983de80..5f1c085cb34c 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1227,6 +1227,11 @@ bool __skb_flow_bpf_dissect(struct bpf_prog *prog,
> >                             const struct sk_buff *skb,
> >                             struct flow_dissector *flow_dissector,
> >                             struct bpf_flow_keys *flow_keys);
> > +bool __flow_bpf_dissect(struct bpf_prog *prog,
> > +                       void *data, __be16 proto,
> > +                       int nhoff, int hlen,
> > +                       struct flow_dissector *flow_dissector,
> > +                       struct bpf_flow_keys *flow_keys);
> 
> nit: please use more descriptive name. Perhaps bpf_flow_dissect_raw
> and rename __skb_flow_bpf_dissect to bpf_flow_dissect_skb.
Agreed.

> > +bool __flow_bpf_dissect(struct bpf_prog *prog,
> > +                       void *data, __be16 proto,
> > +                       int nhoff, int hlen,
> > +                       struct flow_dissector *flow_dissector,
> > +                       struct bpf_flow_keys *flow_keys)
> > +{
> > +       struct bpf_skb_data_end *cb;
> > +       struct sk_buff skb;
> > +       u32 result;
> > +
> > +       __init_skb(&skb, data, hlen);
> > +       skb_put(&skb, hlen);
> > +       skb.protocol = proto;
> > +
> > +       init_flow_keys(flow_keys, &skb, nhoff);
> > +
> > +       cb = (struct bpf_skb_data_end *)skb.cb;
> > +       cb->data_meta = skb.data;
> > +       cb->data_end  = skb.data + skb_headlen(&skb);
> > +
> > +       result = BPF_PROG_RUN(prog, &skb);
> > +
> > +       clamp_flow_keys(flow_keys, hlen);
> >
> >         return result == BPF_OK;
> >  }
> 
> Can __flow_bpf_dissect just construct an skb and then call
> __skb_flow_bpf_dissect?
__skb_flow_bpf_dissect calls bpf_compute_data_pointers, which calls
skb_metadata_len, which touches shinfo. And I don't think I have a
clever way to handle that.

> 
> It will unnecessarily save and restore the control block, but that is
> a relatively small cost (compared to, say, zeroing the entire skb).


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-05 20:40   ` Stanislav Fomichev
@ 2019-02-06  0:47     ` Alexei Starovoitov
  2019-02-06  0:59       ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2019-02-06  0:47 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> On 02/05, Willem de Bruijn wrote:
> > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> > > skb. Because we use passed skb to lookup associated networking namespace
> > > to find whether we have a BPF program attached or not, we always use
> > > C-based flow dissector in this case.
> > >
> > > The goal of this patch series is to add new networking namespace argument
> > > to the eth_get_headlen and make BPF flow dissector programs be able to
> > > work in the skb-less case.
> > >
> > > The series goes like this:
> > > 1. introduce __init_skb and __init_skb_shinfo; those will be used to
> > >    initialize temporary skb
> > > 2. introduce skb_net which can be used to get networking namespace
> > >    associated with an skb
> > > 3. add new optional network namespace argument to __skb_flow_dissect and
> > >    plumb through the callers
> > > 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
> > >    (using __init_skb) and calls BPF flow dissector program
> > 
> > The main concern I see with this series is this cost of skb zeroing
> > for every packet in the device driver receive routine, *independent*
> > from the real skb allocation and zeroing which will likely happen
> > later.
> Yes, plus ~200 bytes on the stack for the callers.
> 
> Not sure how visible this zeroing though, I can probably try to get some
> numbers from BPF_PROG_TEST_RUN (running current version vs running with
> on-stack skb).

imo an extra 256-byte memset for every packet is a non-starter.



* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-06  0:47     ` Alexei Starovoitov
@ 2019-02-06  0:59       ` Stanislav Fomichev
  2019-02-06  3:12         ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-06  0:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Alexei Starovoitov wrote:
> On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > On 02/05, Willem de Bruijn wrote:
> > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > [...]
> > > 
> > > The main concern I see with this series is this cost of skb zeroing
> > > for every packet in the device driver receive routine, *independent*
> > > from the real skb allocation and zeroing which will likely happen
> > > later.
> > Yes, plus ~200 bytes on the stack for the callers.
> > 
> > Not sure how visible this zeroing though, I can probably try to get some
> > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > on-stack skb).
> 
> imo extra 256 byte memset for every packet is non starter.
We can put pre-allocated/initialized skbs (without data) into a per-cpu
variable, or even use pcpu_freelist_pop/pcpu_freelist_push to make sure we
don't have to think about having multiple per-cpu copies for
irq/softirq/process contexts.
Any concerns with that approach?
Any other possible concerns with the overall series?


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-06  0:59       ` Stanislav Fomichev
@ 2019-02-06  3:12         ` Alexei Starovoitov
  2019-02-06  3:56           ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2019-02-06  3:12 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> On 02/05, Alexei Starovoitov wrote:
> > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > On 02/05, Willem de Bruijn wrote:
> > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > [...]
> > > > 
> > > > The main concern I see with this series is this cost of skb zeroing
> > > > for every packet in the device driver receive routine, *independent*
> > > > from the real skb allocation and zeroing which will likely happen
> > > > later.
> > > Yes, plus ~200 bytes on the stack for the callers.
> > > 
> > > Not sure how visible this zeroing though, I can probably try to get some
> > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > on-stack skb).
> > 
> > imo extra 256 byte memset for every packet is non starter.
> We can put pre-allocated/initialized skbs without data into percpu or even
> use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> about having multiple percpu for irq/softirq/process contexts.
> Any concerns with that approach?
> Any other possible concerns with the overall series?

I'm missing why the whole thing is needed.
You're saying:
" make BPF flow dissector programs be able to work in the skb-less case".
What does it mean specifically?
The only non-skb case is XDP.
Are you saying you want flow_dissector prog to be run in XDP?



* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-06  3:12         ` Alexei Starovoitov
@ 2019-02-06  3:56           ` Stanislav Fomichev
  2019-02-06  4:11             ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-06  3:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Alexei Starovoitov wrote:
> On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > On 02/05, Alexei Starovoitov wrote:
> > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > On 02/05, Willem de Bruijn wrote:
> > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > >
> > > > > > [...]
> > > > > 
> > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > for every packet in the device driver receive routine, *independent*
> > > > > from the real skb allocation and zeroing which will likely happen
> > > > > later.
> > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > 
> > > > Not sure how visible this zeroing though, I can probably try to get some
> > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > on-stack skb).
> > > 
> > > imo extra 256 byte memset for every packet is non starter.
> > We can put pre-allocated/initialized skbs without data into percpu or even
> > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > about having multiple percpu for irq/softirq/process contexts.
> > Any concerns with that approach?
> > Any other possible concerns with the overall series?
> 
> I'm missing why the whole thing is needed.
> You're saying:
> " make BPF flow dissector programs be able to work in the skb-less case".
> What does it mean specifically?
> The only non-skb case is XDP.
> Are you saying you want flow_dissector prog to be run in XDP?
eth_get_headlen, which drivers call on the RX path on a chunk of data to
guesstimate the length of the headers, calls the flow dissector without an
skb (__skb_flow_dissect has a weird interface in that it accepts either an
skb or data+len). Right now, there is no way to trigger the BPF flow
dissector for this case (we don't have an skb to get the associated
namespace/etc/etc). The patch series tries to fix that, to make sure that
we always trigger the BPF program if it's attached to a device's namespace.

Context:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ba38943ba190eb6a494262003e23187d1b40fb4


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-06  3:56           ` Stanislav Fomichev
@ 2019-02-06  4:11             ` Alexei Starovoitov
  2019-02-06  5:49               ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2019-02-06  4:11 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Tue, Feb 05, 2019 at 07:56:19PM -0800, Stanislav Fomichev wrote:
> On 02/05, Alexei Starovoitov wrote:
> > On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > > On 02/05, Alexei Starovoitov wrote:
> > > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > > On 02/05, Willem de Bruijn wrote:
> > > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > >
> > > > > > > [...]
> > > > > > 
> > > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > > for every packet in the device driver receive routine, *independent*
> > > > > > from the real skb allocation and zeroing which will likely happen
> > > > > > later.
> > > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > > 
> > > > > Not sure how visible this zeroing though, I can probably try to get some
> > > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > > on-stack skb).
> > > > 
> > > > imo extra 256 byte memset for every packet is non starter.
> > > We can put pre-allocated/initialized skbs without data into percpu or even
> > > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > > about having multiple percpu for irq/softirq/process contexts.
> > > Any concerns with that approach?
> > > Any other possible concerns with the overall series?
> > 
> > I'm missing why the whole thing is needed.
> > You're saying:
> > " make BPF flow dissector programs be able to work in the skb-less case".
> > What does it mean specifically?
> > The only non-skb case is XDP.
> > Are you saying you want flow_dissector prog to be run in XDP?
> eth_get_headlen that drivers call on RX path on a chunk of data to
> guesstimate the length of the headers calls flow dissector without an skb
> (__skb_flow_dissect was a weird interface where it accepts skb or
> data+len). Right now, there is no way to trigger BPF flow dissector
> for this case (we don't have an skb to get associated namespace/etc/etc).
> The patch series tries to fix that to make sure that we always trigger
> BPF program if it's attached to a device's namespace.

then why not create a flow_dissector prog type that works without an skb?
Why do you need to fake an skb?
XDP progs work just fine without it.



* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-06  4:11             ` Alexei Starovoitov
@ 2019-02-06  5:49               ` Stanislav Fomichev
  2019-02-12 17:02                 ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-06  5:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Alexei Starovoitov wrote:
> On Tue, Feb 05, 2019 at 07:56:19PM -0800, Stanislav Fomichev wrote:
> > On 02/05, Alexei Starovoitov wrote:
> > > On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > > > On 02/05, Alexei Starovoitov wrote:
> > > > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > > > On 02/05, Willem de Bruijn wrote:
> > > > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > 
> > > > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > > > for every packet in the device driver receive routine, *independent*
> > > > > > > from the real skb allocation and zeroing which will likely happen
> > > > > > > later.
> > > > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > > > 
> > > > > > Not sure how visible this zeroing though, I can probably try to get some
> > > > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > > > on-stack skb).
> > > > > 
> > > > > imo extra 256 byte memset for every packet is non starter.
> > > > We can put pre-allocated/initialized skbs without data into percpu or even
> > > > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > > > about having multiple percpu for irq/softirq/process contexts.
> > > > Any concerns with that approach?
> > > > Any other possible concerns with the overall series?
> > > 
> > > I'm missing why the whole thing is needed.
> > > You're saying:
> > > " make BPF flow dissector programs be able to work in the skb-less case".
> > > What does it mean specifically?
> > > The only non-skb case is XDP.
> > > Are you saying you want flow_dissector prog to be run in XDP?
> > eth_get_headlen that drivers call on RX path on a chunk of data to
> > guesstimate the length of the headers calls flow dissector without an skb
> > (__skb_flow_dissect was a weird interface where it accepts skb or
> > data+len). Right now, there is no way to trigger BPF flow dissector
> > for this case (we don't have an skb to get associated namespace/etc/etc).
> > The patch series tries to fix that to make sure that we always trigger
> > BPF program if it's attached to a device's namespace.
> 
> then why not to create flow_dissector prog type that works without skb?
> Why do you need to fake an skb?
> XDP progs work just fine without it.
What's the advantage of having another prog type? In this case we would have
to write the same flow dissector program twice: first time against the __sk_buff
interface, second time against xdp_md.
By using a fake skb, we make the same flow dissector __sk_buff BPF program
work in both contexts without a rewrite to an xdp interface (I don't
think users should care whether the flow dissector was called from "xdp" vs skb
context; and we're sort of stuck with the __sk_buff interface already).


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-06  5:49               ` Stanislav Fomichev
@ 2019-02-12 17:02                 ` Stanislav Fomichev
  2019-02-14  4:39                   ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-12 17:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/05, Stanislav Fomichev wrote:
> On 02/05, Alexei Starovoitov wrote:
> > On Tue, Feb 05, 2019 at 07:56:19PM -0800, Stanislav Fomichev wrote:
> > > On 02/05, Alexei Starovoitov wrote:
> > > > On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > > > > On 02/05, Alexei Starovoitov wrote:
> > > > > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > > > > On 02/05, Willem de Bruijn wrote:
> > > > > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > 
> > > > > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > > > > for every packet in the device driver receive routine, *independent*
> > > > > > > > from the real skb allocation and zeroing which will likely happen
> > > > > > > > later.
> > > > > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > > > > 
> > > > > > > Not sure how visible this zeroing though, I can probably try to get some
> > > > > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > > > > on-stack skb).
> > > > > > 
> > > > > > imo extra 256 byte memset for every packet is non starter.
> > > > > We can put pre-allocated/initialized skbs without data into percpu or even
> > > > > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > > > > about having multiple percpu for irq/softirq/process contexts.
> > > > > Any concerns with that approach?
> > > > > Any other possible concerns with the overall series?
> > > > 
> > > > I'm missing why the whole thing is needed.
> > > > You're saying:
> > > > " make BPF flow dissector programs be able to work in the skb-less case".
> > > > What does it mean specifically?
> > > > The only non-skb case is XDP.
> > > > Are you saying you want flow_dissector prog to be run in XDP?
> > > eth_get_headlen that drivers call on RX path on a chunk of data to
> > > guesstimate the length of the headers calls flow dissector without an skb
> > > (__skb_flow_dissect was a weird interface where it accepts skb or
> > > data+len). Right now, there is no way to trigger BPF flow dissector
> > > for this case (we don't have an skb to get associated namespace/etc/etc).
> > > The patch series tries to fix that to make sure that we always trigger
> > > BPF program if it's attached to a device's namespace.
> > 
> > then why not to create flow_dissector prog type that works without skb?
> > Why do you need to fake an skb?
> > XDP progs work just fine without it.
> What's the advantage of having another prog type? In this case we would have
> to write the same flow dissector program twice: first time against __sk_buff
> interface, second time against xdp_md.
> By using fake skb, we make the same flow dissector __sk_buff BPF program
> work in both contexts without a rewrite to an xdp interface (I don't
> think users should care whether flow dissector was called from "xdp" vs skb
> context; and we're sort of stuck with __sk_buff interface already).
Should I follow up with v2 where I address memset(,,256) for each packet?
Or you still have some questions/doubts/suggestions regarding the problem
I'm trying to solve?


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-12 17:02                 ` Stanislav Fomichev
@ 2019-02-14  4:39                   ` Alexei Starovoitov
  2019-02-14  5:57                     ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2019-02-14  4:39 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Tue, Feb 12, 2019 at 09:02:32AM -0800, Stanislav Fomichev wrote:
> On 02/05, Stanislav Fomichev wrote:
> > On 02/05, Alexei Starovoitov wrote:
> > > On Tue, Feb 05, 2019 at 07:56:19PM -0800, Stanislav Fomichev wrote:
> > > > On 02/05, Alexei Starovoitov wrote:
> > > > > On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > > > > > On 02/05, Alexei Starovoitov wrote:
> > > > > > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > > > > > On 02/05, Willem de Bruijn wrote:
> > > > > > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > [...]
> > > > > > > > > 
> > > > > > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > > > > > for every packet in the device driver receive routine, *independent*
> > > > > > > > > from the real skb allocation and zeroing which will likely happen
> > > > > > > > > later.
> > > > > > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > > > > > 
> > > > > > > > Not sure how visible this zeroing though, I can probably try to get some
> > > > > > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > > > > > on-stack skb).
> > > > > > > 
> > > > > > > imo extra 256 byte memset for every packet is non starter.
> > > > > > We can put pre-allocated/initialized skbs without data into percpu or even
> > > > > > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > > > > > about having multiple percpu for irq/softirq/process contexts.
> > > > > > Any concerns with that approach?
> > > > > > Any other possible concerns with the overall series?
> > > > > 
> > > > > I'm missing why the whole thing is needed.
> > > > > You're saying:
> > > > > " make BPF flow dissector programs be able to work in the skb-less case".
> > > > > What does it mean specifically?
> > > > > The only non-skb case is XDP.
> > > > > Are you saying you want flow_dissector prog to be run in XDP?
> > > > eth_get_headlen, which drivers call on the RX path on a chunk of data to
> > > > guesstimate the length of the headers, calls the flow dissector without an skb
> > > > (__skb_flow_dissect has a weird interface where it accepts either skb or
> > > > data+len). Right now, there is no way to trigger BPF flow dissector
> > > > for this case (we don't have an skb to get associated namespace/etc/etc).
> > > > The patch series tries to fix that to make sure that we always trigger
> > > > BPF program if it's attached to a device's namespace.
> > > 
> > > then why not to create flow_dissector prog type that works without skb?
> > > Why do you need to fake an skb?
> > > XDP progs work just fine without it.
> > What's the advantage of having another prog type? In this case we would have
> > to write the same flow dissector program twice: first time against __sk_buff
> > interface, second time against xdp_md.
> > By using fake skb, we make the same flow dissector __sk_buff BPF program
> > work in both contexts without a rewrite to an xdp interface (I don't
> > think users should care whether flow dissector was called from "xdp" vs skb
> > context; and we're sort of stuck with __sk_buff interface already).
> Should I follow up with v2 where I address memset(,,256) for each packet?
> Or you still have some questions/doubts/suggestions regarding the problem
> I'm trying to solve?

sorry for delay. I'm still thinking what is the path forward here.

That 'stuck with __sk_buff' is what bothers me.
It's an indication that api wasn't thought through if first thing
it needs is this fake skb hack.
If bpf_flow.c is a realistic example of such flow dissector prog
it means that real skb fields are accessed.
In particular skb->vlan_proto, skb->protocol.
These fields in case of 'fake skb' will not be set, since eth_type_trans()
isn't called yet.
So either flow_dissector needs a real __sk_buff and all of its fields
should be real or it's a different flow_dissector prog type that
needs ctx->data, ctx->data_end, ctx->flow_keys only.
Either way going with fake skb is incorrect, since bpf_flow.c example
will be broken and for program writers it will be hard to figure why
it's broken.



* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-14  4:39                   ` Alexei Starovoitov
@ 2019-02-14  5:57                     ` Stanislav Fomichev
  2019-02-14  6:38                       ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-14  5:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/13, Alexei Starovoitov wrote:
> On Tue, Feb 12, 2019 at 09:02:32AM -0800, Stanislav Fomichev wrote:
> > On 02/05, Stanislav Fomichev wrote:
> > > On 02/05, Alexei Starovoitov wrote:
> > > > On Tue, Feb 05, 2019 at 07:56:19PM -0800, Stanislav Fomichev wrote:
> > > > > On 02/05, Alexei Starovoitov wrote:
> > > > > > On Tue, Feb 05, 2019 at 04:59:31PM -0800, Stanislav Fomichev wrote:
> > > > > > > On 02/05, Alexei Starovoitov wrote:
> > > > > > > > On Tue, Feb 05, 2019 at 12:40:03PM -0800, Stanislav Fomichev wrote:
> > > > > > > > > On 02/05, Willem de Bruijn wrote:
> > > > > > > > > > On Tue, Feb 5, 2019 at 12:57 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Currently, when eth_get_headlen calls flow dissector, it doesn't pass any
> > > > > > > > > > > skb. Because we use passed skb to lookup associated networking namespace
> > > > > > > > > > > to find whether we have a BPF program attached or not, we always use
> > > > > > > > > > > C-based flow dissector in this case.
> > > > > > > > > > >
> > > > > > > > > > > The goal of this patch series is to add new networking namespace argument
> > > > > > > > > > > to the eth_get_headlen and make BPF flow dissector programs be able to
> > > > > > > > > > > work in the skb-less case.
> > > > > > > > > > >
> > > > > > > > > > > The series goes like this:
> > > > > > > > > > > 1. introduce __init_skb and __init_skb_shinfo; those will be used to
> > > > > > > > > > >    initialize temporary skb
> > > > > > > > > > > 2. introduce skb_net which can be used to get networking namespace
> > > > > > > > > > >    associated with an skb
> > > > > > > > > > > 3. add new optional network namespace argument to __skb_flow_dissect and
> > > > > > > > > > >    plumb through the callers
> > > > > > > > > > > 4. add new __flow_bpf_dissect which constructs temporary on-stack skb
> > > > > > > > > > >    (using __init_skb) and calls BPF flow dissector program
> > > > > > > > > > 
> > > > > > > > > > The main concern I see with this series is this cost of skb zeroing
> > > > > > > > > > for every packet in the device driver receive routine, *independent*
> > > > > > > > > > from the real skb allocation and zeroing which will likely happen
> > > > > > > > > > later.
> > > > > > > > > Yes, plus ~200 bytes on the stack for the callers.
> > > > > > > > > 
> > > > > > > > > Not sure how visible this zeroing is, though; I can probably try to get some
> > > > > > > > > numbers from BPF_PROG_TEST_RUN (running current version vs running with
> > > > > > > > > on-stack skb).
> > > > > > > > 
> > > > > > > > imo extra 256 byte memset for every packet is non starter.
> > > > > > > We can put pre-allocated/initialized skbs without data into percpu or even
> > > > > > > use pcpu_freelist_pop/pcpu_freelist_push to make sure we don't have to think
> > > > > > > about having multiple percpu for irq/softirq/process contexts.
> > > > > > > Any concerns with that approach?
> > > > > > > Any other possible concerns with the overall series?
> > > > > > 
> > > > > > I'm missing why the whole thing is needed.
> > > > > > You're saying:
> > > > > > " make BPF flow dissector programs be able to work in the skb-less case".
> > > > > > What does it mean specifically?
> > > > > > The only non-skb case is XDP.
> > > > > > Are you saying you want flow_dissector prog to be run in XDP?
> > > > > eth_get_headlen, which drivers call on the RX path on a chunk of data to
> > > > > guesstimate the length of the headers, calls the flow dissector without an skb
> > > > > (__skb_flow_dissect has a weird interface where it accepts either skb or
> > > > > data+len). Right now, there is no way to trigger BPF flow dissector
> > > > > for this case (we don't have an skb to get associated namespace/etc/etc).
> > > > > The patch series tries to fix that to make sure that we always trigger
> > > > > BPF program if it's attached to a device's namespace.
> > > > 
> > > > then why not to create flow_dissector prog type that works without skb?
> > > > Why do you need to fake an skb?
> > > > XDP progs work just fine without it.
> > > What's the advantage of having another prog type? In this case we would have
> > > to write the same flow dissector program twice: first time against __sk_buff
> > > interface, second time against xdp_md.
> > > By using fake skb, we make the same flow dissector __sk_buff BPF program
> > > work in both contexts without a rewrite to an xdp interface (I don't
> > > think users should care whether flow dissector was called from "xdp" vs skb
> > > context; and we're sort of stuck with __sk_buff interface already).
> > Should I follow up with v2 where I address memset(,,256) for each packet?
> > Or you still have some questions/doubts/suggestions regarding the problem
> > I'm trying to solve?
> 
> sorry for delay. I'm still thinking what is the path forward here.
No worries, thanks for sticking with me :-)

> That 'stuck with __sk_buff' is what bothers me.
I might have used the wrong word here. I don't think there is another
option, to be honest. Using __sk_buff makes flow dissector programs work
with fragmented packets; if we were to use xdp_md instead, it would
not work in this case. Another point here: the fact that
eth_get_headlen calls the flow dissector on a chunk of data instead of an skb
feels like an implementation detail. Imo, application writers should not
care about this context; coding against __sk_buff feels like the best we
can do.
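
The point about fragmented packets can be made concrete with a small
userspace model (all names here are illustrative stand-ins, not kernel
code): a bpf_skb_load_bytes()-style read can pull bytes that span the
linear area and the frags, which an xdp_md-style data/data_end window
cannot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative userspace model: an skb with a linear part plus frags.
 * Not the real struct sk_buff; only the "non-linear read" idea. */
struct model_frag { const uint8_t *data; size_t len; };
struct model_skb {
	const uint8_t *head;   /* linear part */
	size_t head_len;
	struct model_frag frags[2];
	size_t nr_frags;
};

/* Copies len bytes starting at off, spanning linear part and frags;
 * returns 0 on success, -1 if the packet is too short.  This mirrors
 * the semantics that make __sk_buff usable on fragmented packets. */
static int model_skb_load_bytes(const struct model_skb *skb, size_t off,
				void *to, size_t len)
{
	uint8_t *dst = to;

	if (off < skb->head_len) {
		size_t n = skb->head_len - off;
		if (n > len)
			n = len;
		memcpy(dst, skb->head + off, n);
		dst += n; len -= n; off = 0;
	} else {
		off -= skb->head_len;
	}
	for (size_t i = 0; i < skb->nr_frags && len; i++) {
		if (off >= skb->frags[i].len) {
			off -= skb->frags[i].len;
			continue;
		}
		size_t n = skb->frags[i].len - off;
		if (n > len)
			n = len;
		memcpy(dst, skb->frags[i].data + off, n);
		dst += n; len -= n; off = 0;
	}
	return len ? -1 : 0;
}
```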

> It's an indication that api wasn't thought through if first thing
> it needs is this fake skb hack.
> If bpf_flow.c is a realistic example of such flow dissector prog
> it means that real skb fields are accessed.
> In particular skb->vlan_proto, skb->protocol.
I do manually set skb->protocol to eth->h_proto in my proposal. This is later
correctly handled by bpf_flow.c: parse_eth_proto() is called on skb->protocol
and we correctly handle bpf_htons(ETH_P_8021Q) there. So existing
bpf_flow.c works as expected.
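
For reference, the dispatch being described can be modeled in plain C
like this (a simplified sketch of the parse_eth_proto() control flow in
bpf_flow.c, not the actual BPF code; model_htons() assumes a
little-endian host):

```c
#include <assert.h>
#include <stdint.h>

/* Constants mirror <linux/if_ether.h>. */
#define ETH_P_IP     0x0800
#define ETH_P_IPV6   0x86DD
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Byte swap standing in for bpf_htons() on a little-endian host. */
static uint16_t model_htons(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

enum parser { PARSE_IP, PARSE_IPV6, PARSE_VLAN, PARSE_DROP };

/* The program keys off skb->protocol, so as long as the caller sets
 * that field (as the fake-skb proposal does from eth->h_proto), VLAN
 * tags are still recognized and then parsed out of the packet data. */
static enum parser parse_eth_proto(uint16_t proto_be)
{
	if (proto_be == model_htons(ETH_P_IP))
		return PARSE_IP;
	if (proto_be == model_htons(ETH_P_IPV6))
		return PARSE_IPV6;
	if (proto_be == model_htons(ETH_P_8021Q) ||
	    proto_be == model_htons(ETH_P_8021AD))
		return PARSE_VLAN;
	return PARSE_DROP;
}
```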

Related: I was also thinking about moving this check out of bpf_flow.c
and passing n_proto directly. I don't see why bpf_flow_keys should
"export" n_proto; we know it when we call the flow dissector, so there is
no need to test for skb->vlan_present in the BPF program.

> These fields in case of 'fake skb' will not be set, since eth_type_trans()
> isn't called yet.
Just to reiterate, I do set skb->protocol manually and leave
skb->vlan_present == false (and bpf_flow.c handles this case).
We can also set the correct vlan_present with an additional check, but I
decided not to do it since bpf_flow.c handles that.

> So either flow_dissector needs a real __sk_buff and all of its fields
> should be real or it's a different flow_dissector prog type that
> needs ctx->data, ctx->data_end, ctx->flow_keys only.
> Either way going with fake skb is incorrect, since bpf_flow.c example
> will be broken and for program writers it will be hard to figure why
> it's broken.
It's fake in the sense that it's stack-allocated in this particular case.
To bpf_flow.c it looks like a real __sk_buff. I don't see why the
bpf_flow.c example would be broken in this case, can you elaborate?

The goal of this patch series was to essentially make this skb/no-skb
context transparent to the bpf_flow.c (i.e. no changes from the user
flow programs). Adding another flow dissector for the eth_get_headlen case
also seems like a no-go.
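
To make the approach concrete, here is a userspace sketch of the
on-stack fake-skb idea (the struct layout and all names are illustrative
stand-ins; this is not the real struct sk_buff nor the real __init_skb()
from the series):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the on-stack skb the series builds before invoking the
 * BPF flow dissector program. */
struct fake_skb {
	const uint8_t *data;
	uint32_t len;
	uint16_t protocol;     /* set from eth->h_proto by the caller */
	uint8_t vlan_present;  /* stays 0: eth_type_trans() never ran */
};

static void fake_skb_init(struct fake_skb *skb, const uint8_t *data,
			  uint32_t len)
{
	/* This zeroing stands in for the ~256-byte per-packet memset
	 * objected to in the thread; a per-cpu pre-initialized skb
	 * would amortize that cost. */
	memset(skb, 0, sizeof(*skb));
	skb->data = data;
	skb->len = len;
	/* Ethernet header: bytes 12-13 carry h_proto, big endian. */
	if (len >= 14)
		skb->protocol = (uint16_t)(data[12] << 8 | data[13]);
}
```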


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-14  5:57                     ` Stanislav Fomichev
@ 2019-02-14  6:38                       ` Alexei Starovoitov
  2019-02-14 17:35                         ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2019-02-14  6:38 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On Wed, Feb 13, 2019 at 09:57:25PM -0800, Stanislav Fomichev wrote:
> 
> > That 'stuck with __sk_buff' is what bothers me.
> I might have used the wrong word here. I don't think there is another
> option to be honest. Using __sk_buff makes flow dissector programs work
> with fragmented packets;

good point. indeed real skb is essential.

> > It's an indication that api wasn't thought through if first thing
> > it needs is this fake skb hack.
> > If bpf_flow.c is a realistic example of such flow dissector prog
> > it means that real skb fields are accessed.
> > In particular skb->vlan_proto, skb->protocol.
> I do manually set skb->protocol to eth->h_proto in my proposal. This is later
> correctly handled by bpf_flow.c: parse_eth_proto() is called on skb->protocol
> and we correctly handle bpf_htons(ETH_P_8021Q) there. So existing
> bpf_flow.c works as expected.
...
> The goal of this patch series was to essentially make this skb/no-skb
> context transparent to the bpf_flow.c (i.e. no changes from the user
> > flow programs). Adding another flow dissector for the eth_get_headlen case
> > also seems like a no-go.

The problem with this thinking is the assumption that bpf_flow.c is the only program.
Since ctx of flow_dissector prog type is 'struct __sk_buff'
all fields should be valid or the verifier has to reject access
to fields that were not set.
You cannot "manually set skb->protocol to eth->h_proto" in fake skb
and ignore the rest.



* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-14  6:38                       ` Alexei Starovoitov
@ 2019-02-14 17:35                         ` Stanislav Fomichev
  2019-02-25 20:33                           ` Stanislav Fomichev
  0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-14 17:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/13, Alexei Starovoitov wrote:
> On Wed, Feb 13, 2019 at 09:57:25PM -0800, Stanislav Fomichev wrote:
> > 
> > > That 'stuck with __sk_buff' is what bothers me.
> > I might have used the wrong word here. I don't think there is another
> > option to be honest. Using __sk_buff makes flow dissector programs work
> > with fragmented packets;
> 
> good point. indeed real skb is essential.
> 
> > > It's an indication that api wasn't thought through if first thing
> > > it needs is this fake skb hack.
> > > If bpf_flow.c is a realistic example of such flow dissector prog
> > > it means that real skb fields are accessed.
> > > In particular skb->vlan_proto, skb->protocol.
> > I do manually set skb->protocol to eth->h_proto in my proposal. This is later
> > correctly handled by bpf_flow.c: parse_eth_proto() is called on skb->protocol
> > and we correctly handle bpf_htons(ETH_P_8021Q) there. So existing
> > bpf_flow.c works as expected.
> ...
> > The goal of this patch series was to essentially make this skb/no-skb
> > context transparent to the bpf_flow.c (i.e. no changes from the user
> > > flow programs). Adding another flow dissector for the eth_get_headlen case
> > > also seems like a no-go.
> 
> The problem with this thinking is the assumption that bpf_flow.c is the only program.
I agree, it's a bad assumption, but it is sort of a reference implementation;
I don't expect other users to do something wildly different. Hopefully :-)

> Since ctx of flow_dissector prog type is 'struct __sk_buff'
> all fields should be valid or the verifier has to reject access
> to fields that were not set.
> You cannot "manually set skb->protocol to eth->h_proto" in fake skb
> and ignore the rest.
Ugh, I expected that only a minimal set of __sk_buff fields would be
accessible from the flow dissector program type, but that's not the
case. We explicitly prohibit access only to
family/ips/ports/tc_classid/tstamp/wire_len; everything else is readable :-/
Any idea why?
Stuff like ingress_ifindex/ifindex/hash/mark/queue_mapping, do flow dissector
programs really need to know that?

For the most part, using zero-initialized fake skb looks fine, except:
* ifindex, where we do skb->dev->ifindex (skb->dev is NULL)
* gso_segs, where we do skb_shinfo(skb)->gso_segs (we are missing
  shinfo)

So there are indeed a couple of problems.

How do you feel about tightening down the access to __sk_buff fields from
the flow dissector program type? That is an API change, but I don't see why
existing users would need those fields. Let's allow access only to
len/data/data_end, protocol, vlan_{present,tci,proto}, cb, flow_keys;
that should be enough to dissect the packet (I also looked at the C-based
implementation, it doesn't use anything besides that).
We can always roll back if somebody complains.


* Re: [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen
  2019-02-14 17:35                         ` Stanislav Fomichev
@ 2019-02-25 20:33                           ` Stanislav Fomichev
  0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2019-02-25 20:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, Network Development,
	David Miller, Alexei Starovoitov, Daniel Borkmann, simon.horman,
	Willem de Bruijn

On 02/14, Stanislav Fomichev wrote:
> On 02/13, Alexei Starovoitov wrote:
> > On Wed, Feb 13, 2019 at 09:57:25PM -0800, Stanislav Fomichev wrote:
> > > 
> > > > That 'stuck with __sk_buff' is what bothers me.
> > > I might have used the wrong word here. I don't think there is another
> > > option to be honest. Using __sk_buff makes flow dissector programs work
> > > with fragmented packets;
> > 
> > good point. indeed real skb is essential.
> > 
> > > > It's an indication that api wasn't thought through if first thing
> > > > it needs is this fake skb hack.
> > > > If bpf_flow.c is a realistic example of such flow dissector prog
> > > > it means that real skb fields are accessed.
> > > > In particular skb->vlan_proto, skb->protocol.
> > > I do manually set skb->protocol to eth->h_proto in my proposal. This is later
> > > correctly handled by bpf_flow.c: parse_eth_proto() is called on skb->protocol
> > > and we correctly handle bpf_htons(ETH_P_8021Q) there. So existing
> > > bpf_flow.c works as expected.
> > ...
> > > The goal of this patch series was to essentially make this skb/no-skb
> > > context transparent to the bpf_flow.c (i.e. no changes from the user
> > > flow programs). Adding another flow dissector for the eth_get_headlen case
> > > also seems like a no-go.
> > 
> > The problem with this thinking is the assumption that bpf_flow.c is the only program.
> I agree, it's a bad assumption, but it is sort of a reference implementation;
> I don't expect other users to do something wildly different. Hopefully :-)
> 
> > Since ctx of flow_dissector prog type is 'struct __sk_buff'
> > all fields should be valid or the verifier has to reject access
> > to fields that were not set.
> > You cannot "manually set skb->protocol to eth->h_proto" in fake skb
> > and ignore the rest.
> Ugh, I expected that only a minimal set of __sk_buff fields would be
> accessible from the flow dissector program type, but that's not the
> case. We explicitly prohibit access only to
> family/ips/ports/tc_classid/tstamp/wire_len; everything else is readable :-/
> Any idea why?
> Stuff like ingress_ifindex/ifindex/hash/mark/queue_mapping, do flow dissector
> programs really need to know that?
> 
> For the most part, using zero-initialized fake skb looks fine, except:
> * ifindex, where we do skb->dev->ifindex (skb->dev is NULL)
> * gso_segs, where we do skb_shinfo(skb)->gso_segs (we are missing
>   shinfo)
> 
> So there are indeed a couple of problems.
> 
> How do you feel about tightening down the access to __sk_buff fields from
> the flow dissector program type? That is an API change, but I don't see why
> existing users would need those fields. Let's allow access only to
> len/data/data_end, protocol, vlan_{present,tci,proto}, cb, flow_keys;
> that should be enough to dissect the packet (I also looked at the C-based
> implementation, it doesn't use anything besides that).
> We can always roll back if somebody complains.

To revive the conversation, here is what I was thinking about (whitelist
the skb fields, not blacklist them):

--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6591,11 +6591,14 @@ static bool flow_dissector_is_valid_access(int off, int size,
        case bpf_ctx_range_ptr(struct __sk_buff, flow_keys):
                info->reg_type = PTR_TO_FLOW_KEYS;
                break;
-       case bpf_ctx_range(struct __sk_buff, tc_classid):
-       case bpf_ctx_range(struct __sk_buff, data_meta):
-       case bpf_ctx_range_till(struct __sk_buff, family, local_port):
-       case bpf_ctx_range(struct __sk_buff, tstamp):
-       case bpf_ctx_range(struct __sk_buff, wire_len):
+       case bpf_ctx_range(struct __sk_buff, len):
+       case bpf_ctx_range(struct __sk_buff, protocol):
+       case bpf_ctx_range(struct __sk_buff, vlan_present):
+       case bpf_ctx_range(struct __sk_buff, vlan_tci):
+       case bpf_ctx_range(struct __sk_buff, vlan_proto):
+       case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+               break;
+       default:
                return false;
        }

What do you think?
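
A userspace model of the default-deny logic in the diff above may help
sanity-check it (the struct layout and offsets here are illustrative,
not the real uapi struct __sk_buff; the actual patch also keeps the
data/data_end/flow_keys pointer cases, which this sketch omits):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in ctx struct; field set chosen to exercise the whitelist,
 * with mark/tstamp as examples of fields that become invalid. */
struct model_sk_buff {
	uint32_t len;
	uint32_t protocol;
	uint32_t vlan_present;
	uint32_t vlan_tci;
	uint32_t vlan_proto;
	uint32_t cb[5];
	uint32_t mark;
	uint32_t tstamp;
};

#define FIELD_RANGE(f) \
	offsetof(struct model_sk_buff, f), \
	offsetof(struct model_sk_buff, f) + sizeof(((struct model_sk_buff *)0)->f)

/* An access is valid only if it falls entirely within a whitelisted
 * field; everything else hits the default and is rejected, mirroring
 * the proposed default: return false. */
static bool flow_dissector_is_valid_access(size_t off, size_t size)
{
	static const size_t allowed[][2] = {
		{ FIELD_RANGE(len) },
		{ FIELD_RANGE(protocol) },
		{ FIELD_RANGE(vlan_present) },
		{ FIELD_RANGE(vlan_tci) },
		{ FIELD_RANGE(vlan_proto) },
		{ FIELD_RANGE(cb) },
	};

	for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
		if (off >= allowed[i][0] && off + size <= allowed[i][1])
			return true;
	return false;
}
```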


end of thread, other threads:[~2019-02-25 20:33 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-05 17:36 [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Stanislav Fomichev
2019-02-05 17:36 ` [RFC bpf-next 1/7] net: introduce __init_skb and __init_skb_shinfo helpers Stanislav Fomichev
2019-02-05 20:18   ` Willem de Bruijn
2019-02-05 17:36 ` [RFC bpf-next 2/7] net: introduce skb_net helper Stanislav Fomichev
2019-02-05 20:19   ` Willem de Bruijn
2019-02-05 20:42     ` Stanislav Fomichev
2019-02-05 17:36 ` [RFC bpf-next 3/7] net: plumb network namespace into __skb_flow_dissect Stanislav Fomichev
2019-02-05 20:19   ` Willem de Bruijn
2019-02-05 20:40     ` Stanislav Fomichev
2019-02-05 17:36 ` [RFC bpf-next 4/7] net: flow_dissector: handle no-skb use case Stanislav Fomichev
2019-02-05 20:19   ` Willem de Bruijn
2019-02-05 20:45     ` Stanislav Fomichev
2019-02-05 17:36 ` [RFC bpf-next 5/7] bpf: when doing BPF_PROG_TEST_RUN for flow dissector use no-skb mode Stanislav Fomichev
2019-02-05 20:19   ` Willem de Bruijn
2019-02-05 17:36 ` [RFC bpf-next 6/7] selftests/bpf: add flow dissector bpf_skb_load_bytes helper test Stanislav Fomichev
2019-02-05 17:36 ` [RFC bpf-next 7/7] net: flow_dissector: pass net argument to the eth_get_headlen Stanislav Fomichev
2019-02-05 20:18 ` [RFC bpf-next 0/7] net: flow_dissector: trigger BPF hook when called from eth_get_headlen Willem de Bruijn
2019-02-05 20:40   ` Stanislav Fomichev
2019-02-06  0:47     ` Alexei Starovoitov
2019-02-06  0:59       ` Stanislav Fomichev
2019-02-06  3:12         ` Alexei Starovoitov
2019-02-06  3:56           ` Stanislav Fomichev
2019-02-06  4:11             ` Alexei Starovoitov
2019-02-06  5:49               ` Stanislav Fomichev
2019-02-12 17:02                 ` Stanislav Fomichev
2019-02-14  4:39                   ` Alexei Starovoitov
2019-02-14  5:57                     ` Stanislav Fomichev
2019-02-14  6:38                       ` Alexei Starovoitov
2019-02-14 17:35                         ` Stanislav Fomichev
2019-02-25 20:33                           ` Stanislav Fomichev
