All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
@ 2018-11-26 11:38 Florian Westphal
  2018-11-26 11:38 ` [RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly Florian Westphal
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Florian Westphal @ 2018-11-26 11:38 UTC (permalink / raw)
  To: netdev

The (out-of-tree) Multipath-TCP implementation needs a significant amount
of extra space in the skb control buffer.

Increasing skb->cb[] size in mainline is a non-starter for memory and
and performance reasons (f.e. increase in cb size also moves several
frequently-accessed fields to other cache lines).

One approach that might work for MPTCP is to extend skb_shared_info instead
of sk_buff.  However, this comes with other drawbacks, e.g.  it either
needs special skb allocation to make sure there is enough space for such
'extended shinfo' at the end of data buffer (which makes this only useable
for tx path) or increased size of skb_shared_info.

This adds an extension infrastructure for sk_buff instead:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.
3. adding extension to an skb will COW the extension
   buffer if needed.

This is also how xfrm and bridge_nf extra data (skb->sp, skb->nf_bridge)
are handled.

In the future, protocols that need to store more than 48 bytes in skb->cb[]
could add a 'SKB_EXT_EXTRA_CB' or similar to allocate extra space.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been enabled for this skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions byte is 0, pointer value is undefined.

Last patch converts nf_bridge to use the extension infrastructure:
The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.

Extra code added to skb clone and free paths (to deal with
refcount/free of extension area) replace the existing code that
deals with skb->nf_bridge.

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension could be
done as a followup, but I'm reluctant to work on this before there is
agreement that this is the right direction.

Comments welcome.

 include/linux/netfilter_bridge.h     |   33 +++++---
 include/linux/skbuff.h               |  142 +++++++++++++++++++++++++++++------
 include/net/netfilter/br_netfilter.h |   14 ---
 net/Kconfig                          |    4 
 net/bridge/br_netfilter_hooks.c      |   39 +++------
 net/bridge/br_netfilter_ipv6.c       |    4 
 net/core/skbuff.c                    |  134 ++++++++++++++++++++++++++++++++-
 net/ipv4/ip_output.c                 |    1 
 net/ipv4/netfilter/nf_reject_ipv4.c  |    6 -
 net/ipv6/ip6_output.c                |    1 
 net/ipv6/netfilter/nf_reject_ipv6.c  |   10 +-
 net/netfilter/nf_log_common.c        |   20 ++--
 net/netfilter/nf_queue.c             |   50 ++++++++----
 net/netfilter/nfnetlink_queue.c      |   23 ++---
 net/netfilter/xt_physdev.c           |    2 
 15 files changed, 368 insertions(+), 115 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly
  2018-11-26 11:38 [RFC PATCH 0/3] sk_buff: add skb extension infrastructure Florian Westphal
@ 2018-11-26 11:38 ` Florian Westphal
  2018-11-26 11:38 ` [RFC PATCH 2/3] sk_buff: add skb extension infrastructure Florian Westphal
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-11-26 11:38 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

plan is to remove this pointer from sk_buff, so use the existing helpers
in more places to reduce noise later on.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h     | 33 +++++++++++++-----
 include/net/netfilter/br_netfilter.h |  6 ----
 net/bridge/br_netfilter_hooks.c      | 19 ++++++++---
 net/ipv4/netfilter/nf_reject_ipv4.c  |  6 ++--
 net/ipv6/netfilter/nf_reject_ipv6.c  | 10 ++++--
 net/netfilter/nf_log_common.c        | 20 +++++------
 net/netfilter/nf_queue.c             | 50 ++++++++++++++++++----------
 net/netfilter/nfnetlink_queue.c      | 23 ++++++-------
 net/netfilter/xt_physdev.c           |  2 +-
 9 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index fa0686500970..0a65a422587c 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -17,43 +17,58 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 		skb_dst_drop(skb);
 }
 
+static inline struct nf_bridge_info *
+nf_bridge_info_get(const struct sk_buff *skb)
+{
+	return skb->nf_bridge;
+}
+
+static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
+{
+	return skb->nf_bridge != NULL;
+}
+
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
 {
-	struct nf_bridge_info *nf_bridge;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-	if (skb->nf_bridge == NULL)
+	if (!nf_bridge)
 		return 0;
 
-	nf_bridge = skb->nf_bridge;
 	return nf_bridge->physindev ? nf_bridge->physindev->ifindex : 0;
 }
 
 static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
 {
-	struct nf_bridge_info *nf_bridge;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-	if (skb->nf_bridge == NULL)
+	if (!nf_bridge)
 		return 0;
 
-	nf_bridge = skb->nf_bridge;
 	return nf_bridge->physoutdev ? nf_bridge->physoutdev->ifindex : 0;
 }
 
 static inline struct net_device *
 nf_bridge_get_physindev(const struct sk_buff *skb)
 {
-	return skb->nf_bridge ? skb->nf_bridge->physindev : NULL;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	return nf_bridge ? nf_bridge->physindev : NULL;
 }
 
 static inline struct net_device *
 nf_bridge_get_physoutdev(const struct sk_buff *skb)
 {
-	return skb->nf_bridge ? skb->nf_bridge->physoutdev : NULL;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	return nf_bridge ? nf_bridge->physoutdev : NULL;
 }
 
 static inline bool nf_bridge_in_prerouting(const struct sk_buff *skb)
 {
-	return skb->nf_bridge && skb->nf_bridge->in_prerouting;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	return nf_bridge && nf_bridge->in_prerouting;
 }
 #else
 #define br_drop_fake_rtable(skb)	        do { } while (0)
diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index 74af19c3a8f7..6efc0153987b 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -22,12 +22,6 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk,
 		      int (*okfn)(struct net *, struct sock *,
 				  struct sk_buff *));
 
-static inline struct nf_bridge_info *
-nf_bridge_info_get(const struct sk_buff *skb)
-{
-	return skb->nf_bridge;
-}
-
 unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb);
 
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index c9383c470a83..c58cf68b45c5 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -247,7 +247,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb)
 {
-	switch (skb->nf_bridge->orig_proto) {
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	switch (nf_bridge->orig_proto) {
 	case BRNF_PROTO_8021Q:
 		skb->protocol = htons(ETH_P_8021Q);
 		break;
@@ -569,7 +571,8 @@ static unsigned int br_nf_forward_ip(void *priv,
 	struct net_device *parent;
 	u_int8_t pf;
 
-	if (!skb->nf_bridge)
+	nf_bridge = nf_bridge_info_get(skb);
+	if (!nf_bridge)
 		return NF_ACCEPT;
 
 	/* Need exclusive nf_bridge_info since we might have multiple
@@ -701,7 +704,9 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 
 static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
-	if (skb->nf_bridge->orig_proto == BRNF_PROTO_PPPOE)
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge->orig_proto == BRNF_PROTO_PPPOE)
 		return PPPOE_SES_HLEN;
 	return 0;
 }
@@ -839,7 +844,9 @@ static unsigned int ip_sabotage_in(void *priv,
 				   struct sk_buff *skb,
 				   const struct nf_hook_state *state)
 {
-	if (skb->nf_bridge && !skb->nf_bridge->in_prerouting &&
+	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge && !nf_bridge->in_prerouting &&
 	    !netif_is_l3_master(skb->dev)) {
 		state->okfn(state->net, state->sk, skb);
 		return NF_STOLEN;
@@ -877,7 +884,9 @@ static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
 
 static int br_nf_dev_xmit(struct sk_buff *skb)
 {
-	if (skb->nf_bridge && skb->nf_bridge->bridged_dnat) {
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge && nf_bridge->bridged_dnat) {
 		br_nf_pre_routing_finish_bridge_slow(skb);
 		return 1;
 	}
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 5cd06ba3535d..aa8304c618b8 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -102,6 +102,7 @@ EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sk_buff *oldskb, int hook)
 {
+	struct net_device *br_indev __maybe_unused;
 	struct sk_buff *nskb;
 	struct iphdr *niph;
 	const struct tcphdr *oth;
@@ -147,10 +148,11 @@ void nf_send_reset(struct net *net, struct sk_buff *oldskb, int hook)
 	 * build the eth header using the original destination's MAC as the
 	 * source, and send the RST packet directly.
 	 */
-	if (oldskb->nf_bridge) {
+	br_indev = nf_bridge_get_physindev(oldskb);
+	if (br_indev) {
 		struct ethhdr *oeth = eth_hdr(oldskb);
 
-		nskb->dev = nf_bridge_get_physindev(oldskb);
+		nskb->dev = br_indev;
 		niph->tot_len = htons(nskb->len);
 		ip_send_check(niph);
 		if (dev_hard_header(nskb, nskb->dev, ntohs(nskb->protocol),
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 24858402e374..b9c8a763c863 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -131,6 +131,7 @@ EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);
 
 void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
 {
+	struct net_device *br_indev __maybe_unused;
 	struct sk_buff *nskb;
 	struct tcphdr _otcph;
 	const struct tcphdr *otcph;
@@ -197,15 +198,18 @@ void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
 	 * build the eth header using the original destination's MAC as the
 	 * source, and send the RST packet directly.
 	 */
-	if (oldskb->nf_bridge) {
+	br_indev = nf_bridge_get_physindev(oldskb);
+	if (br_indev) {
 		struct ethhdr *oeth = eth_hdr(oldskb);
 
-		nskb->dev = nf_bridge_get_physindev(oldskb);
+		nskb->dev = br_indev;
 		nskb->protocol = htons(ETH_P_IPV6);
 		ip6h->payload_len = htons(sizeof(struct tcphdr));
 		if (dev_hard_header(nskb, nskb->dev, ntohs(nskb->protocol),
-				    oeth->h_source, oeth->h_dest, nskb->len) < 0)
+				    oeth->h_source, oeth->h_dest, nskb->len) < 0) {
+			kfree_skb(nskb);
 			return;
+		}
 		dev_queue_xmit(nskb);
 	} else
 #endif
diff --git a/net/netfilter/nf_log_common.c b/net/netfilter/nf_log_common.c
index a8c5c846aec1..3a0d6880b7c9 100644
--- a/net/netfilter/nf_log_common.c
+++ b/net/netfilter/nf_log_common.c
@@ -156,22 +156,20 @@ nf_log_dump_packet_common(struct nf_log_buf *m, u_int8_t pf,
 			  const struct net_device *out,
 			  const struct nf_loginfo *loginfo, const char *prefix)
 {
+	const struct net_device *physoutdev __maybe_unused;
+	const struct net_device *physindev __maybe_unused;
+
 	nf_log_buf_add(m, KERN_SOH "%c%sIN=%s OUT=%s ",
 	       '0' + loginfo->u.log.level, prefix,
 	       in ? in->name : "",
 	       out ? out->name : "");
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge) {
-		const struct net_device *physindev;
-		const struct net_device *physoutdev;
-
-		physindev = nf_bridge_get_physindev(skb);
-		if (physindev && in != physindev)
-			nf_log_buf_add(m, "PHYSIN=%s ", physindev->name);
-		physoutdev = nf_bridge_get_physoutdev(skb);
-		if (physoutdev && out != physoutdev)
-			nf_log_buf_add(m, "PHYSOUT=%s ", physoutdev->name);
-	}
+	physindev = nf_bridge_get_physindev(skb);
+	if (physindev && in != physindev)
+		nf_log_buf_add(m, "PHYSIN=%s ", physindev->name);
+	physoutdev = nf_bridge_get_physoutdev(skb);
+	if (physoutdev && out != physoutdev)
+		nf_log_buf_add(m, "PHYSOUT=%s ", physoutdev->name);
 #endif
 }
 EXPORT_SYMBOL_GPL(nf_log_dump_packet_common);
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index d67a96a25a68..a36a77bae1d6 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -46,6 +46,24 @@ void nf_unregister_queue_handler(struct net *net)
 }
 EXPORT_SYMBOL(nf_unregister_queue_handler);
 
+static void nf_queue_entry_release_br_nf_refs(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge) {
+		struct net_device *physdev;
+
+		physdev = nf_bridge_get_physindev(skb);
+		if (physdev)
+			dev_put(physdev);
+		physdev = nf_bridge_get_physoutdev(skb);
+		if (physdev)
+			dev_put(physdev);
+	}
+#endif
+}
+
 void nf_queue_entry_release_refs(struct nf_queue_entry *entry)
 {
 	struct nf_hook_state *state = &entry->state;
@@ -57,20 +75,28 @@ void nf_queue_entry_release_refs(struct nf_queue_entry *entry)
 		dev_put(state->out);
 	if (state->sk)
 		sock_put(state->sk);
+
+	nf_queue_entry_release_br_nf_refs(entry->skb);
+}
+EXPORT_SYMBOL_GPL(nf_queue_entry_release_refs);
+
+static void nf_queue_entry_get_br_nf_refs(struct sk_buff *skb)
+{
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (entry->skb->nf_bridge) {
+	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge) {
 		struct net_device *physdev;
 
-		physdev = nf_bridge_get_physindev(entry->skb);
+		physdev = nf_bridge_get_physindev(skb);
 		if (physdev)
-			dev_put(physdev);
-		physdev = nf_bridge_get_physoutdev(entry->skb);
+			dev_hold(physdev);
+		physdev = nf_bridge_get_physoutdev(skb);
 		if (physdev)
-			dev_put(physdev);
+			dev_hold(physdev);
 	}
 #endif
 }
-EXPORT_SYMBOL_GPL(nf_queue_entry_release_refs);
 
 /* Bump dev refs so they don't vanish while packet is out */
 void nf_queue_entry_get_refs(struct nf_queue_entry *entry)
@@ -83,18 +109,8 @@ void nf_queue_entry_get_refs(struct nf_queue_entry *entry)
 		dev_hold(state->out);
 	if (state->sk)
 		sock_hold(state->sk);
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (entry->skb->nf_bridge) {
-		struct net_device *physdev;
 
-		physdev = nf_bridge_get_physindev(entry->skb);
-		if (physdev)
-			dev_hold(physdev);
-		physdev = nf_bridge_get_physoutdev(entry->skb);
-		if (physdev)
-			dev_hold(physdev);
-	}
-#endif
+	nf_queue_entry_get_br_nf_refs(entry->skb);
 }
 EXPORT_SYMBOL_GPL(nf_queue_entry_get_refs);
 
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 1ce30efe6854..0dcc3592d053 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -727,13 +727,13 @@ nf_queue_entry_dup(struct nf_queue_entry *e)
  */
 static void nf_bridge_adjust_skb_data(struct sk_buff *skb)
 {
-	if (skb->nf_bridge)
+	if (nf_bridge_info_get(skb))
 		__skb_push(skb, skb->network_header - skb->mac_header);
 }
 
 static void nf_bridge_adjust_segmented_data(struct sk_buff *skb)
 {
-	if (skb->nf_bridge)
+	if (nf_bridge_info_get(skb))
 		__skb_pull(skb, skb->network_header - skb->mac_header);
 }
 #else
@@ -904,23 +904,22 @@ nfqnl_set_mode(struct nfqnl_instance *queue,
 static int
 dev_cmp(struct nf_queue_entry *entry, unsigned long ifindex)
 {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	int physinif, physoutif;
+
+	physinif = nf_bridge_get_physinif(entry->skb);
+	physoutif = nf_bridge_get_physoutif(entry->skb);
+
+	if (physinif == ifindex || physoutif == ifindex)
+		return 1;
+#endif
 	if (entry->state.in)
 		if (entry->state.in->ifindex == ifindex)
 			return 1;
 	if (entry->state.out)
 		if (entry->state.out->ifindex == ifindex)
 			return 1;
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (entry->skb->nf_bridge) {
-		int physinif, physoutif;
 
-		physinif = nf_bridge_get_physinif(entry->skb);
-		physoutif = nf_bridge_get_physoutif(entry->skb);
-
-		if (physinif == ifindex || physoutif == ifindex)
-			return 1;
-	}
-#endif
 	return 0;
 }
 
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 9d6d67b953ac..4034d70bff39 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -33,7 +33,7 @@ physdev_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	/* Not a bridged IP packet or no info available yet:
 	 * LOCAL_OUT/mangle and LOCAL_OUT/nat don't know if
 	 * the destination device will be a bridge. */
-	if (!skb->nf_bridge) {
+	if (!nf_bridge_info_exists(skb)) {
 		/* Return MATCH if the invert flags of the used options are on */
 		if ((info->bitmask & XT_PHYSDEV_OP_BRIDGED) &&
 		    !(info->invert & XT_PHYSDEV_OP_BRIDGED))
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/3] sk_buff: add skb extension infrastructure
  2018-11-26 11:38 [RFC PATCH 0/3] sk_buff: add skb extension infrastructure Florian Westphal
  2018-11-26 11:38 ` [RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly Florian Westphal
@ 2018-11-26 11:38 ` Florian Westphal
  2018-11-26 11:38 ` [RFC PATCH 3/3] net: convert bridge_nf to use " Florian Westphal
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-11-26 11:38 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

adds an extension infrastructure for sk_buff:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.

This is also how xfrm and bridge netfilter skb-associated data
(skb->sp and skb->nf_bridge) are handled.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been allocated for the skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions is 0, its content is undefined.

The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same,
in a followup patch.

This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but replaces the existing code that
deals with skb->nf_bridge.

This patch only adds the basic infrastructure, the nf_bridge conversion
is done in the next patch.

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension is planned
as a followup.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h | 124 +++++++++++++++++++++++++++++++++++++-
 net/Kconfig            |   3 +
 net/core/skbuff.c      | 131 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/ip_output.c   |   1 +
 net/ipv6/ip6_output.c  |   1 +
 5 files changed, 259 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..832904d71a85 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -245,6 +245,7 @@ struct iov_iter;
 struct napi_struct;
 struct bpf_prog;
 union bpf_attr;
+struct skb_ext;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -633,6 +634,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@xmit_more: More SKBs are pending for this queue
  *	@pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
+ *	@active_extensions: active extensions (skb_ext_id types)
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
  *	@l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -662,6 +664,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@data: Data head pointer
  *	@truesize: Buffer size
  *	@users: User count - see {datagram,tcp}.c
+ *	@extensions: allocated extensions, valid if active_extensions is nonzero
  */
 
 struct sk_buff {
@@ -744,7 +747,9 @@ struct sk_buff {
 				head_frag:1,
 				xmit_more:1,
 				pfmemalloc:1;
-
+#ifdef CONFIG_SKB_EXTENSIONS
+	__u8			active_extensions;
+#endif
 	/* fields enclosed in headers_start/headers_end are copied
 	 * using a single memcpy() in __copy_skb_header()
 	 */
@@ -866,6 +871,11 @@ struct sk_buff {
 				*data;
 	unsigned int		truesize;
 	refcount_t		users;
+
+#ifdef CONFIG_SKB_EXTENSIONS
+	/* only useable after checking ->active_extensions != 0 */
+	struct skb_ext		*extensions;
+#endif
 };
 
 #ifdef __KERNEL__
@@ -3889,6 +3899,118 @@ static inline void nf_conntrack_get(struct nf_conntrack *nfct)
 		atomic_inc(&nfct->use);
 }
 #endif
+
+#ifdef CONFIG_SKB_EXTENSIONS
+enum skb_ext_id {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	SKB_EXT_BRIDGE_NF,
+#endif
+	SKB_EXT_NUM, /* must be last */
+};
+
+/* each extension aligned to this value */
+#define SKB_EXT_ALIGN	8
+/* offsets/len: left-shift needed to translate offset to bytes */
+#define SKB_EXT_ALIGN_SHIFT 3
+
+/**
+ *	struct skb_ext - sk_buff extensions
+ *	@refcount: 1 on allocation, deallocated on 0
+ *	@offset: offset to add to @data to obtain extension address
+ *	@len: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units
+ *	@data: start of extension data, variable sized
+ *
+ *	Note: offsets and len are stored in chunks of 8 bytes, this allows
+ *	to use 'u8' types while allowing up to 2kb worth of extension data.
+ */
+struct skb_ext {
+	refcount_t refcnt;
+	u8 offset[SKB_EXT_NUM]; /* chunks of 8 bytes */
+	u8 len;			/* same, i.e. size == len << 3 */
+	char data[0] __aligned(SKB_EXT_ALIGN);
+};
+
+void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_free(struct skb_ext *ext);
+
+static inline void __skb_ext_put(struct skb_ext *ext)
+{
+	if (ext && refcount_dec_and_test(&ext->refcnt))
+		__skb_ext_free(ext);
+}
+
+static inline void skb_ext_put(struct sk_buff *skb)
+{
+	if (skb->active_extensions)
+		__skb_ext_put(skb->extensions);
+}
+
+static inline void skb_ext_get(struct sk_buff *skb)
+{
+	if (skb->active_extensions) {
+		struct skb_ext *ext = skb->extensions;
+
+		if (ext)
+			refcount_inc(&ext->refcnt);
+	}
+}
+
+static inline void __skb_ext_copy(struct sk_buff *dst,
+				  const struct sk_buff *src)
+{
+	dst->active_extensions = src->active_extensions;
+
+	if (src->active_extensions) {
+		struct skb_ext *ext = src->extensions;
+
+		if (ext)
+			refcount_inc(&ext->refcnt);
+		dst->extensions = ext;
+	}
+}
+
+static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *src)
+{
+	skb_ext_put(dst);
+	__skb_ext_copy(dst, src);
+}
+
+static inline bool __skb_ext_exist(const struct skb_ext *ext, enum skb_ext_id i)
+{
+	return !!ext->offset[i];
+}
+
+static inline bool skb_ext_exist(const struct sk_buff *skb, enum skb_ext_id id)
+{
+	return skb->active_extensions & (1 << id);
+}
+
+static inline void skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
+{
+	if (skb_ext_exist(skb, id))
+		__skb_ext_del(skb, id);
+}
+
+static inline void *skb_ext_find(const struct sk_buff *skb, enum skb_ext_id id)
+{
+	if (skb_ext_exist(skb, id)) {
+		struct skb_ext *ext = skb->extensions;
+
+		if (ext && __skb_ext_exist(ext, id))
+			return (void *)ext + (ext->offset[id] << 3);
+	}
+
+	return NULL;
+}
+#else
+static inline void skb_ext_put(struct sk_buff *skb) {}
+static inline void skb_ext_get(struct sk_buff *skb) {}
+static inline void skb_ext_del(struct sk_buff *skb, int unused) {}
+static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
+static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
+#endif /* CONFIG_SKB_EXTENSIONS */
+
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
 {
diff --git a/net/Kconfig b/net/Kconfig
index f235edb593ba..93b291292860 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -51,6 +51,9 @@ config NET_INGRESS
 config NET_EGRESS
 	bool
 
+config SKB_EXTENSIONS
+	bool
+
 menu "Networking options"
 
 source "net/packet/Kconfig"
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 02cd7ae3d0fb..e29016030633 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -617,6 +617,7 @@ void skb_release_head_state(struct sk_buff *skb)
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	nf_bridge_put(skb->nf_bridge);
 #endif
+	skb_ext_put(skb);
 }
 
 /* Free everything but the sk_buff shell. */
@@ -796,6 +797,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->dev		= old->dev;
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	skb_dst_copy(new, old);
+	__skb_ext_copy(new, old);
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
@@ -5531,3 +5533,132 @@ void skb_condense(struct sk_buff *skb)
 	 */
 	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 }
+
+#ifdef CONFIG_SKB_EXTENSIONS
+static const u8 skb_ext_type_len[] = {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	[SKB_EXT_BRIDGE_NF] = sizeof(struct nf_bridge_info),
+#endif
+};
+
+static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
+{
+	return (void *)ext + (ext->offset[id] << SKB_EXT_ALIGN_SHIFT);
+}
+
+static struct skb_ext *skb_ext_cow(unsigned int len,
+				   struct skb_ext *old)
+{
+	struct skb_ext *new = kmalloc(len, GFP_ATOMIC);
+
+	if (!new)
+		return NULL;
+
+	if (!old) {
+		memset(new->offset, 0, sizeof(new->offset));
+		refcount_set(&new->refcnt, 1);
+		return new;
+	}
+
+	memcpy(new, old, old->len << SKB_EXT_ALIGN_SHIFT);
+	refcount_set(&new->refcnt, 1);
+	__skb_ext_put(old);
+	return new;
+}
+
+static __always_inline unsigned int skb_ext_total_length(void)
+{
+	return 0 +
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+		skb_ext_type_len[SKB_EXT_BRIDGE_NF] +
+#endif
+		0;
+}
+
+/**
+ * skb_ext_add - allocate space for given extension, COW if needed
+ * @skb: buffer
+ * @id: extension to allocate space for
+ *
+ * Allocates enough space for the given extension.
+ * If the extension is already present, a pointer to that extension
+ * is returned.
+ *
+ * If the skb was cloned, COW applies and the returned memory can be
+ * modified without changing the extension space of clones buffers.
+ *
+ * Returns pointer to the extenion or NULL on allocation failure.
+ */
+void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
+{
+	unsigned int newlen, newoff, oldlen;
+	struct skb_ext *new, *old = NULL;
+	bool cow_needed = true;
+
+	BUILD_BUG_ON(SKB_EXT_NUM >= 8);
+	BUILD_BUG_ON(skb_ext_total_length() > (255 << 3));
+
+	if (skb->active_extensions) {
+		old = skb->extensions;
+
+		cow_needed = refcount_read(&old->refcnt) > 1;
+
+		if (__skb_ext_exist(old, id)) {
+			if (!cow_needed) {
+				new = old;
+				goto set_active;
+			}
+
+			/* extension was allocated previously and it
+			 * might be used by a cloned skb. COW needed.
+			 */
+			new = skb_ext_cow(old->len << SKB_EXT_ALIGN_SHIFT, old);
+			if (!new)
+				return NULL;
+
+			skb->extensions = new;
+			goto set_active;
+		}
+		oldlen = old->len << SKB_EXT_ALIGN_SHIFT;
+	} else {
+		oldlen = sizeof(*new);
+	}
+
+	newoff = ALIGN(oldlen, SKB_EXT_ALIGN);
+	newlen = newoff + skb_ext_type_len[id];
+
+	if (cow_needed)
+		new = skb_ext_cow(newlen, old);
+	else
+		new = krealloc(old, newlen, GFP_ATOMIC);
+	if (!new)
+		return NULL;
+
+	new->offset[id] = newoff >> SKB_EXT_ALIGN_SHIFT;
+	new->len = newlen >> SKB_EXT_ALIGN_SHIFT;
+	skb->extensions = new;
+set_active:
+	skb->active_extensions |= 1 << id;
+	return skb_ext_get_ptr(new, id);
+}
+EXPORT_SYMBOL(skb_ext_add);
+
+void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
+{
+	struct skb_ext *ext;
+
+	skb->active_extensions &= ~(1 << id);
+	if (skb->active_extensions == 0) {
+		ext = skb->extensions;
+		skb->extensions = NULL;
+		__skb_ext_put(ext);
+	}
+}
+EXPORT_SYMBOL(__skb_ext_del);
+
+void __skb_ext_free(struct skb_ext *ext)
+{
+	kfree(ext);
+}
+EXPORT_SYMBOL(__skb_ext_free);
+#endif /* CONFIG_SKB_EXTENSIONS */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c09219e7f230..a12e12f983d5 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -533,6 +533,7 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	to->tc_index = from->tc_index;
 #endif
 	nf_copy(to, from);
+	skb_ext_copy(to, from);
 #if IS_ENABLED(CONFIG_IP_VS)
 	to->ipvs_property = from->ipvs_property;
 #endif
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 89e0d5118afe..7eeb0f24be87 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -574,6 +574,7 @@ static void ip6_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	to->tc_index = from->tc_index;
 #endif
 	nf_copy(to, from);
+	skb_ext_copy(to, from);
 	skb_copy_secmark(to, from);
 }
 
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 3/3] net: convert bridge_nf to use skb extension infrastructure
  2018-11-26 11:38 [RFC PATCH 0/3] sk_buff: add skb extension infrastructure Florian Westphal
  2018-11-26 11:38 ` [RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly Florian Westphal
  2018-11-26 11:38 ` [RFC PATCH 2/3] sk_buff: add skb extension infrastructure Florian Westphal
@ 2018-11-26 11:38 ` Florian Westphal
  2018-11-26 20:41 ` [RFC PATCH 0/3] sk_buff: add " David Miller
  2018-11-26 23:05 ` Eric Dumazet
  4 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-11-26 11:38 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure instead.

The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h     |  4 ++--
 include/linux/skbuff.h               | 28 ++--------------------------
 include/net/netfilter/br_netfilter.h |  8 ++++----
 net/Kconfig                          |  1 +
 net/bridge/br_netfilter_hooks.c      | 20 ++------------------
 net/bridge/br_netfilter_ipv6.c       |  4 ++--
 net/core/skbuff.c                    |  3 ---
 7 files changed, 13 insertions(+), 55 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 0a65a422587c..5f2614d02e03 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -20,12 +20,12 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
-	return skb->nf_bridge;
+	return skb_ext_find(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
 {
-	return skb->nf_bridge != NULL;
+	return skb_ext_exist(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 832904d71a85..c33654da09c3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -255,7 +255,6 @@ struct nf_conntrack {
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 struct nf_bridge_info {
-	refcount_t		use;
 	enum {
 		BRNF_PROTO_UNCHANGED,
 		BRNF_PROTO_8021Q,
@@ -717,9 +716,6 @@ struct sk_buff {
 #endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	unsigned long		 _nfct;
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	struct nf_bridge_info	*nf_bridge;
 #endif
 	unsigned int		len,
 				data_len;
@@ -4011,18 +4007,6 @@ static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
 static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
 #endif /* CONFIG_SKB_EXTENSIONS */
 
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
-{
-	if (nf_bridge && refcount_dec_and_test(&nf_bridge->use))
-		kfree(nf_bridge);
-}
-static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
-{
-	if (nf_bridge)
-		refcount_inc(&nf_bridge->use);
-}
-#endif /* CONFIG_BRIDGE_NETFILTER */
 static inline void nf_reset(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
@@ -4030,8 +4014,7 @@ static inline void nf_reset(struct sk_buff *skb)
 	skb->_nfct = 0;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	nf_bridge_put(skb->nf_bridge);
-	skb->nf_bridge = NULL;
+	skb_ext_del(skb, SKB_EXT_BRIDGE_NF);
 #endif
 }
 
@@ -4049,7 +4032,7 @@ static inline void ipvs_reset(struct sk_buff *skb)
 #endif
 }
 
-/* Note: This doesn't put any conntrack and bridge info in dst. */
+/* Note: This doesn't put any conntrack info in dst. */
 static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 			     bool copy)
 {
@@ -4057,10 +4040,6 @@ static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 	dst->_nfct = src->_nfct;
 	nf_conntrack_get(skb_nfct(src));
 #endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	dst->nf_bridge  = src->nf_bridge;
-	nf_bridge_get(src->nf_bridge);
-#endif
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE) || defined(CONFIG_NF_TABLES)
 	if (copy)
 		dst->nf_trace = src->nf_trace;
@@ -4071,9 +4050,6 @@ static inline void nf_copy(struct sk_buff *dst, const struct sk_buff *src)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	nf_conntrack_put(skb_nfct(dst));
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	nf_bridge_put(dst->nf_bridge);
 #endif
 	__nf_copy(dst, src, true);
 }
diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index 6efc0153987b..4cd56808ac4e 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -6,12 +6,12 @@
 
 static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 {
-	skb->nf_bridge = kzalloc(sizeof(struct nf_bridge_info), GFP_ATOMIC);
+	struct nf_bridge_info *b = skb_ext_add(skb, SKB_EXT_BRIDGE_NF);
 
-	if (likely(skb->nf_bridge))
-		refcount_set(&(skb->nf_bridge->use), 1);
+	if (b)
+		memset(b, 0, sizeof(*b));
 
-	return skb->nf_bridge;
+	return b;
 }
 
 void nf_bridge_update_protocol(struct sk_buff *skb);
diff --git a/net/Kconfig b/net/Kconfig
index 93b291292860..5cb9de1aaf88 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -187,6 +187,7 @@ config BRIDGE_NETFILTER
 	depends on NETFILTER && INET
 	depends on NETFILTER_ADVANCED
 	select NETFILTER_FAMILY_BRIDGE
+	select SKB_EXTENSIONS
 	default m
 	---help---
 	  Enabling this option will let arptables resp. iptables see bridged
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index c58cf68b45c5..d21a23698410 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -132,10 +132,7 @@ static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage);
 
 static void nf_bridge_info_free(struct sk_buff *skb)
 {
-	if (skb->nf_bridge) {
-		nf_bridge_put(skb->nf_bridge);
-		skb->nf_bridge = NULL;
-	}
+	skb_ext_del(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline struct net_device *bridge_parent(const struct net_device *dev)
@@ -148,19 +145,7 @@ static inline struct net_device *bridge_parent(const struct net_device *dev)
 
 static inline struct nf_bridge_info *nf_bridge_unshare(struct sk_buff *skb)
 {
-	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
-
-	if (refcount_read(&nf_bridge->use) > 1) {
-		struct nf_bridge_info *tmp = nf_bridge_alloc(skb);
-
-		if (tmp) {
-			memcpy(tmp, nf_bridge, sizeof(struct nf_bridge_info));
-			refcount_set(&tmp->use, 1);
-		}
-		nf_bridge_put(nf_bridge);
-		nf_bridge = tmp;
-	}
-	return nf_bridge;
+	return skb_ext_add(skb, SKB_EXT_BRIDGE_NF);
 }
 
 unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
@@ -508,7 +493,6 @@ static unsigned int br_nf_pre_routing(void *priv,
 	if (br_validate_ipv4(state->net, skb))
 		return NF_DROP;
 
-	nf_bridge_put(skb->nf_bridge);
 	if (!nf_bridge_alloc(skb))
 		return NF_DROP;
 	if (!setup_pre_routing(skb))
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 96c072e71ea2..94039f588f1d 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -224,8 +224,8 @@ unsigned int br_nf_pre_routing_ipv6(void *priv,
 	if (br_validate_ipv6(state->net, skb))
 		return NF_DROP;
 
-	nf_bridge_put(skb->nf_bridge);
-	if (!nf_bridge_alloc(skb))
+	nf_bridge = nf_bridge_alloc(skb);
+	if (!nf_bridge)
 		return NF_DROP;
 	if (!setup_pre_routing(skb))
 		return NF_DROP;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e29016030633..614499d9898c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -613,9 +613,6 @@ void skb_release_head_state(struct sk_buff *skb)
 	}
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 	nf_conntrack_put(skb_nfct(skb));
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	nf_bridge_put(skb->nf_bridge);
 #endif
 	skb_ext_put(skb);
 }
-- 
2.18.1

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
  2018-11-26 11:38 [RFC PATCH 0/3] sk_buff: add skb extension infrastructure Florian Westphal
                   ` (2 preceding siblings ...)
  2018-11-26 11:38 ` [RFC PATCH 3/3] net: convert bridge_nf to use " Florian Westphal
@ 2018-11-26 20:41 ` David Miller
  2018-11-26 21:19   ` Florian Westphal
  2018-11-26 23:05 ` Eric Dumazet
  4 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2018-11-26 20:41 UTC (permalink / raw)
  To: fw; +Cc: netdev

From: Florian Westphal <fw@strlen.de>
Date: Mon, 26 Nov 2018 12:38:54 +0100

> This adds an extension infrastructure for sk_buff instead:
> 1. extension memory is released when the sk_buff is free'd.
> 2. data is shared after cloning an skb.
> 3. adding extension to an skb will COW the extension
>    buffer if needed.

So MP-TCP, when enabled for a connection, will have a new atomic
operation for every packet?

And new tests all in the fast paths of the networking to facilitate
this feature, a cost paid by everyone.

Sorry, that doesn't seem like a good idea to me.

Can't they just encode whatever huge amount of crap they want to
put into the CB by deriving the information from skb->sk and some
tiny value like an index or something to resolve the path?

In the future please document what is so enormous and absolutely
required that they must put it all into the SKB control block.

Like Eric, I am concerned about the slow creep of overhead.  Lots of
small "not that bad" additions of extra cycles here and there over
time adds up to impossible to fix performance regressions.

I'm sorry if this is a major disappointment for the MP-TCP folks but a
better way needs to be found to integrate what they want to do with
real zero cost for the rest of the world which won't be using MP-TCP
and therefore should not be paying for it's added overhead at all.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
  2018-11-26 20:41 ` [RFC PATCH 0/3] sk_buff: add " David Miller
@ 2018-11-26 21:19   ` Florian Westphal
  2018-11-26 21:27     ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Westphal @ 2018-11-26 21:19 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev

David Miller <davem@davemloft.net> wrote:
> > This adds an extension infrastructure for sk_buff instead:
> > 1. extension memory is released when the sk_buff is free'd.
> > 2. data is shared after cloning an skb.
> > 3. adding extension to an skb will COW the extension
> >    buffer if needed.
> 
> So MP-TCP, when enabled for a connection, will have a new atomic
> operation for every packet?

Yes, at least for every kfree_skb call.

> And new tests all in the fast paths of the networking to facilitate
> this feature, a cost paid by everyone.

No, right now everyone has two non-atomic tests (skb->sp + skb->nf_bridge),
with this proposal everyone has one (skb->active_extensions), assuming that
both br_nf and xfrm are converted to use the extension system.

Test(s) occur both on copy/clone and kfree_skb, just like in current
kernels.

atomic test(s) are done in case skb->{sp,nf_bridge} are set, with
this patch its done if skb->active_exensions is != 0.

So from that angle current status is kept.

Main motivation was to find a solution that does not add more costs
for normal cases.

I did a quick hack to also convert skb->sp, it seems possible to do so.

In that case skbuff size is reduced by 8 bytes as sp/nf_bridge get
replaced by single 'extension pointer', and slightly less code
provided kernel is built with both XFRM and bridge netfilter support.

> Sorry, that doesn't seem like a good idea to me.
>
> Can't they just encode whatever huge amount of crap they want to
> put into the CB by deriving the information from skb->sk and some
> tiny value like an index or something to resolve the path?

Perhaps, if thats the only way I'm afraid thats what will need to be
used.  I did try such a scheme once in the past to get
rid of skb->nf_bridge and things became very very fugly due to
kfree_skb() not being aware of such 'external storage', i.e. no
way to easily clean the external storage when an skbuff gets tossed.

Might be possibe to use destructor to take care of this in mptcp case.
I can have a look if this is the only possible way.

> In the future please document what is so enormous and absolutely
> required that they must put it all into the SKB control block.

Ok, will do.

> Like Eric, I am concerned about the slow creep of overhead.  Lots of
> small "not that bad" additions of extra cycles here and there over
> time adds up to impossible to fix performance regressions.

I have the same concern, which is why i am proposing the conversion
of xfrm and nf_bridge to use this instead of the current
nf_bridge/secpath maintanance.

Although MPTCP is the main motivation here, it was intended as a
standalone series, i.e., these 3 patches and a few followup changes
to convert xfrm.

> I'm sorry if this is a major disappointment for the MP-TCP folks but a
> better way needs to be found to integrate what they want to do with
> real zero cost for the rest of the world which won't be using MP-TCP
> and therefore should not be paying for it's added overhead at all.

Agreed.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
  2018-11-26 21:19   ` Florian Westphal
@ 2018-11-26 21:27     ` David Miller
  2018-11-26 21:41       ` Florian Westphal
  0 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2018-11-26 21:27 UTC (permalink / raw)
  To: fw; +Cc: netdev

From: Florian Westphal <fw@strlen.de>
Date: Mon, 26 Nov 2018 22:19:33 +0100

>> In the future please document what is so enormous and absolutely
>> required that they must put it all into the SKB control block.
> 
> Ok, will do.

Why don't we establish the details about what MP-TCP needs in the CB
before we discuss this further.

Could you do that for us?

I'm open minded about your approach still, now that I've taken the
xfrm array and nf_bridge aspects into consideration.  So please keep
pursuing this.

Thanks Florian.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
  2018-11-26 21:27     ` David Miller
@ 2018-11-26 21:41       ` Florian Westphal
  0 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-11-26 21:41 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 26 Nov 2018 22:19:33 +0100
> 
> >> In the future please document what is so enormous and absolutely
> >> required that they must put it all into the SKB control block.
> > 
> > Ok, will do.
> 
> Why don't we establish the details about what MP-TCP needs in the CB
> before we discuss this further.
> 
> Could you do that for us?

Sure.

There is also an MPTCP conf call this week and I'll present this
patchset there as well.

I'll post the requirements and what possible alternatives have
been considered here as a follow-up.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
  2018-11-26 11:38 [RFC PATCH 0/3] sk_buff: add skb extension infrastructure Florian Westphal
                   ` (3 preceding siblings ...)
  2018-11-26 20:41 ` [RFC PATCH 0/3] sk_buff: add " David Miller
@ 2018-11-26 23:05 ` Eric Dumazet
  2018-11-28  0:39   ` David Miller
  4 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2018-11-26 23:05 UTC (permalink / raw)
  To: Florian Westphal, netdev



On 11/26/2018 03:38 AM, Florian Westphal wrote:
> The (out-of-tree) Multipath-TCP implementation needs a significant amount
> of extra space in the skb control buffer.
> 
> Increasing skb->cb[] size in mainline is a non-starter for memory and
> and performance reasons (f.e. increase in cb size also moves several
> frequently-accessed fields to other cache lines).
>

One thing that could be done without too much impact is to provide a cbext[]
only for TCP packets in write/rtx queue, that is not in sk_buff but
on the struct sk_buff_fclones

This extra space would not be 0-initialized at alloc_skb()
and would not be copied at skb_clone()

I mentioned this idea a while back already.


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71c8800d81b744a936a7420f33b459..c4bfc2fd98eb9723c0219d5cd8bf5b28afaf5398 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1018,6 +1018,8 @@ struct sk_buff_fclones {
        struct sk_buff  skb2;
 
        refcount_t      fclone_ref;
+
+       char            cbext[128] __aligned(8);
 };
 
 /**

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/3] sk_buff: add skb extension infrastructure
  2018-11-26 23:05 ` Eric Dumazet
@ 2018-11-28  0:39   ` David Miller
  0 siblings, 0 replies; 10+ messages in thread
From: David Miller @ 2018-11-28  0:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: fw, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 26 Nov 2018 15:05:02 -0800

> One thing that could be done without too much impact is to provide a cbext[]
> only for TCP packets in write/rtx queue, that is not in sk_buff but
> on the struct sk_buff_fclones
> 
> This extra space would not be 0-initialized at alloc_skb()
> and would not be copied at skb_clone()
> 
> I mentioned this idea a while back already.
> 
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 73902acf2b71c8800d81b744a936a7420f33b459..c4bfc2fd98eb9723c0219d5cd8bf5b28afaf5398 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1018,6 +1018,8 @@ struct sk_buff_fclones {
>         struct sk_buff  skb2;
>  
>         refcount_t      fclone_ref;
> +
> +       char            cbext[128] __aligned(8);
>  };

So, potentially at least, MP-TCP could use this.

The down side is that it doesn't allow for the br_netfilter, skb->sp,
etc. consoldation down to one test.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-11-28 11:38 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-26 11:38 [RFC PATCH 0/3] sk_buff: add skb extension infrastructure Florian Westphal
2018-11-26 11:38 ` [RFC PATCH 1/3] netfilter: avoid using skb->nf_bridge directly Florian Westphal
2018-11-26 11:38 ` [RFC PATCH 2/3] sk_buff: add skb extension infrastructure Florian Westphal
2018-11-26 11:38 ` [RFC PATCH 3/3] net: convert bridge_nf to use " Florian Westphal
2018-11-26 20:41 ` [RFC PATCH 0/3] sk_buff: add " David Miller
2018-11-26 21:19   ` Florian Westphal
2018-11-26 21:27     ` David Miller
2018-11-26 21:41       ` Florian Westphal
2018-11-26 23:05 ` Eric Dumazet
2018-11-28  0:39   ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.