All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/13] sk_buff: add extension infrastructure
@ 2018-12-10 14:49 Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 01/13] netfilter: avoid using skb->nf_bridge directly Florian Westphal
                   ` (13 more replies)
  0 siblings, 14 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev

The (out-of-tree) Multipath-TCP implementation needs to map logical mptcp
sequence numbers to the tcp sequence numbers used by individual subflows.
This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.

Increasing skb->cb[] size in mainline to store the DSS mapping
is a non-starter for memory and and performance reasons
(f.e. increase in cb size also moves several frequently-accessed fields
 to other cache lines).

Extend skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method
would be required for the receive side.

The current MPTCP implementation adds an additional mptcp specific
pointer to sk_buff.

This series adds an extension infrastructure for sk_buff instead:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.
3. adding extension to an skb will COW the extension buffer if needed.

This is also how xfrm and bridge_nf extra data (skb->sp, skb->nf_bridge)
are handled.

MPTCP could then add a new SKB_EXT_MPTCP_DSS (or similar) to store the
mapping for tx and rx processing.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   are available for this skb.
2. extension pointer, located at the end of the sk_buff.
   If the active_extensions byte is 0, the pointer is undefined.

Third patch converts nf_bridge to use the extension infrastructure:
The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.

After this, there are a few preparation patches to reduce "skb->sp"
usage by using the secpath helper functions instead.

Last patch converts skb->sp, secpath information gets stored as
new SKB_EXT_SEC_PATH, so the 'sp' pointer is removed from skbuff.

Extra code added to skb clone and free paths (to deal with refcount/free
of extension area) replace the existing code that does the same for
skb->nf_bridge and skb->secpath.

I don't see any other in-tree users that could benefit from this
infrastructure, it doesn't make sense to add an extension just for the sake
of a single flag bit (like skb->nf_trace).

Changes since RFC:

Convert secpath.

Unlike nf_bridge, the secpath struct needs to hold reference on the
xfrm state structure(s), thus handling gets more complicated when
an existing secpath extension has to be COW'd (we need to take additional
reference count on the xfrm states contained in the new copy).

Florian Westphal (13):
      netfilter: avoid using skb->nf_bridge directly
      sk_buff: add skb extension infrastructure
      net: convert bridge_nf to use skb extension infrastructure
      xfrm: change secpath_set to return secpath struct, not error value
      net: move secpath_exist helper to sk_buff.h
      net: use skb_sec_path helper in more places
      drivers: net: intel: use secpath helpers in more places
      drivers: net: ethernet: mellanox: use skb_sec_path helper
      drivers: net: netdevsim: use skb_sec_path helper
      xfrm: use secpath_exist where applicable
      drivers: chelsio: use skb_sec_path helper
      xfrm: prefer secpath_set over secpath_dup
      net: switch secpath to use skb extension infrastructure

 Documentation/networking/xfrm_device.txt                      |    7 
 drivers/crypto/chelsio/chcr_ipsec.c                           |    4 
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c                |   15 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c                 |    5 
 drivers/net/ethernet/intel/ixgbevf/ipsec.c                    |   15 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c             |    2 
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c |   19 -
 drivers/net/netdevsim/ipsec.c                                 |    7 
 include/linux/netfilter_bridge.h                              |   33 +
 include/linux/skbuff.h                                        |  160 +++++++-
 include/net/netfilter/br_netfilter.h                          |   14 
 include/net/xfrm.h                                            |   40 --
 net/Kconfig                                                   |    4 
 net/bridge/br_netfilter_hooks.c                               |   39 --
 net/bridge/br_netfilter_ipv6.c                                |    4 
 net/core/skbuff.c                                             |  182 +++++++++-
 net/ipv4/esp4.c                                               |    9 
 net/ipv4/esp4_offload.c                                       |   15 
 net/ipv4/ip_output.c                                          |    1 
 net/ipv4/netfilter/nf_reject_ipv4.c                           |    6 
 net/ipv6/esp6.c                                               |    9 
 net/ipv6/esp6_offload.c                                       |   15 
 net/ipv6/ip6_output.c                                         |    1 
 net/ipv6/netfilter/nf_reject_ipv6.c                           |   10 
 net/ipv6/xfrm6_input.c                                        |    8 
 net/netfilter/nf_log_common.c                                 |   20 -
 net/netfilter/nf_queue.c                                      |   50 +-
 net/netfilter/nfnetlink_queue.c                               |   23 -
 net/netfilter/nft_meta.c                                      |    2 
 net/netfilter/nft_xfrm.c                                      |    2 
 net/netfilter/xt_physdev.c                                    |    2 
 net/netfilter/xt_policy.c                                     |    2 
 net/xfrm/Kconfig                                              |    1 
 net/xfrm/xfrm_device.c                                        |    4 
 net/xfrm/xfrm_input.c                                         |   76 +---
 net/xfrm/xfrm_interface.c                                     |    2 
 net/xfrm/xfrm_output.c                                        |    7 
 net/xfrm/xfrm_policy.c                                        |   19 -
 security/selinux/xfrm.c                                       |    4 
 39 files changed, 553 insertions(+), 285 deletions(-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH net-next 01/13] netfilter: avoid using skb->nf_bridge directly
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
@ 2018-12-10 14:49 ` Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 02/13] sk_buff: add skb extension infrastructure Florian Westphal
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h     | 33 +++++++++++++-----
 include/net/netfilter/br_netfilter.h |  6 ----
 net/bridge/br_netfilter_hooks.c      | 19 ++++++++---
 net/ipv4/netfilter/nf_reject_ipv4.c  |  6 ++--
 net/ipv6/netfilter/nf_reject_ipv6.c  | 10 ++++--
 net/netfilter/nf_log_common.c        | 20 +++++------
 net/netfilter/nf_queue.c             | 50 ++++++++++++++++++----------
 net/netfilter/nfnetlink_queue.c      | 23 ++++++-------
 net/netfilter/xt_physdev.c           |  2 +-
 9 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index fa0686500970..0a65a422587c 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -17,43 +17,58 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 		skb_dst_drop(skb);
 }
 
+static inline struct nf_bridge_info *
+nf_bridge_info_get(const struct sk_buff *skb)
+{
+	return skb->nf_bridge;
+}
+
+static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
+{
+	return skb->nf_bridge != NULL;
+}
+
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
 {
-	struct nf_bridge_info *nf_bridge;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-	if (skb->nf_bridge == NULL)
+	if (!nf_bridge)
 		return 0;
 
-	nf_bridge = skb->nf_bridge;
 	return nf_bridge->physindev ? nf_bridge->physindev->ifindex : 0;
 }
 
 static inline int nf_bridge_get_physoutif(const struct sk_buff *skb)
 {
-	struct nf_bridge_info *nf_bridge;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
-	if (skb->nf_bridge == NULL)
+	if (!nf_bridge)
 		return 0;
 
-	nf_bridge = skb->nf_bridge;
 	return nf_bridge->physoutdev ? nf_bridge->physoutdev->ifindex : 0;
 }
 
 static inline struct net_device *
 nf_bridge_get_physindev(const struct sk_buff *skb)
 {
-	return skb->nf_bridge ? skb->nf_bridge->physindev : NULL;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	return nf_bridge ? nf_bridge->physindev : NULL;
 }
 
 static inline struct net_device *
 nf_bridge_get_physoutdev(const struct sk_buff *skb)
 {
-	return skb->nf_bridge ? skb->nf_bridge->physoutdev : NULL;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	return nf_bridge ? nf_bridge->physoutdev : NULL;
 }
 
 static inline bool nf_bridge_in_prerouting(const struct sk_buff *skb)
 {
-	return skb->nf_bridge && skb->nf_bridge->in_prerouting;
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	return nf_bridge && nf_bridge->in_prerouting;
 }
 #else
 #define br_drop_fake_rtable(skb)	        do { } while (0)
diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index 74af19c3a8f7..6efc0153987b 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -22,12 +22,6 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net, struct sock *sk,
 		      int (*okfn)(struct net *, struct sock *,
 				  struct sk_buff *));
 
-static inline struct nf_bridge_info *
-nf_bridge_info_get(const struct sk_buff *skb)
-{
-	return skb->nf_bridge;
-}
-
 unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb);
 
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index c9383c470a83..c58cf68b45c5 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -247,7 +247,9 @@ static int br_validate_ipv4(struct net *net, struct sk_buff *skb)
 
 void nf_bridge_update_protocol(struct sk_buff *skb)
 {
-	switch (skb->nf_bridge->orig_proto) {
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	switch (nf_bridge->orig_proto) {
 	case BRNF_PROTO_8021Q:
 		skb->protocol = htons(ETH_P_8021Q);
 		break;
@@ -569,7 +571,8 @@ static unsigned int br_nf_forward_ip(void *priv,
 	struct net_device *parent;
 	u_int8_t pf;
 
-	if (!skb->nf_bridge)
+	nf_bridge = nf_bridge_info_get(skb);
+	if (!nf_bridge)
 		return NF_ACCEPT;
 
 	/* Need exclusive nf_bridge_info since we might have multiple
@@ -701,7 +704,9 @@ br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
 
 static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
-	if (skb->nf_bridge->orig_proto == BRNF_PROTO_PPPOE)
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge->orig_proto == BRNF_PROTO_PPPOE)
 		return PPPOE_SES_HLEN;
 	return 0;
 }
@@ -839,7 +844,9 @@ static unsigned int ip_sabotage_in(void *priv,
 				   struct sk_buff *skb,
 				   const struct nf_hook_state *state)
 {
-	if (skb->nf_bridge && !skb->nf_bridge->in_prerouting &&
+	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge && !nf_bridge->in_prerouting &&
 	    !netif_is_l3_master(skb->dev)) {
 		state->okfn(state->net, state->sk, skb);
 		return NF_STOLEN;
@@ -877,7 +884,9 @@ static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
 
 static int br_nf_dev_xmit(struct sk_buff *skb)
 {
-	if (skb->nf_bridge && skb->nf_bridge->bridged_dnat) {
+	const struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge && nf_bridge->bridged_dnat) {
 		br_nf_pre_routing_finish_bridge_slow(skb);
 		return 1;
 	}
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 5cd06ba3535d..aa8304c618b8 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -102,6 +102,7 @@ EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sk_buff *oldskb, int hook)
 {
+	struct net_device *br_indev __maybe_unused;
 	struct sk_buff *nskb;
 	struct iphdr *niph;
 	const struct tcphdr *oth;
@@ -147,10 +148,11 @@ void nf_send_reset(struct net *net, struct sk_buff *oldskb, int hook)
 	 * build the eth header using the original destination's MAC as the
 	 * source, and send the RST packet directly.
 	 */
-	if (oldskb->nf_bridge) {
+	br_indev = nf_bridge_get_physindev(oldskb);
+	if (br_indev) {
 		struct ethhdr *oeth = eth_hdr(oldskb);
 
-		nskb->dev = nf_bridge_get_physindev(oldskb);
+		nskb->dev = br_indev;
 		niph->tot_len = htons(nskb->len);
 		ip_send_check(niph);
 		if (dev_hard_header(nskb, nskb->dev, ntohs(nskb->protocol),
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 24858402e374..b9c8a763c863 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -131,6 +131,7 @@ EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);
 
 void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
 {
+	struct net_device *br_indev __maybe_unused;
 	struct sk_buff *nskb;
 	struct tcphdr _otcph;
 	const struct tcphdr *otcph;
@@ -197,15 +198,18 @@ void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
 	 * build the eth header using the original destination's MAC as the
 	 * source, and send the RST packet directly.
 	 */
-	if (oldskb->nf_bridge) {
+	br_indev = nf_bridge_get_physindev(oldskb);
+	if (br_indev) {
 		struct ethhdr *oeth = eth_hdr(oldskb);
 
-		nskb->dev = nf_bridge_get_physindev(oldskb);
+		nskb->dev = br_indev;
 		nskb->protocol = htons(ETH_P_IPV6);
 		ip6h->payload_len = htons(sizeof(struct tcphdr));
 		if (dev_hard_header(nskb, nskb->dev, ntohs(nskb->protocol),
-				    oeth->h_source, oeth->h_dest, nskb->len) < 0)
+				    oeth->h_source, oeth->h_dest, nskb->len) < 0) {
+			kfree_skb(nskb);
 			return;
+		}
 		dev_queue_xmit(nskb);
 	} else
 #endif
diff --git a/net/netfilter/nf_log_common.c b/net/netfilter/nf_log_common.c
index a8c5c846aec1..3a0d6880b7c9 100644
--- a/net/netfilter/nf_log_common.c
+++ b/net/netfilter/nf_log_common.c
@@ -156,22 +156,20 @@ nf_log_dump_packet_common(struct nf_log_buf *m, u_int8_t pf,
 			  const struct net_device *out,
 			  const struct nf_loginfo *loginfo, const char *prefix)
 {
+	const struct net_device *physoutdev __maybe_unused;
+	const struct net_device *physindev __maybe_unused;
+
 	nf_log_buf_add(m, KERN_SOH "%c%sIN=%s OUT=%s ",
 	       '0' + loginfo->u.log.level, prefix,
 	       in ? in->name : "",
 	       out ? out->name : "");
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge) {
-		const struct net_device *physindev;
-		const struct net_device *physoutdev;
-
-		physindev = nf_bridge_get_physindev(skb);
-		if (physindev && in != physindev)
-			nf_log_buf_add(m, "PHYSIN=%s ", physindev->name);
-		physoutdev = nf_bridge_get_physoutdev(skb);
-		if (physoutdev && out != physoutdev)
-			nf_log_buf_add(m, "PHYSOUT=%s ", physoutdev->name);
-	}
+	physindev = nf_bridge_get_physindev(skb);
+	if (physindev && in != physindev)
+		nf_log_buf_add(m, "PHYSIN=%s ", physindev->name);
+	physoutdev = nf_bridge_get_physoutdev(skb);
+	if (physoutdev && out != physoutdev)
+		nf_log_buf_add(m, "PHYSOUT=%s ", physoutdev->name);
 #endif
 }
 EXPORT_SYMBOL_GPL(nf_log_dump_packet_common);
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index d67a96a25a68..a36a77bae1d6 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -46,6 +46,24 @@ void nf_unregister_queue_handler(struct net *net)
 }
 EXPORT_SYMBOL(nf_unregister_queue_handler);
 
+static void nf_queue_entry_release_br_nf_refs(struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge) {
+		struct net_device *physdev;
+
+		physdev = nf_bridge_get_physindev(skb);
+		if (physdev)
+			dev_put(physdev);
+		physdev = nf_bridge_get_physoutdev(skb);
+		if (physdev)
+			dev_put(physdev);
+	}
+#endif
+}
+
 void nf_queue_entry_release_refs(struct nf_queue_entry *entry)
 {
 	struct nf_hook_state *state = &entry->state;
@@ -57,20 +75,28 @@ void nf_queue_entry_release_refs(struct nf_queue_entry *entry)
 		dev_put(state->out);
 	if (state->sk)
 		sock_put(state->sk);
+
+	nf_queue_entry_release_br_nf_refs(entry->skb);
+}
+EXPORT_SYMBOL_GPL(nf_queue_entry_release_refs);
+
+static void nf_queue_entry_get_br_nf_refs(struct sk_buff *skb)
+{
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (entry->skb->nf_bridge) {
+	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
+
+	if (nf_bridge) {
 		struct net_device *physdev;
 
-		physdev = nf_bridge_get_physindev(entry->skb);
+		physdev = nf_bridge_get_physindev(skb);
 		if (physdev)
-			dev_put(physdev);
-		physdev = nf_bridge_get_physoutdev(entry->skb);
+			dev_hold(physdev);
+		physdev = nf_bridge_get_physoutdev(skb);
 		if (physdev)
-			dev_put(physdev);
+			dev_hold(physdev);
 	}
 #endif
 }
-EXPORT_SYMBOL_GPL(nf_queue_entry_release_refs);
 
 /* Bump dev refs so they don't vanish while packet is out */
 void nf_queue_entry_get_refs(struct nf_queue_entry *entry)
@@ -83,18 +109,8 @@ void nf_queue_entry_get_refs(struct nf_queue_entry *entry)
 		dev_hold(state->out);
 	if (state->sk)
 		sock_hold(state->sk);
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (entry->skb->nf_bridge) {
-		struct net_device *physdev;
 
-		physdev = nf_bridge_get_physindev(entry->skb);
-		if (physdev)
-			dev_hold(physdev);
-		physdev = nf_bridge_get_physoutdev(entry->skb);
-		if (physdev)
-			dev_hold(physdev);
-	}
-#endif
+	nf_queue_entry_get_br_nf_refs(entry->skb);
 }
 EXPORT_SYMBOL_GPL(nf_queue_entry_get_refs);
 
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 1ce30efe6854..0dcc3592d053 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -727,13 +727,13 @@ nf_queue_entry_dup(struct nf_queue_entry *e)
  */
 static void nf_bridge_adjust_skb_data(struct sk_buff *skb)
 {
-	if (skb->nf_bridge)
+	if (nf_bridge_info_get(skb))
 		__skb_push(skb, skb->network_header - skb->mac_header);
 }
 
 static void nf_bridge_adjust_segmented_data(struct sk_buff *skb)
 {
-	if (skb->nf_bridge)
+	if (nf_bridge_info_get(skb))
 		__skb_pull(skb, skb->network_header - skb->mac_header);
 }
 #else
@@ -904,23 +904,22 @@ nfqnl_set_mode(struct nfqnl_instance *queue,
 static int
 dev_cmp(struct nf_queue_entry *entry, unsigned long ifindex)
 {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	int physinif, physoutif;
+
+	physinif = nf_bridge_get_physinif(entry->skb);
+	physoutif = nf_bridge_get_physoutif(entry->skb);
+
+	if (physinif == ifindex || physoutif == ifindex)
+		return 1;
+#endif
 	if (entry->state.in)
 		if (entry->state.in->ifindex == ifindex)
 			return 1;
 	if (entry->state.out)
 		if (entry->state.out->ifindex == ifindex)
 			return 1;
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (entry->skb->nf_bridge) {
-		int physinif, physoutif;
 
-		physinif = nf_bridge_get_physinif(entry->skb);
-		physoutif = nf_bridge_get_physoutif(entry->skb);
-
-		if (physinif == ifindex || physoutif == ifindex)
-			return 1;
-	}
-#endif
 	return 0;
 }
 
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 9d6d67b953ac..4034d70bff39 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -33,7 +33,7 @@ physdev_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	/* Not a bridged IP packet or no info available yet:
 	 * LOCAL_OUT/mangle and LOCAL_OUT/nat don't know if
 	 * the destination device will be a bridge. */
-	if (!skb->nf_bridge) {
+	if (!nf_bridge_info_exists(skb)) {
 		/* Return MATCH if the invert flags of the used options are on */
 		if ((info->bitmask & XT_PHYSDEV_OP_BRIDGED) &&
 		    !(info->invert & XT_PHYSDEV_OP_BRIDGED))
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 01/13] netfilter: avoid using skb->nf_bridge directly Florian Westphal
@ 2018-12-10 14:49 ` Florian Westphal
       [not found]   ` <CAPUCuiADYwjY4kpq76-w9BKL3uiRvNjnmzKG29mCrb=b8YeesA@mail.gmail.com>
                     ` (3 more replies)
  2018-12-10 14:49 ` [PATCH net-next 03/13] net: convert bridge_nf to use " Florian Westphal
                   ` (11 subsequent siblings)
  13 siblings, 4 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

The (out-of-tree) Multipath-TCP implementation needs a significant amount
of extra space in the skb control buffer.

Increasing skb->cb[] size in mainline is a non-starter for memory and
and performance reasons (f.e. increase in cb size also moves several
frequently-accessed fields to other cache lines).

One approach that might work for MPTCP is to extend skb_shared_info instead
of sk_buff.  However, this comes with other drawbacks, e.g.  it either
needs special skb allocation to make sure there is enough space for such
'extended shinfo' at the end of data buffer (which would make this only
useable for the MPTCP tx path) or such a change would increase size of
skb_shared_info.

This work adds an extension infrastructure for sk_buff:
1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb.

This is also how xfrm and bridge netfilter skb-associated data
(skb->sp and skb->nf_bridge) is handled.

This series removes the nf_bridge pointer and makes bridge netfilter
use the extension infrastructure instead.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   have been allocated for the skb.
2. extension pointer, located at the end of the sk_buff.
   If active_extensions is 0, its content is undefined.

The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.

This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but replaces the existing code that
deals with skb->nf_bridge.

This patch only adds the basic infrastructure, the nf_bridge conversion
is added in the next patch.

Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension is planned
as a followup.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h | 119 +++++++++++++++++++++++++++++++++++-
 net/Kconfig            |   3 +
 net/core/skbuff.c      | 133 +++++++++++++++++++++++++++++++++++++++++
 net/ipv4/ip_output.c   |   1 +
 net/ipv6/ip6_output.c  |   1 +
 5 files changed, 256 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b1831a5ca173..d715736eb734 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -245,6 +245,7 @@ struct iov_iter;
 struct napi_struct;
 struct bpf_prog;
 union bpf_attr;
+struct skb_ext;
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 struct nf_conntrack {
@@ -636,6 +637,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@queue_mapping: Queue mapping for multiqueue devices
  *	@xmit_more: More SKBs are pending for this queue
  *	@pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
+ *	@active_extensions: active extensions (skb_ext_id types)
  *	@ndisc_nodetype: router type (from link layer)
  *	@ooo_okay: allow the mapping of a socket to a queue to be changed
  *	@l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -665,6 +667,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@data: Data head pointer
  *	@truesize: Buffer size
  *	@users: User count - see {datagram,tcp}.c
+ *	@extensions: allocated extensions, valid if active_extensions is nonzero
  */
 
 struct sk_buff {
@@ -747,7 +750,9 @@ struct sk_buff {
 				head_frag:1,
 				xmit_more:1,
 				pfmemalloc:1;
-
+#ifdef CONFIG_SKB_EXTENSIONS
+	__u8			active_extensions;
+#endif
 	/* fields enclosed in headers_start/headers_end are copied
 	 * using a single memcpy() in __copy_skb_header()
 	 */
@@ -869,6 +874,11 @@ struct sk_buff {
 				*data;
 	unsigned int		truesize;
 	refcount_t		users;
+
+#ifdef CONFIG_SKB_EXTENSIONS
+	/* only useable after checking ->active_extensions != 0 */
+	struct skb_ext		*extensions;
+#endif
 };
 
 #ifdef __KERNEL__
@@ -3896,6 +3906,113 @@ static inline void nf_conntrack_get(struct nf_conntrack *nfct)
 		atomic_inc(&nfct->use);
 }
 #endif
+
+#ifdef CONFIG_SKB_EXTENSIONS
+enum skb_ext_id {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	SKB_EXT_BRIDGE_NF,
+#endif
+	SKB_EXT_NUM, /* must be last */
+};
+
+/**
+ *	struct skb_ext - sk_buff extensions
+ *	@refcnt: 1 on allocation, deallocated on 0
+ *	@offset: offset to add to @data to obtain extension address
+ *	@chunks: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units
+ *	@data: start of extension data, variable sized
+ *
+ *	Note: offsets/lengths are stored in chunks of 8 bytes, this allows
+ *	to use 'u8' types while allowing up to 2kb worth of extension data.
+ */
+struct skb_ext {
+	refcount_t refcnt;
+	u8 offset[SKB_EXT_NUM]; /* in chunks of 8 bytes */
+	u8 chunks;		/* same */
+	char data[0] __aligned(8);
+};
+
+void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
+void __skb_ext_free(struct skb_ext *ext);
+
+static inline void __skb_ext_put(struct skb_ext *ext)
+{
+	if (ext && refcount_dec_and_test(&ext->refcnt))
+		__skb_ext_free(ext);
+}
+
+static inline void skb_ext_put(struct sk_buff *skb)
+{
+	if (skb->active_extensions)
+		__skb_ext_put(skb->extensions);
+}
+
+static inline void skb_ext_get(struct sk_buff *skb)
+{
+	if (skb->active_extensions) {
+		struct skb_ext *ext = skb->extensions;
+
+		if (ext)
+			refcount_inc(&ext->refcnt);
+	}
+}
+
+static inline void __skb_ext_copy(struct sk_buff *dst,
+				  const struct sk_buff *src)
+{
+	dst->active_extensions = src->active_extensions;
+
+	if (src->active_extensions) {
+		struct skb_ext *ext = src->extensions;
+
+		if (ext)
+			refcount_inc(&ext->refcnt);
+		dst->extensions = ext;
+	}
+}
+
+static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *src)
+{
+	skb_ext_put(dst);
+	__skb_ext_copy(dst, src);
+}
+
+static inline bool __skb_ext_exist(const struct skb_ext *ext, enum skb_ext_id i)
+{
+	return !!ext->offset[i];
+}
+
+static inline bool skb_ext_exist(const struct sk_buff *skb, enum skb_ext_id id)
+{
+	return skb->active_extensions & (1 << id);
+}
+
+static inline void skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
+{
+	if (skb_ext_exist(skb, id))
+		__skb_ext_del(skb, id);
+}
+
+static inline void *skb_ext_find(const struct sk_buff *skb, enum skb_ext_id id)
+{
+	if (skb_ext_exist(skb, id)) {
+		struct skb_ext *ext = skb->extensions;
+
+		if (ext && __skb_ext_exist(ext, id))
+			return (void *)ext + (ext->offset[id] << 3);
+	}
+
+	return NULL;
+}
+#else
+static inline void skb_ext_put(struct sk_buff *skb) {}
+static inline void skb_ext_get(struct sk_buff *skb) {}
+static inline void skb_ext_del(struct sk_buff *skb, int unused) {}
+static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
+static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
+#endif /* CONFIG_SKB_EXTENSIONS */
+
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
 {
diff --git a/net/Kconfig b/net/Kconfig
index f235edb593ba..93b291292860 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -51,6 +51,9 @@ config NET_INGRESS
 config NET_EGRESS
 	bool
 
+config SKB_EXTENSIONS
+	bool
+
 menu "Networking options"
 
 source "net/packet/Kconfig"
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 40552547c69a..e36461d03d4c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -617,6 +617,7 @@ void skb_release_head_state(struct sk_buff *skb)
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	nf_bridge_put(skb->nf_bridge);
 #endif
+	skb_ext_put(skb);
 }
 
 /* Free everything but the sk_buff shell. */
@@ -796,6 +797,7 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	new->dev		= old->dev;
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	skb_dst_copy(new, old);
+	__skb_ext_copy(new, old);
 #ifdef CONFIG_XFRM
 	new->sp			= secpath_get(old->sp);
 #endif
@@ -5554,3 +5556,134 @@ void skb_condense(struct sk_buff *skb)
 	 */
 	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 }
+
+#ifdef CONFIG_SKB_EXTENSIONS
+#define SKB_EXT_ALIGN_VALUE	8
+#define SKB_EXT_CHUNKSIZEOF(x)	(ALIGN((sizeof(x)), SKB_EXT_ALIGN_VALUE) / SKB_EXT_ALIGN_VALUE)
+
+static const u8 skb_ext_type_len[] = {
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+	[SKB_EXT_BRIDGE_NF] = SKB_EXT_CHUNKSIZEOF(struct nf_bridge_info),
+#endif
+};
+
+static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
+{
+	return (void *)ext + (ext->offset[id] * SKB_EXT_ALIGN_VALUE);
+}
+
+static struct skb_ext *skb_ext_cow(unsigned int len,
+				   struct skb_ext *old)
+{
+	struct skb_ext *new = kmalloc(len, GFP_ATOMIC);
+
+	if (!new)
+		return NULL;
+
+	if (!old) {
+		memset(new->offset, 0, sizeof(new->offset));
+		refcount_set(&new->refcnt, 1);
+		return new;
+	}
+
+	memcpy(new, old, old->chunks * SKB_EXT_ALIGN_VALUE);
+	refcount_set(&new->refcnt, 1);
+	__skb_ext_put(old);
+	return new;
+}
+
+static __always_inline unsigned int skb_ext_total_length(void)
+{
+	return 0 +
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+		skb_ext_type_len[SKB_EXT_BRIDGE_NF] +
+#endif
+		0;
+}
+
+/**
+ * skb_ext_add - allocate space for given extension, COW if needed
+ * @skb: buffer
+ * @id: extension to allocate space for
+ *
+ * Allocates enough space for the given extension.
+ * If the extension is already present, a pointer to that extension
+ * is returned.
+ *
+ * If the skb was cloned, COW applies and the returned memory can be
+ * modified without changing the extension space of clones buffers.
+ *
+ * Returns pointer to the extenion or NULL on allocation failure.
+ */
+void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
+{
+	struct skb_ext *new, *old = NULL;
+	unsigned int newlen, newoff;
+	bool cow_needed = true;
+
+	BUILD_BUG_ON(SKB_EXT_NUM >= 8);
+	BUILD_BUG_ON(skb_ext_total_length() > 255);
+
+	if (skb->active_extensions) {
+		old = skb->extensions;
+
+		cow_needed = refcount_read(&old->refcnt) > 1;
+
+		if (__skb_ext_exist(old, id)) {
+			if (!cow_needed) {
+				new = old;
+				goto set_active;
+			}
+
+			/* extension was allocated previously and it
+			 * might be used by a cloned skb. COW needed.
+			 */
+			new = skb_ext_cow(old->chunks * SKB_EXT_ALIGN_VALUE, old);
+			if (!new)
+				return NULL;
+
+			skb->extensions = new;
+			goto set_active;
+		}
+		newoff = old->chunks;
+	} else {
+		newoff = SKB_EXT_CHUNKSIZEOF(*new);
+	}
+
+	newlen = newoff + skb_ext_type_len[id];
+
+	if (cow_needed)
+		new = skb_ext_cow(newlen * SKB_EXT_ALIGN_VALUE, old);
+	else
+		new = krealloc(old, newlen * SKB_EXT_ALIGN_VALUE, GFP_ATOMIC);
+	if (!new)
+		return NULL;
+
+	new->offset[id] = newoff;
+	new->chunks = newlen;
+	skb->extensions = new;
+set_active:
+	skb->active_extensions |= 1 << id;
+	return skb_ext_get_ptr(new, id);
+}
+EXPORT_SYMBOL(skb_ext_add);
+
+void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
+{
+	struct skb_ext *ext;
+
+	skb->active_extensions &= ~(1 << id);
+	if (skb->active_extensions == 0) {
+		ext = skb->extensions;
+		skb->extensions = NULL;
+		__skb_ext_put(ext);
+	}
+}
+EXPORT_SYMBOL(__skb_ext_del);
+
+void __skb_ext_free(struct skb_ext *ext)
+{
+	kfree(ext);
+}
+EXPORT_SYMBOL(__skb_ext_free);
+#endif /* CONFIG_SKB_EXTENSIONS */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index ab6618036afe..c80188875f39 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -533,6 +533,7 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	to->tc_index = from->tc_index;
 #endif
 	nf_copy(to, from);
+	skb_ext_copy(to, from);
 #if IS_ENABLED(CONFIG_IP_VS)
 	to->ipvs_property = from->ipvs_property;
 #endif
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 9d55ee33b7f9..703a8e801c5c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -581,6 +581,7 @@ static void ip6_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	to->tc_index = from->tc_index;
 #endif
 	nf_copy(to, from);
+	skb_ext_copy(to, from);
 	skb_copy_secmark(to, from);
 }
 
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 03/13] net: convert bridge_nf to use skb extension infrastructure
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 01/13] netfilter: avoid using skb->nf_bridge directly Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 02/13] sk_buff: add skb extension infrastructure Florian Westphal
@ 2018-12-10 14:49 ` Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 04/13] xfrm: change secpath_set to return secpath struct, not error value Florian Westphal
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.

The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h     |  4 ++--
 include/linux/skbuff.h               | 28 ++--------------------------
 include/net/netfilter/br_netfilter.h |  8 ++++----
 net/Kconfig                          |  1 +
 net/bridge/br_netfilter_hooks.c      | 20 ++------------------
 net/bridge/br_netfilter_ipv6.c       |  4 ++--
 net/core/skbuff.c                    |  3 ---
 7 files changed, 13 insertions(+), 55 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 0a65a422587c..5f2614d02e03 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -20,12 +20,12 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 static inline struct nf_bridge_info *
 nf_bridge_info_get(const struct sk_buff *skb)
 {
-	return skb->nf_bridge;
+	return skb_ext_find(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline bool nf_bridge_info_exists(const struct sk_buff *skb)
 {
-	return skb->nf_bridge != NULL;
+	return skb_ext_exist(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline int nf_bridge_get_physinif(const struct sk_buff *skb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d715736eb734..d2d32c9fa715 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -255,7 +255,6 @@ struct nf_conntrack {
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 struct nf_bridge_info {
-	refcount_t		use;
 	enum {
 		BRNF_PROTO_UNCHANGED,
 		BRNF_PROTO_8021Q,
@@ -720,9 +719,6 @@ struct sk_buff {
 #endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	unsigned long		 _nfct;
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	struct nf_bridge_info	*nf_bridge;
 #endif
 	unsigned int		len,
 				data_len;
@@ -4013,18 +4009,6 @@ static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
 static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
 #endif /* CONFIG_SKB_EXTENSIONS */
 
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
-{
-	if (nf_bridge && refcount_dec_and_test(&nf_bridge->use))
-		kfree(nf_bridge);
-}
-static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
-{
-	if (nf_bridge)
-		refcount_inc(&nf_bridge->use);
-}
-#endif /* CONFIG_BRIDGE_NETFILTER */
 static inline void nf_reset(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
@@ -4032,8 +4016,7 @@ static inline void nf_reset(struct sk_buff *skb)
 	skb->_nfct = 0;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	nf_bridge_put(skb->nf_bridge);
-	skb->nf_bridge = NULL;
+	skb_ext_del(skb, SKB_EXT_BRIDGE_NF);
 #endif
 }
 
@@ -4051,7 +4034,7 @@ static inline void ipvs_reset(struct sk_buff *skb)
 #endif
 }
 
-/* Note: This doesn't put any conntrack and bridge info in dst. */
+/* Note: This doesn't put any conntrack info in dst. */
 static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 			     bool copy)
 {
@@ -4059,10 +4042,6 @@ static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 	dst->_nfct = src->_nfct;
 	nf_conntrack_get(skb_nfct(src));
 #endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	dst->nf_bridge  = src->nf_bridge;
-	nf_bridge_get(src->nf_bridge);
-#endif
 #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE) || defined(CONFIG_NF_TABLES)
 	if (copy)
 		dst->nf_trace = src->nf_trace;
@@ -4073,9 +4052,6 @@ static inline void nf_copy(struct sk_buff *dst, const struct sk_buff *src)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	nf_conntrack_put(skb_nfct(dst));
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	nf_bridge_put(dst->nf_bridge);
 #endif
 	__nf_copy(dst, src, true);
 }
diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index 6efc0153987b..4cd56808ac4e 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -6,12 +6,12 @@
 
 static inline struct nf_bridge_info *nf_bridge_alloc(struct sk_buff *skb)
 {
-	skb->nf_bridge = kzalloc(sizeof(struct nf_bridge_info), GFP_ATOMIC);
+	struct nf_bridge_info *b = skb_ext_add(skb, SKB_EXT_BRIDGE_NF);
 
-	if (likely(skb->nf_bridge))
-		refcount_set(&(skb->nf_bridge->use), 1);
+	if (b)
+		memset(b, 0, sizeof(*b));
 
-	return skb->nf_bridge;
+	return b;
 }
 
 void nf_bridge_update_protocol(struct sk_buff *skb);
diff --git a/net/Kconfig b/net/Kconfig
index 93b291292860..5cb9de1aaf88 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -187,6 +187,7 @@ config BRIDGE_NETFILTER
 	depends on NETFILTER && INET
 	depends on NETFILTER_ADVANCED
 	select NETFILTER_FAMILY_BRIDGE
+	select SKB_EXTENSIONS
 	default m
 	---help---
 	  Enabling this option will let arptables resp. iptables see bridged
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index c58cf68b45c5..d21a23698410 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -132,10 +132,7 @@ static DEFINE_PER_CPU(struct brnf_frag_data, brnf_frag_data_storage);
 
 static void nf_bridge_info_free(struct sk_buff *skb)
 {
-	if (skb->nf_bridge) {
-		nf_bridge_put(skb->nf_bridge);
-		skb->nf_bridge = NULL;
-	}
+	skb_ext_del(skb, SKB_EXT_BRIDGE_NF);
 }
 
 static inline struct net_device *bridge_parent(const struct net_device *dev)
@@ -148,19 +145,7 @@ static inline struct net_device *bridge_parent(const struct net_device *dev)
 
 static inline struct nf_bridge_info *nf_bridge_unshare(struct sk_buff *skb)
 {
-	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
-
-	if (refcount_read(&nf_bridge->use) > 1) {
-		struct nf_bridge_info *tmp = nf_bridge_alloc(skb);
-
-		if (tmp) {
-			memcpy(tmp, nf_bridge, sizeof(struct nf_bridge_info));
-			refcount_set(&tmp->use, 1);
-		}
-		nf_bridge_put(nf_bridge);
-		nf_bridge = tmp;
-	}
-	return nf_bridge;
+	return skb_ext_add(skb, SKB_EXT_BRIDGE_NF);
 }
 
 unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
@@ -508,7 +493,6 @@ static unsigned int br_nf_pre_routing(void *priv,
 	if (br_validate_ipv4(state->net, skb))
 		return NF_DROP;
 
-	nf_bridge_put(skb->nf_bridge);
 	if (!nf_bridge_alloc(skb))
 		return NF_DROP;
 	if (!setup_pre_routing(skb))
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 96c072e71ea2..94039f588f1d 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -224,8 +224,8 @@ unsigned int br_nf_pre_routing_ipv6(void *priv,
 	if (br_validate_ipv6(state->net, skb))
 		return NF_DROP;
 
-	nf_bridge_put(skb->nf_bridge);
-	if (!nf_bridge_alloc(skb))
+	nf_bridge = nf_bridge_alloc(skb);
+	if (!nf_bridge)
 		return NF_DROP;
 	if (!setup_pre_routing(skb))
 		return NF_DROP;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e36461d03d4c..960406724970 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -613,9 +613,6 @@ void skb_release_head_state(struct sk_buff *skb)
 	}
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 	nf_conntrack_put(skb_nfct(skb));
-#endif
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	nf_bridge_put(skb->nf_bridge);
 #endif
 	skb_ext_put(skb);
 }
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 04/13] xfrm: change secpath_set to return secpath struct, not error value
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (2 preceding siblings ...)
  2018-12-10 14:49 ` [PATCH net-next 03/13] net: convert bridge_nf to use " Florian Westphal
@ 2018-12-10 14:49 ` Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 05/13] net: move secpath_exist helper to sk_buff.h Florian Westphal
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

It can only return 0 (success) or -ENOMEM.
Change return value to a pointer to secpath struct.

This avoids direct access to skb->sp:

err = secpath_set(skb);
if (!err) ..
skb->sp-> ...

Becomes:
sp = secpath_set(skb)
if (!sp) ..
sp-> ..

This reduces noise in followup patch which is going to remove skb->sp.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/xfrm.h      |  2 +-
 net/ipv4/esp4_offload.c | 11 ++++++-----
 net/ipv6/esp6_offload.c | 11 ++++++-----
 net/ipv6/xfrm6_input.c  |  6 ++++--
 net/xfrm/xfrm_input.c   | 16 +++++++++-------
 5 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 0eb390c205af..71b88b539f37 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1128,7 +1128,7 @@ secpath_put(struct sec_path *sp)
 }
 
 struct sec_path *secpath_dup(struct sec_path *src);
-int secpath_set(struct sk_buff *skb);
+struct sec_path *secpath_set(struct sk_buff *skb);
 
 static inline void
 secpath_reset(struct sk_buff *skb)
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 58834a10c0be..19bd22aa05f9 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -46,11 +46,12 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
 
 	xo = xfrm_offload(skb);
 	if (!xo || !(xo->flags & CRYPTO_DONE)) {
-		err = secpath_set(skb);
-		if (err)
+		struct sec_path *sp = secpath_set(skb);
+
+		if (!sp)
 			goto out;
 
-		if (skb->sp->len == XFRM_MAX_DEPTH)
+		if (sp->len == XFRM_MAX_DEPTH)
 			goto out;
 
 		x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
@@ -59,8 +60,8 @@ static struct sk_buff *esp4_gro_receive(struct list_head *head,
 		if (!x)
 			goto out;
 
-		skb->sp->xvec[skb->sp->len++] = x;
-		skb->sp->olen++;
+		sp->xvec[sp->len++] = x;
+		sp->olen++;
 
 		xo = xfrm_offload(skb);
 		if (!xo) {
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 6177e2171171..01a97f5dfa4e 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -68,11 +68,12 @@ static struct sk_buff *esp6_gro_receive(struct list_head *head,
 
 	xo = xfrm_offload(skb);
 	if (!xo || !(xo->flags & CRYPTO_DONE)) {
-		err = secpath_set(skb);
-		if (err)
+		struct sec_path *sp = secpath_set(skb);
+
+		if (!sp)
 			goto out;
 
-		if (skb->sp->len == XFRM_MAX_DEPTH)
+		if (sp->len == XFRM_MAX_DEPTH)
 			goto out;
 
 		x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
@@ -81,8 +82,8 @@ static struct sk_buff *esp6_gro_receive(struct list_head *head,
 		if (!x)
 			goto out;
 
-		skb->sp->xvec[skb->sp->len++] = x;
-		skb->sp->olen++;
+		sp->xvec[sp->len++] = x;
+		sp->olen++;
 
 		xo = xfrm_offload(skb);
 		if (!xo) {
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index 9ef490dddcea..97c69df1b329 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -86,14 +86,16 @@ int xfrm6_input_addr(struct sk_buff *skb, xfrm_address_t *daddr,
 {
 	struct net *net = dev_net(skb->dev);
 	struct xfrm_state *x = NULL;
+	struct sec_path *sp;
 	int i = 0;
 
-	if (secpath_set(skb)) {
+	sp = secpath_set(skb);
+	if (!sp) {
 		XFRM_INC_STATS(net, LINUX_MIB_XFRMINERROR);
 		goto drop;
 	}
 
-	if (1 + skb->sp->len == XFRM_MAX_DEPTH) {
+	if (1 + sp->len == XFRM_MAX_DEPTH) {
 		XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
 		goto drop;
 	}
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 684c0bc01e2c..bda929b9ff35 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -145,21 +145,22 @@ struct sec_path *secpath_dup(struct sec_path *src)
 }
 EXPORT_SYMBOL(secpath_dup);
 
-int secpath_set(struct sk_buff *skb)
+struct sec_path *secpath_set(struct sk_buff *skb)
 {
-	struct sec_path *sp;
+	struct sec_path *sp = skb->sp;
 
 	/* Allocate new secpath or COW existing one. */
-	if (!skb->sp || refcount_read(&skb->sp->refcnt) != 1) {
+	if (!sp || refcount_read(&sp->refcnt) != 1) {
 		sp = secpath_dup(skb->sp);
 		if (!sp)
-			return -ENOMEM;
+			return NULL;
 
 		if (skb->sp)
 			secpath_put(skb->sp);
 		skb->sp = sp;
 	}
-	return 0;
+
+	return sp;
 }
 EXPORT_SYMBOL(secpath_set);
 
@@ -236,6 +237,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 	bool xfrm_gro = false;
 	bool crypto_done = false;
 	struct xfrm_offload *xo = xfrm_offload(skb);
+	struct sec_path *sp;
 
 	if (encap_type < 0) {
 		x = xfrm_input_state(skb);
@@ -312,8 +314,8 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 		break;
 	}
 
-	err = secpath_set(skb);
-	if (err) {
+	sp = secpath_set(skb);
+	if (!sp) {
 		XFRM_INC_STATS(net, LINUX_MIB_XFRMINERROR);
 		goto drop;
 	}
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 05/13] net: move secpath_exist helper to sk_buff.h
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (3 preceding siblings ...)
  2018-12-10 14:49 ` [PATCH net-next 04/13] xfrm: change secpath_set to return secpath struct, not error value Florian Westphal
@ 2018-12-10 14:49 ` Florian Westphal
  2018-12-10 14:49 ` [PATCH net-next 06/13] net: use skb_sec_path helper in more places Florian Westphal
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h   | 13 ++++++++++---
 include/net/xfrm.h       |  9 ---------
 net/netfilter/nft_meta.c |  2 +-
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d2d32c9fa715..da5f8226b55a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4074,12 +4074,19 @@ static inline void skb_init_secmark(struct sk_buff *skb)
 { }
 #endif
 
+static inline int secpath_exists(const struct sk_buff *skb)
+{
+#ifdef CONFIG_XFRM
+	return skb->sp != NULL;
+#else
+	return 0;
+#endif
+}
+
 static inline bool skb_irq_freeable(const struct sk_buff *skb)
 {
 	return !skb->destructor &&
-#if IS_ENABLED(CONFIG_XFRM)
-		!skb->sp &&
-#endif
+		!secpath_exists(skb) &&
 		!skb_nfct(skb) &&
 		!skb->_skb_refdst &&
 		!skb_has_frag_list(skb);
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 71b88b539f37..4e1d074dc69e 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1101,15 +1101,6 @@ struct sec_path {
 	struct xfrm_offload	ovec[XFRM_MAX_OFFLOAD_DEPTH];
 };
 
-static inline int secpath_exists(struct sk_buff *skb)
-{
-#ifdef CONFIG_XFRM
-	return skb->sp != NULL;
-#else
-	return 0;
-#endif
-}
-
 static inline struct sec_path *
 secpath_get(struct sec_path *sp)
 {
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 6180626c3f80..6df486c5ebd3 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -229,7 +229,7 @@ void nft_meta_get_eval(const struct nft_expr *expr,
 	}
 #ifdef CONFIG_XFRM
 	case NFT_META_SECPATH:
-		nft_reg_store8(dest, !!skb->sp);
+		nft_reg_store8(dest, secpath_exists(skb));
 		break;
 #endif
 #ifdef CONFIG_NF_TABLES_BRIDGE
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 06/13] net: use skb_sec_path helper in more places
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (4 preceding siblings ...)
  2018-12-10 14:49 ` [PATCH net-next 05/13] net: move secpath_exist helper to sk_buff.h Florian Westphal
@ 2018-12-10 14:49 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 07/13] drivers: net: intel: use secpath helpers " Florian Westphal
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:49 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type

same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h    |  2 +-
 include/net/xfrm.h        |  6 ++++--
 net/ipv4/esp4.c           |  9 ++++++---
 net/ipv4/esp4_offload.c   |  4 +++-
 net/ipv6/esp6.c           |  9 ++++++---
 net/ipv6/esp6_offload.c   |  4 +++-
 net/ipv6/xfrm6_input.c    |  2 +-
 net/netfilter/nft_xfrm.c  |  2 +-
 net/netfilter/xt_policy.c |  2 +-
 net/xfrm/xfrm_device.c    |  4 +++-
 net/xfrm/xfrm_input.c     | 16 ++++++++++------
 net/xfrm/xfrm_policy.c    | 19 +++++++++++--------
 security/selinux/xfrm.c   |  4 ++--
 13 files changed, 52 insertions(+), 31 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index da5f8226b55a..3ad9fbeb4ac4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4132,7 +4132,7 @@ static inline bool skb_get_dst_pending_confirm(const struct sk_buff *skb)
 	return skb->dst_pending_confirm != 0;
 }
 
-static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
+static inline struct sec_path *skb_sec_path(const struct sk_buff *skb)
 {
 #ifdef CONFIG_XFRM
 	return skb->sp;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 4e1d074dc69e..71411ebaf765 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1893,14 +1893,16 @@ static inline void xfrm_states_delete(struct xfrm_state **states, int n)
 #ifdef CONFIG_XFRM
 static inline struct xfrm_state *xfrm_input_state(struct sk_buff *skb)
 {
-	return skb->sp->xvec[skb->sp->len - 1];
+	struct sec_path *sp = skb_sec_path(skb);
+
+	return sp->xvec[sp->len - 1];
 }
 #endif
 
 static inline struct xfrm_offload *xfrm_offload(struct sk_buff *skb)
 {
 #ifdef CONFIG_XFRM
-	struct sec_path *sp = skb->sp;
+	struct sec_path *sp = skb_sec_path(skb);
 
 	if (!sp || !sp->olen || sp->len != sp->olen)
 		return NULL;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 9e1c840596c5..5459f41fc26f 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -125,10 +125,13 @@ static void esp_output_done(struct crypto_async_request *base, int err)
 	void *tmp;
 	struct xfrm_state *x;
 
-	if (xo && (xo->flags & XFRM_DEV_RESUME))
-		x = skb->sp->xvec[skb->sp->len - 1];
-	else
+	if (xo && (xo->flags & XFRM_DEV_RESUME)) {
+		struct sec_path *sp = skb_sec_path(skb);
+
+		x = sp->xvec[sp->len - 1];
+	} else {
 		x = skb_dst(skb)->xfrm;
+	}
 
 	tmp = ESP_SKB_CB(skb)->tmp;
 	esp_ssg_unref(x, tmp);
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 19bd22aa05f9..8756e0e790d2 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -115,6 +115,7 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
 	struct crypto_aead *aead;
 	netdev_features_t esp_features = features;
 	struct xfrm_offload *xo = xfrm_offload(skb);
+	struct sec_path *sp;
 
 	if (!xo)
 		return ERR_PTR(-EINVAL);
@@ -122,7 +123,8 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
 	if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP))
 		return ERR_PTR(-EINVAL);
 
-	x = skb->sp->xvec[skb->sp->len - 1];
+	sp = skb_sec_path(skb);
+	x = sp->xvec[sp->len - 1];
 	aead = x->data;
 	esph = ip_esp_hdr(skb);
 
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 63b2b66f9dfa..5afe9f83374d 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -145,10 +145,13 @@ static void esp_output_done(struct crypto_async_request *base, int err)
 	void *tmp;
 	struct xfrm_state *x;
 
-	if (xo && (xo->flags & XFRM_DEV_RESUME))
-		x = skb->sp->xvec[skb->sp->len - 1];
-	else
+	if (xo && (xo->flags & XFRM_DEV_RESUME)) {
+		struct sec_path *sp = skb_sec_path(skb);
+
+		x = sp->xvec[sp->len - 1];
+	} else {
 		x = skb_dst(skb)->xfrm;
+	}
 
 	tmp = ESP_SKB_CB(skb)->tmp;
 	esp_ssg_unref(x, tmp);
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 01a97f5dfa4e..d46b4eb645c2 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -142,6 +142,7 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
 	struct crypto_aead *aead;
 	netdev_features_t esp_features = features;
 	struct xfrm_offload *xo = xfrm_offload(skb);
+	struct sec_path *sp;
 
 	if (!xo)
 		return ERR_PTR(-EINVAL);
@@ -149,7 +150,8 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
 	if (!(skb_shinfo(skb)->gso_type & SKB_GSO_ESP))
 		return ERR_PTR(-EINVAL);
 
-	x = skb->sp->xvec[skb->sp->len - 1];
+	sp = skb_sec_path(skb);
+	x = sp->xvec[sp->len - 1];
 	aead = x->data;
 	esph = ip_esp_hdr(skb);
 
diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c
index 97c69df1b329..a52cb3fc6df5 100644
--- a/net/ipv6/xfrm6_input.c
+++ b/net/ipv6/xfrm6_input.c
@@ -147,7 +147,7 @@ int xfrm6_input_addr(struct sk_buff *skb, xfrm_address_t *daddr,
 		goto drop;
 	}
 
-	skb->sp->xvec[skb->sp->len++] = x;
+	sp->xvec[sp->len++] = x;
 
 	spin_lock(&x->lock);
 
diff --git a/net/netfilter/nft_xfrm.c b/net/netfilter/nft_xfrm.c
index 5322609f7662..b08865ec5ed3 100644
--- a/net/netfilter/nft_xfrm.c
+++ b/net/netfilter/nft_xfrm.c
@@ -161,7 +161,7 @@ static void nft_xfrm_get_eval_in(const struct nft_xfrm *priv,
 				    struct nft_regs *regs,
 				    const struct nft_pktinfo *pkt)
 {
-	const struct sec_path *sp = pkt->skb->sp;
+	const struct sec_path *sp = skb_sec_path(pkt->skb);
 	const struct xfrm_state *state;
 
 	if (sp == NULL || sp->len <= priv->spnum) {
diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c
index 13f8ccf946d6..aa84e8121c93 100644
--- a/net/netfilter/xt_policy.c
+++ b/net/netfilter/xt_policy.c
@@ -56,7 +56,7 @@ match_policy_in(const struct sk_buff *skb, const struct xt_policy_info *info,
 		unsigned short family)
 {
 	const struct xt_policy_elem *e;
-	const struct sec_path *sp = skb->sp;
+	const struct sec_path *sp = skb_sec_path(skb);
 	int strict = info->flags & XT_POLICY_MATCH_STRICT;
 	int i, pos;
 
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 144c137886b1..b8736f56e7f7 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -32,6 +32,7 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
 	struct softnet_data *sd;
 	netdev_features_t esp_features = features;
 	struct xfrm_offload *xo = xfrm_offload(skb);
+	struct sec_path *sp;
 
 	if (!xo)
 		return skb;
@@ -39,7 +40,8 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
 	if (!(features & NETIF_F_HW_ESP))
 		esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
 
-	x = skb->sp->xvec[skb->sp->len - 1];
+	sp = skb_sec_path(skb);
+	x = sp->xvec[sp->len - 1];
 	if (xo->flags & XFRM_GRO || x->xso.flags & XFRM_OFFLOAD_INBOUND)
 		return skb;
 
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index bda929b9ff35..b4db25b244fa 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -330,7 +330,9 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 	daddr = (xfrm_address_t *)(skb_network_header(skb) +
 				   XFRM_SPI_SKB_CB(skb)->daddroff);
 	do {
-		if (skb->sp->len == XFRM_MAX_DEPTH) {
+		sp = skb_sec_path(skb);
+
+		if (sp->len == XFRM_MAX_DEPTH) {
 			secpath_reset(skb);
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
 			goto drop;
@@ -346,7 +348,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
 		skb->mark = xfrm_smark_get(skb->mark, x);
 
-		skb->sp->xvec[skb->sp->len++] = x;
+		sp->xvec[sp->len++] = x;
 
 lock:
 		spin_lock(&x->lock);
@@ -470,8 +472,9 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 	nf_reset(skb);
 
 	if (decaps) {
-		if (skb->sp)
-			skb->sp->olen = 0;
+		sp = skb_sec_path(skb);
+		if (sp)
+			sp->olen = 0;
 		skb_dst_drop(skb);
 		gro_cells_receive(&gro_cells, skb);
 		return 0;
@@ -482,8 +485,9 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
 		err = x->inner_mode->afinfo->transport_finish(skb, xfrm_gro || async);
 		if (xfrm_gro) {
-			if (skb->sp)
-				skb->sp->olen = 0;
+			sp = skb_sec_path(skb);
+			if (sp)
+				sp->olen = 0;
 			skb_dst_drop(skb);
 			gro_cells_receive(&gro_cells, skb);
 			return err;
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 119a427d9b2b..49bc8d69ee87 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2225,11 +2225,12 @@ EXPORT_SYMBOL(xfrm_lookup_route);
 static inline int
 xfrm_secpath_reject(int idx, struct sk_buff *skb, const struct flowi *fl)
 {
+	struct sec_path *sp = skb_sec_path(skb);
 	struct xfrm_state *x;
 
-	if (!skb->sp || idx < 0 || idx >= skb->sp->len)
+	if (!sp || idx < 0 || idx >= sp->len)
 		return 0;
-	x = skb->sp->xvec[idx];
+	x = sp->xvec[idx];
 	if (!x->type->reject)
 		return 0;
 	return x->type->reject(x, skb, fl);
@@ -2329,6 +2330,7 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	struct flowi fl;
 	int xerr_idx = -1;
 	const struct xfrm_if_cb *ifcb;
+	struct sec_path *sp;
 	struct xfrm_if *xi;
 	u32 if_id = 0;
 
@@ -2353,11 +2355,12 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	nf_nat_decode_session(skb, &fl, family);
 
 	/* First, check used SA against their selectors. */
-	if (skb->sp) {
+	sp = skb_sec_path(skb);
+	if (sp) {
 		int i;
 
-		for (i = skb->sp->len-1; i >= 0; i--) {
-			struct xfrm_state *x = skb->sp->xvec[i];
+		for (i = sp->len - 1; i >= 0; i--) {
+			struct xfrm_state *x = sp->xvec[i];
 			if (!xfrm_selector_match(&x->sel, &fl, family)) {
 				XFRM_INC_STATS(net, LINUX_MIB_XFRMINSTATEMISMATCH);
 				return 0;
@@ -2384,7 +2387,7 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	}
 
 	if (!pol) {
-		if (skb->sp && secpath_has_nontransport(skb->sp, 0, &xerr_idx)) {
+		if (sp && secpath_has_nontransport(sp, 0, &xerr_idx)) {
 			xfrm_secpath_reject(xerr_idx, skb, &fl);
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMINNOPOLS);
 			return 0;
@@ -2413,7 +2416,6 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 #endif
 
 	if (pol->action == XFRM_POLICY_ALLOW) {
-		struct sec_path *sp;
 		static struct sec_path dummy;
 		struct xfrm_tmpl *tp[XFRM_MAX_DEPTH];
 		struct xfrm_tmpl *stp[XFRM_MAX_DEPTH];
@@ -2421,7 +2423,8 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 		int ti = 0;
 		int i, k;
 
-		if ((sp = skb->sp) == NULL)
+		sp = skb_sec_path(skb);
+		if (!sp)
 			sp = &dummy;
 
 		for (pi = 0; pi < npols; pi++) {
diff --git a/security/selinux/xfrm.c b/security/selinux/xfrm.c
index 91dc3783ed94..bd7d18bdb147 100644
--- a/security/selinux/xfrm.c
+++ b/security/selinux/xfrm.c
@@ -230,7 +230,7 @@ static int selinux_xfrm_skb_sid_ingress(struct sk_buff *skb,
 					u32 *sid, int ckall)
 {
 	u32 sid_session = SECSID_NULL;
-	struct sec_path *sp = skb->sp;
+	struct sec_path *sp = skb_sec_path(skb);
 
 	if (sp) {
 		int i;
@@ -408,7 +408,7 @@ int selinux_xfrm_sock_rcv_skb(u32 sk_sid, struct sk_buff *skb,
 			      struct common_audit_data *ad)
 {
 	int i;
-	struct sec_path *sp = skb->sp;
+	struct sec_path *sp = skb_sec_path(skb);
 	u32 peer_sid = SECINITSID_UNLABELED;
 
 	if (sp) {
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 07/13] drivers: net: intel: use secpath helpers in more places
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (5 preceding siblings ...)
  2018-12-10 14:49 ` [PATCH net-next 06/13] net: use skb_sec_path helper in more places Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 08/13] drivers: net: ethernet: mellanox: use skb_sec_path helper Florian Westphal
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Use skb_sec_path and secpath_exists helpers where possible.
This reduces noise in followup patch that removes skb->sp pointer.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c    | 6 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     | 5 +++--
 drivers/net/ethernet/intel/ixgbevf/ipsec.c        | 6 ++++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 4d77f42e035c..8befc7a50f8c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -1065,11 +1065,13 @@ int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
 	struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
 	struct ixgbe_ipsec *ipsec = adapter->ipsec;
 	struct xfrm_state *xs;
+	struct sec_path *sp;
 	struct tx_sa *tsa;
 
-	if (unlikely(!first->skb->sp->len)) {
+	sp = skb_sec_path(first->skb);
+	if (unlikely(!sp->len)) {
 		netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
-			   __func__, first->skb->sp->len);
+			   __func__, sp->len);
 		return 0;
 	}
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 49a4ea38eb07..8ff48dd5a90a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8695,7 +8695,8 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 #endif /* IXGBE_FCOE */
 
 #ifdef CONFIG_IXGBE_IPSEC
-	if (skb->sp && !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
+	if (secpath_exists(skb) &&
+	    !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
 		goto out_drop;
 #endif
 	tso = ixgbe_tso(tx_ring, first, &hdr_len, &ipsec_tx);
@@ -10191,7 +10192,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	 */
 	if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID)) {
 #ifdef CONFIG_IXGBE_IPSEC
-		if (!skb->sp)
+		if (!secpath_exists(skb))
 #endif
 			features &= ~NETIF_F_TSO;
 	}
diff --git a/drivers/net/ethernet/intel/ixgbevf/ipsec.c b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
index e8a3231be0bf..07644e6bf498 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
@@ -450,12 +450,14 @@ int ixgbevf_ipsec_tx(struct ixgbevf_ring *tx_ring,
 	struct ixgbevf_adapter *adapter = netdev_priv(tx_ring->netdev);
 	struct ixgbevf_ipsec *ipsec = adapter->ipsec;
 	struct xfrm_state *xs;
+	struct sec_path *sp;
 	struct tx_sa *tsa;
 	u16 sa_idx;
 
-	if (unlikely(!first->skb->sp->len)) {
+	sp = skb_sec_path(first->skb);
+	if (unlikely(!sp->len)) {
 		netdev_err(tx_ring->netdev, "%s: no xfrm state len = %d\n",
-			   __func__, first->skb->sp->len);
+			   __func__, sp->len);
 		return 0;
 	}
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 2de81f046fb5..49e23afa05a2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -4157,7 +4157,7 @@ static int ixgbevf_xmit_frame_ring(struct sk_buff *skb,
 	first->protocol = vlan_get_protocol(skb);
 
 #ifdef CONFIG_IXGBEVF_IPSEC
-	if (skb->sp && !ixgbevf_ipsec_tx(tx_ring, first, &ipsec_tx))
+	if (secpath_exists(skb) && !ixgbevf_ipsec_tx(tx_ring, first, &ipsec_tx))
 		goto out_drop;
 #endif
 	tso = ixgbevf_tso(tx_ring, first, &hdr_len, &ipsec_tx);
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 08/13] drivers: net: ethernet: mellanox: use skb_sec_path helper
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (6 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 07/13] drivers: net: intel: use secpath helpers " Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 09/13] drivers: net: netdevsim: " Florian Westphal
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Will avoid touching this when sp pointer is removed from sk_buff struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c    | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
index 128a82b1dbfc..f6717c287ff4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
@@ -254,11 +254,13 @@ struct sk_buff *mlx5e_ipsec_handle_tx_skb(struct net_device *netdev,
 	struct mlx5e_ipsec_metadata *mdata;
 	struct mlx5e_ipsec_sa_entry *sa_entry;
 	struct xfrm_state *x;
+	struct sec_path *sp;
 
 	if (!xo)
 		return skb;
 
-	if (unlikely(skb->sp->len != 1)) {
+	sp = skb_sec_path(skb);
+	if (unlikely(sp->len != 1)) {
 		atomic64_inc(&priv->ipsec->sw_stats.ipsec_tx_drop_bundle);
 		goto drop;
 	}
@@ -372,10 +374,11 @@ struct sk_buff *mlx5e_ipsec_handle_rx_skb(struct net_device *netdev,
 bool mlx5e_ipsec_feature_check(struct sk_buff *skb, struct net_device *netdev,
 			       netdev_features_t features)
 {
+	struct sec_path *sp = skb_sec_path(skb);
 	struct xfrm_state *x;
 
-	if (skb->sp && skb->sp->len) {
-		x = skb->sp->xvec[0];
+	if (sp && sp->len) {
+		x = sp->xvec[0];
 		if (x && x->xso.offload_handle)
 			return true;
 	}
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 09/13] drivers: net: netdevsim: use skb_sec_path helper
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (7 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 08/13] drivers: net: ethernet: mellanox: use skb_sec_path helper Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 10/13] xfrm: use secpath_exist where applicable Florian Westphal
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 drivers/net/netdevsim/ipsec.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
index 2dcf6cc269d0..76e11d889bb6 100644
--- a/drivers/net/netdevsim/ipsec.c
+++ b/drivers/net/netdevsim/ipsec.c
@@ -227,18 +227,19 @@ static const struct xfrmdev_ops nsim_xfrmdev_ops = {
 
 bool nsim_ipsec_tx(struct netdevsim *ns, struct sk_buff *skb)
 {
+	struct sec_path *sp = skb_sec_path(skb);
 	struct nsim_ipsec *ipsec = &ns->ipsec;
 	struct xfrm_state *xs;
 	struct nsim_sa *tsa;
 	u32 sa_idx;
 
 	/* do we even need to check this packet? */
-	if (!skb->sp)
+	if (!sp)
 		return true;
 
-	if (unlikely(!skb->sp->len)) {
+	if (unlikely(!sp->len)) {
 		netdev_err(ns->netdev, "no xfrm state len = %d\n",
-			   skb->sp->len);
+			   sp->len);
 		return false;
 	}
 
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 10/13] xfrm: use secpath_exist where applicable
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (8 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 09/13] drivers: net: netdevsim: " Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 11/13] drivers: chelsio: use skb_sec_path helper Florian Westphal
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Will reduce noise when skb->sp is removed later in this series.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/xfrm.h        | 2 +-
 net/xfrm/xfrm_interface.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 71411ebaf765..da781735c21c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1182,7 +1182,7 @@ static inline int __xfrm_policy_check2(struct sock *sk, int dir,
 	if (sk && sk->sk_policy[XFRM_POLICY_IN])
 		return __xfrm_policy_check(sk, ndir, skb, family);
 
-	return	(!net->xfrm.policy_count[dir] && !skb->sp) ||
+	return	(!net->xfrm.policy_count[dir] && !secpath_exists(skb)) ||
 		(skb_dst(skb)->flags & DST_NOPOLICY) ||
 		__xfrm_policy_check(sk, ndir, skb, family);
 }
diff --git a/net/xfrm/xfrm_interface.c b/net/xfrm/xfrm_interface.c
index d679fa0f44b3..6be8c7df15bb 100644
--- a/net/xfrm/xfrm_interface.c
+++ b/net/xfrm/xfrm_interface.c
@@ -251,7 +251,7 @@ static int xfrmi_rcv_cb(struct sk_buff *skb, int err)
 	struct xfrm_if *xi;
 	bool xnet;
 
-	if (err && !skb->sp)
+	if (err && !secpath_exists(skb))
 		return 0;
 
 	x = xfrm_input_state(skb);
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 11/13] drivers: chelsio: use skb_sec_path helper
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (9 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 10/13] xfrm: use secpath_exist where applicable Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 12/13] xfrm: prefer secpath_set over secpath_dup Florian Westphal
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

reduce noise when skb->sp is removed later in the series.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 drivers/crypto/chelsio/chcr_ipsec.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/chelsio/chcr_ipsec.c b/drivers/crypto/chelsio/chcr_ipsec.c
index 461b97e2f1fd..ceaa16b8f72e 100644
--- a/drivers/crypto/chelsio/chcr_ipsec.c
+++ b/drivers/crypto/chelsio/chcr_ipsec.c
@@ -570,6 +570,7 @@ int chcr_ipsec_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct sge_eth_txq *q;
 	struct port_info *pi;
 	dma_addr_t addr[MAX_SKB_FRAGS + 1];
+	struct sec_path *sp;
 	bool immediate = false;
 
 	if (!x->xso.offload_handle)
@@ -578,7 +579,8 @@ int chcr_ipsec_xmit(struct sk_buff *skb, struct net_device *dev)
 	sa_entry = (struct ipsec_sa_entry *)x->xso.offload_handle;
 	kctx_len = sa_entry->kctx_len;
 
-	if (skb->sp->len != 1) {
+	sp = skb_sec_path(skb);
+	if (sp->len != 1) {
 out_free:       dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
 	}
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 12/13] xfrm: prefer secpath_set over secpath_dup
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (10 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 11/13] drivers: chelsio: use skb_sec_path helper Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-10 14:50 ` [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure Florian Westphal
  2018-12-13  4:08 ` [PATCH net-next 0/13] sk_buff: add " Shannon Nelson
  13 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

secpath_set is a wrapper for secpath_dup that will not perform
an allocation if the secpath attached to the skb has a reference count
of one, i.e., it doesn't need to be COW'ed.

Also, secpath_dup doesn't attach the secpath to the skb, it leaves
this to the caller.

Use secpath_set in places that immediately assign the return value to
skb.

This allows to remove skb->sp without touching these spots again.

secpath_dup can eventually be removed in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c         |  9 +++++----
 drivers/net/ethernet/intel/ixgbevf/ipsec.c             |  9 +++++----
 .../ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c  | 10 ++++++----
 net/xfrm/Kconfig                                       |  1 +
 net/xfrm/xfrm_output.c                                 |  7 ++-----
 5 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 8befc7a50f8c..ff85ce5791a3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -1161,6 +1161,7 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
 	struct xfrm_state *xs = NULL;
 	struct ipv6hdr *ip6 = NULL;
 	struct iphdr *ip4 = NULL;
+	struct sec_path *sp;
 	void *daddr;
 	__be32 spi;
 	u8 *c_hdr;
@@ -1200,12 +1201,12 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
 	if (unlikely(!xs))
 		return;
 
-	skb->sp = secpath_dup(skb->sp);
-	if (unlikely(!skb->sp))
+	sp = secpath_set(skb);
+	if (unlikely(!sp))
 		return;
 
-	skb->sp->xvec[skb->sp->len++] = xs;
-	skb->sp->olen++;
+	sp->xvec[sp->len++] = xs;
+	sp->olen++;
 	xo = xfrm_offload(skb);
 	xo->flags = CRYPTO_DONE;
 	xo->status = CRYPTO_SUCCESS;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ipsec.c b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
index 07644e6bf498..5170dd9d8705 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ipsec.c
@@ -548,6 +548,7 @@ void ixgbevf_ipsec_rx(struct ixgbevf_ring *rx_ring,
 	struct xfrm_state *xs = NULL;
 	struct ipv6hdr *ip6 = NULL;
 	struct iphdr *ip4 = NULL;
+	struct sec_path *sp;
 	void *daddr;
 	__be32 spi;
 	u8 *c_hdr;
@@ -587,12 +588,12 @@ void ixgbevf_ipsec_rx(struct ixgbevf_ring *rx_ring,
 	if (unlikely(!xs))
 		return;
 
-	skb->sp = secpath_dup(skb->sp);
-	if (unlikely(!skb->sp))
+	sp = secpath_set(skb);
+	if (unlikely(!sp))
 		return;
 
-	skb->sp->xvec[skb->sp->len++] = xs;
-	skb->sp->olen++;
+	sp->xvec[sp->len++] = xs;
+	sp->olen++;
 	xo = xfrm_offload(skb);
 	xo->flags = CRYPTO_DONE;
 	xo->status = CRYPTO_SUCCESS;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
index f6717c287ff4..53608afd39b6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c
@@ -307,10 +307,11 @@ mlx5e_ipsec_build_sp(struct net_device *netdev, struct sk_buff *skb,
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct xfrm_offload *xo;
 	struct xfrm_state *xs;
+	struct sec_path *sp;
 	u32 sa_handle;
 
-	skb->sp = secpath_dup(skb->sp);
-	if (unlikely(!skb->sp)) {
+	sp = secpath_set(skb);
+	if (unlikely(!sp)) {
 		atomic64_inc(&priv->ipsec->sw_stats.ipsec_rx_drop_sp_alloc);
 		return NULL;
 	}
@@ -322,8 +323,9 @@ mlx5e_ipsec_build_sp(struct net_device *netdev, struct sk_buff *skb,
 		return NULL;
 	}
 
-	skb->sp->xvec[skb->sp->len++] = xs;
-	skb->sp->olen++;
+	sp = skb_sec_path(skb);
+	sp->xvec[sp->len++] = xs;
+	sp->olen++;
 
 	xo = xfrm_offload(skb);
 	xo->flags = CRYPTO_DONE;
diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index 140270a13d54..5d43aaa17027 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -5,6 +5,7 @@ config XFRM
        bool
        depends on NET
        select GRO_CELLS
+       select SKB_EXTENSIONS
 
 config XFRM_OFFLOAD
        bool
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 4ae87c5ce2e3..757c4d11983b 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -218,19 +218,16 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
 	if (xfrm_dev_offload_ok(skb, x)) {
 		struct sec_path *sp;
 
-		sp = secpath_dup(skb->sp);
+		sp = secpath_set(skb);
 		if (!sp) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
 			kfree_skb(skb);
 			return -ENOMEM;
 		}
-		if (skb->sp)
-			secpath_put(skb->sp);
-		skb->sp = sp;
 		skb->encapsulation = 1;
 
 		sp->olen++;
-		sp->xvec[skb->sp->len++] = x;
+		sp->xvec[sp->len++] = x;
 		xfrm_state_hold(x);
 
 		if (skb_is_gso(skb)) {
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (11 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 12/13] xfrm: prefer secpath_set over secpath_dup Florian Westphal
@ 2018-12-10 14:50 ` Florian Westphal
  2018-12-11  8:06   ` Steffen Klassert
  2018-12-13  4:08 ` [PATCH net-next 0/13] sk_buff: add " Shannon Nelson
  13 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-10 14:50 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal

Remove skb->sp and allocate secpath storage via extension
infrastructure.  This also reduces sk_buff bu 8 bytes on x86_64.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Documentation/networking/xfrm_device.txt |  7 +--
 include/linux/skbuff.h                   | 10 ++---
 include/net/xfrm.h                       | 22 +---------
 net/core/skbuff.c                        | 54 +++++++++++++++++++----
 net/xfrm/xfrm_input.c                    | 56 ++++--------------------
 5 files changed, 64 insertions(+), 85 deletions(-)

diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.txt
index 267f55b5f54a..a1c904dc70dc 100644
--- a/Documentation/networking/xfrm_device.txt
+++ b/Documentation/networking/xfrm_device.txt
@@ -111,9 +111,10 @@ the stack in xfrm_input().
 		xfrm_state_hold(xs);
 
 	store the state information into the skb
-		skb->sp = secpath_dup(skb->sp);
-		skb->sp->xvec[skb->sp->len++] = xs;
-		skb->sp->olen++;
+		sp = secpath_set(skb);
+		if (!sp) return;
+		sp->xvec[sp->len++] = xs;
+		sp->olen++;
 
 	indicate the success and/or error status of the offload
 		xo = xfrm_offload(skb);
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3ad9fbeb4ac4..7e8962810066 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -714,9 +714,6 @@ struct sk_buff {
 		struct list_head	tcp_tsorted_anchor;
 	};
 
-#ifdef CONFIG_XFRM
-	struct	sec_path	*sp;
-#endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 	unsigned long		 _nfct;
 #endif
@@ -3905,6 +3902,9 @@ static inline void nf_conntrack_get(struct nf_conntrack *nfct)
 
 #ifdef CONFIG_SKB_EXTENSIONS
 enum skb_ext_id {
+#ifdef CONFIG_XFRM
+	SKB_EXT_SEC_PATH,
+#endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	SKB_EXT_BRIDGE_NF,
 #endif
@@ -4077,7 +4077,7 @@ static inline void skb_init_secmark(struct sk_buff *skb)
 static inline int secpath_exists(const struct sk_buff *skb)
 {
 #ifdef CONFIG_XFRM
-	return skb->sp != NULL;
+	return skb_ext_exist(skb, SKB_EXT_SEC_PATH);
 #else
 	return 0;
 #endif
@@ -4135,7 +4135,7 @@ static inline bool skb_get_dst_pending_confirm(const struct sk_buff *skb)
 static inline struct sec_path *skb_sec_path(const struct sk_buff *skb)
 {
 #ifdef CONFIG_XFRM
-	return skb->sp;
+	return skb_ext_find(skb, SKB_EXT_SEC_PATH);
 #else
 	return NULL;
 #endif
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index da781735c21c..72cdbe8e9939 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1093,7 +1093,6 @@ struct xfrm_offload {
 };
 
 struct sec_path {
-	refcount_t		refcnt;
 	int			len;
 	int			olen;
 
@@ -1101,32 +1100,13 @@ struct sec_path {
 	struct xfrm_offload	ovec[XFRM_MAX_OFFLOAD_DEPTH];
 };
 
-static inline struct sec_path *
-secpath_get(struct sec_path *sp)
-{
-	if (sp)
-		refcount_inc(&sp->refcnt);
-	return sp;
-}
-
-void __secpath_destroy(struct sec_path *sp);
-
-static inline void
-secpath_put(struct sec_path *sp)
-{
-	if (sp && refcount_dec_and_test(&sp->refcnt))
-		__secpath_destroy(sp);
-}
-
-struct sec_path *secpath_dup(struct sec_path *src);
 struct sec_path *secpath_set(struct sk_buff *skb);
 
 static inline void
 secpath_reset(struct sk_buff *skb)
 {
 #ifdef CONFIG_XFRM
-	secpath_put(skb->sp);
-	skb->sp = NULL;
+	skb_ext_del(skb, SKB_EXT_SEC_PATH);
 #endif
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 960406724970..a1181d6eab3b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -606,7 +606,6 @@ static void kfree_skbmem(struct sk_buff *skb)
 void skb_release_head_state(struct sk_buff *skb)
 {
 	skb_dst_drop(skb);
-	secpath_reset(skb);
 	if (skb->destructor) {
 		WARN_ON(in_irq());
 		skb->destructor(skb);
@@ -795,9 +794,6 @@ static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	memcpy(new->cb, old->cb, sizeof(old->cb));
 	skb_dst_copy(new, old);
 	__skb_ext_copy(new, old);
-#ifdef CONFIG_XFRM
-	new->sp			= secpath_get(old->sp);
-#endif
 	__nf_copy(new, old, false);
 
 	/* Note : this field could be in headers_start/headers_end section
@@ -5562,6 +5558,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	[SKB_EXT_BRIDGE_NF] = SKB_EXT_CHUNKSIZEOF(struct nf_bridge_info),
 #endif
+#ifdef CONFIG_XFRM
+	[SKB_EXT_SEC_PATH] = SKB_EXT_CHUNKSIZEOF(struct sec_path),
+#endif
 };
 
 static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
@@ -5570,7 +5569,9 @@ static void *skb_ext_get_ptr(struct skb_ext *ext, enum skb_ext_id id)
 }
 
 static struct skb_ext *skb_ext_cow(unsigned int len,
-				   struct skb_ext *old)
+				   struct skb_ext *old,
+				   unsigned int old_active)
+
 {
 	struct skb_ext *new = kmalloc(len, GFP_ATOMIC);
 
@@ -5585,6 +5586,15 @@ static struct skb_ext *skb_ext_cow(unsigned int len,
 
 	memcpy(new, old, old->chunks * SKB_EXT_ALIGN_VALUE);
 	refcount_set(&new->refcnt, 1);
+
+	if (old_active & (1 << SKB_EXT_SEC_PATH)) {
+		struct sec_path *sp = skb_ext_get_ptr(old, SKB_EXT_SEC_PATH);
+		unsigned int i;
+
+		for (i = 0; i < sp->len; i++)
+			xfrm_state_hold(sp->xvec[i]);
+	}
+
 	__skb_ext_put(old);
 	return new;
 }
@@ -5594,6 +5604,9 @@ static __always_inline unsigned int skb_ext_total_length(void)
 	return 0 +
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 		skb_ext_type_len[SKB_EXT_BRIDGE_NF] +
+#endif
+#ifdef CONFIG_XFRM
+		skb_ext_type_len[SKB_EXT_SEC_PATH] +
 #endif
 		0;
 }
@@ -5635,7 +5648,8 @@ void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
 			/* extension was allocated previously and it
 			 * might be used by a cloned skb. COW needed.
 			 */
-			new = skb_ext_cow(old->chunks * SKB_EXT_ALIGN_VALUE, old);
+			new = skb_ext_cow(old->chunks * SKB_EXT_ALIGN_VALUE, old,
+					  skb->active_extensions);
 			if (!new)
 				return NULL;
 
@@ -5650,7 +5664,8 @@ void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
 	newlen = newoff + skb_ext_type_len[id];
 
 	if (cow_needed)
-		new = skb_ext_cow(newlen * SKB_EXT_ALIGN_VALUE, old);
+		new = skb_ext_cow(newlen * SKB_EXT_ALIGN_VALUE, old,
+				  skb->active_extensions);
 	else
 		new = krealloc(old, newlen * SKB_EXT_ALIGN_VALUE, GFP_ATOMIC);
 	if (!new)
@@ -5665,21 +5680,44 @@ void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id)
 }
 EXPORT_SYMBOL(skb_ext_add);
 
+#ifdef CONFIG_XFRM
+static void skb_ext_put_sp(struct sec_path *sp)
+{
+	unsigned int i;
+
+	for (i = 0; i < sp->len; i++)
+		xfrm_state_put(sp->xvec[i]);
+}
+#endif
+
 void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
 {
 	struct skb_ext *ext;
 
 	skb->active_extensions &= ~(1 << id);
+	ext = skb->extensions;
 	if (skb->active_extensions == 0) {
-		ext = skb->extensions;
 		skb->extensions = NULL;
 		__skb_ext_put(ext);
+#ifdef CONFIG_XFRM
+	} else if (id == SKB_EXT_SEC_PATH &&
+		   refcount_read(&ext->refcnt) == 1) {
+		struct sec_path *sp = skb_ext_get_ptr(ext, SKB_EXT_SEC_PATH);
+
+		skb_ext_put_sp(sp);
+		sp->len = 0;
+#endif
 	}
 }
 EXPORT_SYMBOL(__skb_ext_del);
 
 void __skb_ext_free(struct skb_ext *ext)
 {
+#ifdef CONFIG_XFRM
+	if (__skb_ext_exist(ext, SKB_EXT_SEC_PATH))
+		skb_ext_put_sp(skb_ext_get_ptr(ext, SKB_EXT_SEC_PATH));
+#endif
+
 	kfree(ext);
 }
 EXPORT_SYMBOL(__skb_ext_free);
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index b4db25b244fa..6bc817359b58 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -38,8 +38,6 @@ struct xfrm_trans_cb {
 
 #define XFRM_TRANS_SKB_CB(__skb) ((struct xfrm_trans_cb *)&((__skb)->cb[0]))
 
-static struct kmem_cache *secpath_cachep __ro_after_init;
-
 static DEFINE_SPINLOCK(xfrm_input_afinfo_lock);
 static struct xfrm_input_afinfo const __rcu *xfrm_input_afinfo[AF_INET6 + 1];
 
@@ -111,54 +109,21 @@ static int xfrm_rcv_cb(struct sk_buff *skb, unsigned int family, u8 protocol,
 	return ret;
 }
 
-void __secpath_destroy(struct sec_path *sp)
-{
-	int i;
-	for (i = 0; i < sp->len; i++)
-		xfrm_state_put(sp->xvec[i]);
-	kmem_cache_free(secpath_cachep, sp);
-}
-EXPORT_SYMBOL(__secpath_destroy);
-
-struct sec_path *secpath_dup(struct sec_path *src)
+struct sec_path *secpath_set(struct sk_buff *skb)
 {
-	struct sec_path *sp;
+	struct sec_path *sp, *tmp = skb_ext_find(skb, SKB_EXT_SEC_PATH);
 
-	sp = kmem_cache_alloc(secpath_cachep, GFP_ATOMIC);
+	sp = skb_ext_add(skb, SKB_EXT_SEC_PATH);
 	if (!sp)
 		return NULL;
 
-	sp->len = 0;
-	sp->olen = 0;
+	if (tmp) /* reused existing one (was COW'd if needed) */
+		return sp;
 
+	/* allocated new secpath */
 	memset(sp->ovec, 0, sizeof(sp->ovec));
-
-	if (src) {
-		int i;
-
-		memcpy(sp, src, sizeof(*sp));
-		for (i = 0; i < sp->len; i++)
-			xfrm_state_hold(sp->xvec[i]);
-	}
-	refcount_set(&sp->refcnt, 1);
-	return sp;
-}
-EXPORT_SYMBOL(secpath_dup);
-
-struct sec_path *secpath_set(struct sk_buff *skb)
-{
-	struct sec_path *sp = skb->sp;
-
-	/* Allocate new secpath or COW existing one. */
-	if (!sp || refcount_read(&sp->refcnt) != 1) {
-		sp = secpath_dup(skb->sp);
-		if (!sp)
-			return NULL;
-
-		if (skb->sp)
-			secpath_put(skb->sp);
-		skb->sp = sp;
-	}
+	sp->olen = 0;
+	sp->len = 0;
 
 	return sp;
 }
@@ -552,11 +517,6 @@ void __init xfrm_input_init(void)
 	if (err)
 		gro_cells.cells = NULL;
 
-	secpath_cachep = kmem_cache_create("secpath_cache",
-					   sizeof(struct sec_path),
-					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
-					   NULL);
-
 	for_each_possible_cpu(i) {
 		struct xfrm_trans_tasklet *trans;
 
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure
  2018-12-10 14:50 ` [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure Florian Westphal
@ 2018-12-11  8:06   ` Steffen Klassert
  2018-12-11 10:18     ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2018-12-11  8:06 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev

On Mon, Dec 10, 2018 at 03:50:06PM +0100, Florian Westphal wrote:
>  }
> @@ -552,11 +517,6 @@ void __init xfrm_input_init(void)
>  	if (err)
>  		gro_cells.cells = NULL;
>  
> -	secpath_cachep = kmem_cache_create("secpath_cache",
> -					   sizeof(struct sec_path),
> -					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> -					   NULL);

This is not so nice. Usually we need a secpath per packet for IPsec.
With removing the cache, we have to kmalloc a secpath for each packet.
This might have some performance impact.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure
  2018-12-11  8:06   ` Steffen Klassert
@ 2018-12-11 10:18     ` Florian Westphal
  2018-12-11 10:20       ` Steffen Klassert
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-11 10:18 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Florian Westphal, netdev

Steffen Klassert <steffen.klassert@secunet.com> wrote:
> On Mon, Dec 10, 2018 at 03:50:06PM +0100, Florian Westphal wrote:
> >  }
> > @@ -552,11 +517,6 @@ void __init xfrm_input_init(void)
> >  	if (err)
> >  		gro_cells.cells = NULL;
> >  
> > -	secpath_cachep = kmem_cache_create("secpath_cache",
> > -					   sizeof(struct sec_path),
> > -					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> > -					   NULL);
> 
> This is not so nice. Usually we need a secpath per packet for IPsec.
> With removing the cache, we have to kmalloc a secpath for each packet.
> This might have some performance impact.

I would expect that the extension allocations come from
kmalloc-96 cache in 'ipsec only' case.

I can run a few IPSEC benchmark tests to see if there is measureable
impact.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure
  2018-12-11 10:18     ` Florian Westphal
@ 2018-12-11 10:20       ` Steffen Klassert
  2018-12-12 11:52         ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Steffen Klassert @ 2018-12-11 10:20 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev

On Tue, Dec 11, 2018 at 11:18:41AM +0100, Florian Westphal wrote:
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> > On Mon, Dec 10, 2018 at 03:50:06PM +0100, Florian Westphal wrote:
> > >  }
> > > @@ -552,11 +517,6 @@ void __init xfrm_input_init(void)
> > >  	if (err)
> > >  		gro_cells.cells = NULL;
> > >  
> > > -	secpath_cachep = kmem_cache_create("secpath_cache",
> > > -					   sizeof(struct sec_path),
> > > -					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> > > -					   NULL);
> > 
> > This is not so nice. Usually we need a secpath per packet for IPsec.
> > With removing the cache, we have to kmalloc a secpath for each packet.
> > This might have some performance impact.
> 
> I would expect that the extension allocations come from
> kmalloc-96 cache in 'ipsec only' case.
> 
> I can run a few IPSEC benchmark tests to see if there is measureable
> impact.

That would be good, thanks!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
       [not found]   ` <CAPUCuiADYwjY4kpq76-w9BKL3uiRvNjnmzKG29mCrb=b8YeesA@mail.gmail.com>
@ 2018-12-12  0:07     ` Mat Martineau
  2018-12-12  0:11       ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Mat Martineau @ 2018-12-12  0:07 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 4496 bytes --]


On Mon, 11 Dec 2018, Florian Westphal wrote:

...

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b1831a5ca173..d715736eb734 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h

...

> @@ -3896,6 +3906,113 @@ static inline void nf_conntrack_get(struct nf_conntrack *nfct)
>                 atomic_inc(&nfct->use);
>  }
>  #endif
> +
> +#ifdef CONFIG_SKB_EXTENSIONS
> +enum skb_ext_id {
> +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> +       SKB_EXT_BRIDGE_NF,
> +#endif
> +       SKB_EXT_NUM, /* must be last */
> +};

Could enum skb_ext_id always be defined, with none of the SKB_EXT_* values 
conditionally excluded? In combination with some alternate function 
definitions below (see later comments) I think this could reduce the need 
for CONFIG_SKB_EXTENSIONS preprocessor conditionals throughout the net 
code.

> +
> +/**
> + *     struct skb_ext - sk_buff extensions
> + *     @refcnt: 1 on allocation, deallocated on 0
> + *     @offset: offset to add to @data to obtain extension address
> + *     @chunks: size currently allocated, stored in SKB_EXT_ALIGN_SHIFT units
> + *     @data: start of extension data, variable sized
> + *
> + *     Note: offsets/lengths are stored in chunks of 8 bytes, this allows
> + *     to use 'u8' types while allowing up to 2kb worth of extension data.
> + */
> +struct skb_ext {
> +       refcount_t refcnt;
> +       u8 offset[SKB_EXT_NUM]; /* in chunks of 8 bytes */
> +       u8 chunks;              /* same */
> +       char data[0] __aligned(8);
> +};
> +
> +void *skb_ext_add(struct sk_buff *skb, enum skb_ext_id id);
> +void __skb_ext_del(struct sk_buff *skb, enum skb_ext_id id);
> +void __skb_ext_free(struct skb_ext *ext);
> +
> +static inline void __skb_ext_put(struct skb_ext *ext)
> +{
> +       if (ext && refcount_dec_and_test(&ext->refcnt))
> +               __skb_ext_free(ext);
> +}
> +
> +static inline void skb_ext_put(struct sk_buff *skb)
> +{
> +       if (skb->active_extensions)
> +               __skb_ext_put(skb->extensions);
> +}
> +
> +static inline void skb_ext_get(struct sk_buff *skb)
> +{
> +       if (skb->active_extensions) {
> +               struct skb_ext *ext = skb->extensions;
> +
> +               if (ext)
> +                       refcount_inc(&ext->refcnt);
> +       }
> +}
> +
> +static inline void __skb_ext_copy(struct sk_buff *dst,
> +                                 const struct sk_buff *src)
> +{
> +       dst->active_extensions = src->active_extensions;
> +
> +       if (src->active_extensions) {
> +               struct skb_ext *ext = src->extensions;
> +
> +               if (ext)
> +                       refcount_inc(&ext->refcnt);
> +               dst->extensions = ext;
> +       }
> +}
> +
> +static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *src)
> +{
> +       skb_ext_put(dst);
> +       __skb_ext_copy(dst, src);
> +}
> +
> +static inline bool __skb_ext_exist(const struct skb_ext *ext, enum skb_ext_id i)
> +{
> +       return !!ext->offset[i];
> +}
> +
> +static inline bool skb_ext_exist(const struct sk_buff *skb, enum skb_ext_id id)
> +{
> +       return skb->active_extensions & (1 << id);
> +}
> +
> +static inline void skb_ext_del(struct sk_buff *skb, enum skb_ext_id id)
> +{
> +       if (skb_ext_exist(skb, id))
> +               __skb_ext_del(skb, id);
> +}
> +
> +static inline void *skb_ext_find(const struct sk_buff *skb, enum skb_ext_id id)
> +{
> +       if (skb_ext_exist(skb, id)) {
> +               struct skb_ext *ext = skb->extensions;
> +
> +               if (ext && __skb_ext_exist(ext, id))
> +                       return (void *)ext + (ext->offset[id] << 3);
> +       }
> +
> +       return NULL;
> +}
> +#else
> +static inline void skb_ext_put(struct sk_buff *skb) {}
> +static inline void skb_ext_get(struct sk_buff *skb) {}
> +static inline void skb_ext_del(struct sk_buff *skb, int unused) {}
> +static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
> +static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}

For the !CONFIG_SKB_EXTENSIONS case, an alternate definition of 
skb_ext_exist() that always returns false would be useful to reduce the 
need for preprocessor conditionals. A similar skb_ext_find() that always 
returns NULL might also be helpful.

> +#endif /* CONFIG_SKB_EXTENSIONS */
> +
>  #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
>  static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
>  {

Thanks,

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12  0:07     ` Mat Martineau
@ 2018-12-12  0:11       ` Florian Westphal
  2018-12-12 11:59         ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-12  0:11 UTC (permalink / raw)
  To: Mat Martineau; +Cc: Florian Westphal, netdev

Mat Martineau <mathew.j.martineau@linux.intel.com> wrote:
> 
> On Mon, 11 Dec 2018, Florian Westphal wrote:
> 
> ...
> 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index b1831a5ca173..d715736eb734 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> 
> ...
> 
> > @@ -3896,6 +3906,113 @@ static inline void nf_conntrack_get(struct nf_conntrack *nfct)
> >                 atomic_inc(&nfct->use);
> >  }
> >  #endif
> > +
> > +#ifdef CONFIG_SKB_EXTENSIONS
> > +enum skb_ext_id {
> > +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> > +       SKB_EXT_BRIDGE_NF,
> > +#endif
> > +       SKB_EXT_NUM, /* must be last */
> > +};
> 
> Could enum skb_ext_id always be defined, with none of the SKB_EXT_* values
> conditionally excluded?

Yes, this is certainly possible.

> In combination with some alternate function
> definitions below (see later comments) I think this could reduce the need
> for CONFIG_SKB_EXTENSIONS preprocessor conditionals throughout the net code.

Perhaps.  I'll give this a shot.

> > +static inline void skb_ext_del(struct sk_buff *skb, int unused) {}
> > +static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
> > +static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
> 
> For the !CONFIG_SKB_EXTENSIONS case, an alternate definition of
> skb_ext_exist() that always returns false would be useful to reduce the need
> for preprocessor conditionals. A similar skb_ext_find() that always returns
> NULL might also be helpful.

I'll do this and double-check that gcc removes the branches accordingly.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure
  2018-12-11 10:20       ` Steffen Klassert
@ 2018-12-12 11:52         ` Florian Westphal
  0 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-12 11:52 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: Florian Westphal, netdev

Steffen Klassert <steffen.klassert@secunet.com> wrote:
> > I can run a few IPSEC benchmark tests to see if there is measureable
> > impact.
> 
> That would be good, thanks!

Will do this later today.

One alternative would be to always allocate the entire maximum possible
extension length when the first extension is to be added.

In this case we could 'replace' secpath cache with a 'extension cache'.

The only downside is the increase in allocation size.
The upside is that it simplifies extension management -- no need
to reallocate, only action needed on 'extension add' is a possible
COW action.

Allocated memory is left un-initialized aside from 8 byte metadata
space.

We could later move to on-demand/reallocation if the full allocation
size would become too large in the future.

I will have a look at 'allocate everything' too, unless someone
tells me that this won't be acceptable (cost including future
mptcp extension would be ~160 bytes or so).

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12  0:11       ` Florian Westphal
@ 2018-12-12 11:59         ` Florian Westphal
  2018-12-12 16:59           ` Mat Martineau
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-12 11:59 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Mat Martineau, netdev

Florian Westphal <fw@strlen.de> wrote:
> Mat Martineau <mathew.j.martineau@linux.intel.com> wrote:
> > > +#ifdef CONFIG_SKB_EXTENSIONS
> > > +enum skb_ext_id {
> > > +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> > > +       SKB_EXT_BRIDGE_NF,
> > > +#endif
> > > +       SKB_EXT_NUM, /* must be last */
> > > +};
> > 
> > Could enum skb_ext_id always be defined, with none of the SKB_EXT_* values
> > conditionally excluded?
> 
> Yes, this is certainly possible.

Did this today, did not like the result (see below).

> > In combination with some alternate function
> > definitions below (see later comments) I think this could reduce the need
> > for CONFIG_SKB_EXTENSIONS preprocessor conditionals throughout the net code.
> 
> Perhaps.  I'll give this a shot.
> 
> > > +static inline void skb_ext_del(struct sk_buff *skb, int unused) {}
> > > +static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
> > > +static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
> > 
> > For the !CONFIG_SKB_EXTENSIONS case, an alternate definition of
> > skb_ext_exist() that always returns false would be useful to reduce the need
> > for preprocessor conditionals. A similar skb_ext_find() that always returns
> > NULL might also be helpful.
> 
> I'll do this and double-check that gcc removes the branches accordingly.

Problem is that the alternate skb_ext_exist() becomes bloated, and
instead gets the #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)/XFRM, etc.
junk.

Otherwise, gcc can't remove the branch in case e.g. CONFIG_XFRM=y
BRIDGE_NF=n.

With current approach, SKB_EXT_BRIDGE_NF id is hidden, i.e. all
calls to skb_ext_add/exists/del(, SKB_EXT_BRIDGE_NF) result in a build
error unless they appear in bridge-netfilter-only sections of the code
(this is intentional).

So, no extra code is generated.
The empty stubs are there only in case no subsystem requires the
extension infrastructure, and in that case, there should be no calls
to skb_ext_exist() in the first place.

Gist is that CONFIG_SKB_EXTENSIONS=y can still mean that all the
'remove bridge' or 'add secpath' etc. calls are not needed, but thats
something the compiler can't know unless called function explicitly
considers this in the inlined section.

Unfortunately, the existing 'id not defined in enum' seems the best
approach to ensure that.

If you have a better idea, let me know.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-10 14:49 ` [PATCH net-next 02/13] sk_buff: add skb extension infrastructure Florian Westphal
       [not found]   ` <CAPUCuiADYwjY4kpq76-w9BKL3uiRvNjnmzKG29mCrb=b8YeesA@mail.gmail.com>
@ 2018-12-12 14:44   ` Willem de Bruijn
  2018-12-12 15:40     ` Florian Westphal
  2018-12-12 17:23   ` Eric Dumazet
  2018-12-12 18:16   ` Stephen Suryaputra
  3 siblings, 1 reply; 45+ messages in thread
From: Willem de Bruijn @ 2018-12-12 14:44 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Network Development

On Mon, Dec 10, 2018 at 11:21 AM Florian Westphal <fw@strlen.de> wrote:
>
> The (out-of-tree) Multipath-TCP implementation needs a significant amount
> of extra space in the skb control buffer.
>
> Increasing skb->cb[] size in mainline is a non-starter for memory and
> and performance reasons (f.e. increase in cb size also moves several
> frequently-accessed fields to other cache lines).
>
> One approach that might work for MPTCP is to extend skb_shared_info instead
> of sk_buff.  However, this comes with other drawbacks, e.g.  it either
> needs special skb allocation to make sure there is enough space for such
> 'extended shinfo' at the end of data buffer (which would make this only
> useable for the MPTCP tx path) or such a change would increase size of
> skb_shared_info.
>
> This work adds an extension infrastructure for sk_buff:
> 1. extension memory is released when the sk_buff is free'd.
> 2. data is shared after cloning an skb.
>
> This is also how xfrm and bridge netfilter skb-associated data
> (skb->sp and skb->nf_bridge) is handled.
>
> This series removes the nf_bridge pointer and makes bridge netfilter
> use the extension infrastructure instead.
>
> Two new members are added to sk_buff:
> 1. 'active_extensions' byte (filling a hole), telling which extensions
>    have been allocated for the skb.
> 2. extension pointer, located at the end of the sk_buff.
>    If active_extensions is 0, its content is undefined.
>
> The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.
>
> This adds extra code to skb clone and free paths (to deal with
> refcount/free of extension area) but replaces the existing code that
> deals with skb->nf_bridge.
>
> This patch only adds the basic infrastructure, the nf_bridge conversion
> is added in the next patch.
>
> Conversion of skb->sp (ipsec/xfrm secpath) to an skb extension is planned
> as a followup.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  include/linux/skbuff.h | 119 +++++++++++++++++++++++++++++++++++-
>  net/Kconfig            |   3 +
>  net/core/skbuff.c      | 133 +++++++++++++++++++++++++++++++++++++++++
>  net/ipv4/ip_output.c   |   1 +
>  net/ipv6/ip6_output.c  |   1 +
>  5 files changed, 256 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b1831a5ca173..d715736eb734 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -245,6 +245,7 @@ struct iov_iter;
>  struct napi_struct;
>  struct bpf_prog;
>  union bpf_attr;
> +struct skb_ext;
>
>  #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
>  struct nf_conntrack {
> @@ -636,6 +637,7 @@ typedef unsigned char *sk_buff_data_t;
>   *     @queue_mapping: Queue mapping for multiqueue devices
>   *     @xmit_more: More SKBs are pending for this queue
>   *     @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
> + *     @active_extensions: active extensions (skb_ext_id types)
>   *     @ndisc_nodetype: router type (from link layer)
>   *     @ooo_okay: allow the mapping of a socket to a queue to be changed
>   *     @l4_hash: indicate hash is a canonical 4-tuple hash over transport
> @@ -665,6 +667,7 @@ typedef unsigned char *sk_buff_data_t;
>   *     @data: Data head pointer
>   *     @truesize: Buffer size
>   *     @users: User count - see {datagram,tcp}.c
> + *     @extensions: allocated extensions, valid if active_extensions is nonzero
>   */
>
>  struct sk_buff {
> @@ -747,7 +750,9 @@ struct sk_buff {
>                                 head_frag:1,
>                                 xmit_more:1,
>                                 pfmemalloc:1;
> -
> +#ifdef CONFIG_SKB_EXTENSIONS
> +       __u8                    active_extensions;
> +#endif

This byte could be saved by moving the bits to the first byte of the
new extension. The likely cold cacheline now needs to be checked, but
only if the pointer is non-NULL. If exclusively using accessor
functions to access the struct, even this can be avoided if encoding
the first 3 or 7 active extensions in the pointer itself.

>         /* fields enclosed in headers_start/headers_end are copied
>          * using a single memcpy() in __copy_skb_header()
>          */
> @@ -869,6 +874,11 @@ struct sk_buff {
>                                 *data;
>         unsigned int            truesize;
>         refcount_t              users;
> +
> +#ifdef CONFIG_SKB_EXTENSIONS
> +       /* only useable after checking ->active_extensions != 0 */
> +       struct skb_ext          *extensions;
> +#endif
>  };

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 14:44   ` Willem de Bruijn
@ 2018-12-12 15:40     ` Florian Westphal
  2018-12-12 15:45       ` Willem de Bruijn
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-12 15:40 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Florian Westphal, Network Development

Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > +#ifdef CONFIG_SKB_EXTENSIONS
> > +       __u8                    active_extensions;
> > +#endif
> 
> This byte could be saved by moving the bits to the first byte of the
> new extension.

I tried to do this, but could not resolve following problem:
- extensions a and b are active
- skb is cloned
- extension a is to be disabled

In this patch series, we can just clear the 'a' bit from
from clone->active_extensions.

But if its part of the extension itself, we must first need to
be able to duplicate the extension blob before we can disable
the extension, which means that attempt to disable an extension
would now fail on -ENOMEM.  I think disabling should always work.
(its not a problem at the moment for either secpath or bridge
 as both use distinct pointers in sk_buff).

> The likely cold cacheline now needs to be checked, but
> only if the pointer is non-NULL.

The upside of the 'active_extension' member is that we can keep the
pointer in uninitialized state for the active_extensions == 0 case,
i.e., when allocating a new skb skb->extensions is not initialized.

> If exclusively using accessor
> functions to access the struct, even this can be avoided if encoding
> the first 3 or 7 active extensions in the pointer itself.

Yes, that would work, but requires to init the pointer on skb allocation
plus a few tricks to align the allocation so we have three bits to use
on arches where kmalloc doesn't guarantee 8-byte-aligned pointers.

Would also need a 'plan b' in place when extension number 4 arrives.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 15:40     ` Florian Westphal
@ 2018-12-12 15:45       ` Willem de Bruijn
  0 siblings, 0 replies; 45+ messages in thread
From: Willem de Bruijn @ 2018-12-12 15:45 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Network Development

On Wed, Dec 12, 2018 at 10:40 AM Florian Westphal <fw@strlen.de> wrote:
>
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > > +#ifdef CONFIG_SKB_EXTENSIONS
> > > +       __u8                    active_extensions;
> > > +#endif
> >
> > This byte could be saved by moving the bits to the first byte of the
> > new extension.
>
> I tried to do this, but could not resolve following problem:
> - extensions a and b are active
> - skb is cloned
> - extension a is to be disabled
>
> In this patch series, we can just clear the 'a' bit from
> from clone->active_extensions.
>
> But if its part of the extension itself, we must first need to
> be able to duplicate the extension blob before we can disable
> the extension, which means that attempt to disable an extension
> would now fail on -ENOMEM.  I think disabling should always work.

Absolutely. I certainly overlooked that complicating factor.

> (its not a problem at the moment for either secpath or bridge
>  as both use distinct pointers in sk_buff).
>
> > The likely cold cacheline now needs to be checked, but
> > only if the pointer is non-NULL.
>
> The upside of the 'active_extension' member is that we can keep the
> pointer in uninitialized state for the active_extensions == 0 case,
> i.e., when allocating a new skb skb->extensions is not initialized.
>
> > If exclusively using accessor
> > functions to access the struct, even this can be avoided if encoding
> > the first 3 or 7 active extensions in the pointer itself.
>
> Yes, that would work, but requires to init the pointer on skb allocation
> plus a few tricks to align the allocation so we have three bits to use
> on arches where kmalloc doesn't guarantee 8-byte-aligned pointers.
>
> Would also need a 'plan b' in place when extension number 4 arrives.

My assumed fallback was the first byte(s) of the extension, but as you
point out above, we cannot rely on that.

Okay, in that case the current solution is definitely preferable.

Ack also on the zeroing and alignment. Thanks for pointing out the
various pros and cons.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 11:59         ` Florian Westphal
@ 2018-12-12 16:59           ` Mat Martineau
  0 siblings, 0 replies; 45+ messages in thread
From: Mat Martineau @ 2018-12-12 16:59 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2107 bytes --]


On Wed, 12 Dec 2018, Florian Westphal wrote:

> Florian Westphal <fw@strlen.de> wrote:
>> Mat Martineau <mathew.j.martineau@linux.intel.com> wrote:
>>>> +#ifdef CONFIG_SKB_EXTENSIONS
>>>> +enum skb_ext_id {
>>>> +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
>>>> +       SKB_EXT_BRIDGE_NF,
>>>> +#endif
>>>> +       SKB_EXT_NUM, /* must be last */
>>>> +};
>>>
>>> Could enum skb_ext_id always be defined, with none of the SKB_EXT_* values
>>> conditionally excluded?
>>
>> Yes, this is certainly possible.
>
> Did this today, did not like the result (see below).
>
>>> In combination with some alternate function
>>> definitions below (see later comments) I think this could reduce the need
>>> for CONFIG_SKB_EXTENSIONS preprocessor conditionals throughout the net code.
>>
>> Perhaps.  I'll give this a shot.
>>
>>>> +static inline void skb_ext_del(struct sk_buff *skb, int unused) {}
>>>> +static inline void __skb_ext_copy(struct sk_buff *d, const struct sk_buff *s) {}
>>>> +static inline void skb_ext_copy(struct sk_buff *dst, const struct sk_buff *s) {}
>>>
>>> For the !CONFIG_SKB_EXTENSIONS case, an alternate definition of
>>> skb_ext_exist() that always returns false would be useful to reduce the need
>>> for preprocessor conditionals. A similar skb_ext_find() that always returns
>>> NULL might also be helpful.
>>
>> I'll do this and double-check that gcc removes the branches accordingly.
>
> Problem is that the alternate skb_ext_exist() becomes bloated, and
> instead gets the #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)/XFRM, etc.
> junk.
>
> Otherwise, gcc can't remove the branch in case e.g. CONFIG_XFRM=y
> BRIDGE_NF=n.
>

Thanks for looking at this. You're definitely right, there will be cases 
where some CONFIG_SKB_EXTENSION users will be enabled and not others, and 
that needs to be handled cleanly.

> If you have a better idea, let me know.
>

I think the best thing is to keep the skb_ext_* functions you have, and 
CONFIG_SKB_EXTENSION-dependent features like XFRM/BRIDGE_NF/etc can define 
alternate inline functions or macros that call skb_ext_exist() if they 
benefit from that.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-10 14:49 ` [PATCH net-next 02/13] sk_buff: add skb extension infrastructure Florian Westphal
       [not found]   ` <CAPUCuiADYwjY4kpq76-w9BKL3uiRvNjnmzKG29mCrb=b8YeesA@mail.gmail.com>
  2018-12-12 14:44   ` Willem de Bruijn
@ 2018-12-12 17:23   ` Eric Dumazet
  2018-12-12 18:44     ` Florian Westphal
  2018-12-12 18:16   ` Stephen Suryaputra
  3 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2018-12-12 17:23 UTC (permalink / raw)
  To: Florian Westphal, netdev



On 12/10/2018 06:49 AM, Florian Westphal wrote:
> The (out-of-tree) Multipath-TCP implementation needs a significant amount
> of extra space in the skb control buffer.

Which skbs ? Input or output path ?

> 
> Increasing skb->cb[] size in mainline is a non-starter for memory and
> and performance reasons (f.e. increase in cb size also moves several
> frequently-accessed fields to other cache lines).
> 
> One approach that might work for MPTCP is to extend skb_shared_info instead
> of sk_buff.  However, this comes with other drawbacks, e.g.  it either
> needs special skb allocation to make sure there is enough space for such
> 'extended shinfo' at the end of data buffer (which would make this only
> useable for the MPTCP tx path) or such a change would increase size of
> skb_shared_info.
> 
> This work adds an extension infrastructure for sk_buff:
> 1. extension memory is released when the sk_buff is free'd.
> 2. data is shared after cloning an skb.
> 

This seems additional atomic increments and decrements all over the places,
and code bloat for a very precise reason :

      skb->cb[] is too small.

We do not want to increase skb->cb[] for two reasons, the first one being the killer.

1) we clear it at skb allocation, and copy it at skb cloning.
2) extra memory cost.

Why can't we have another skb->cb2[] field that is not cleared/copied by skb functions at all ?

Each layer using skb->cb2[] would be responsible to fully manage it.

I do not know what are MPTCP needs, but I doubt adding XX bytes to skb will
have serious memory cost impact for any MPTCP users.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-10 14:49 ` [PATCH net-next 02/13] sk_buff: add skb extension infrastructure Florian Westphal
                     ` (2 preceding siblings ...)
  2018-12-12 17:23   ` Eric Dumazet
@ 2018-12-12 18:16   ` Stephen Suryaputra
  2018-12-12 18:38     ` Florian Westphal
  2018-12-13  0:38     ` David Miller
  3 siblings, 2 replies; 45+ messages in thread
From: Stephen Suryaputra @ 2018-12-12 18:16 UTC (permalink / raw)
  To: fw; +Cc: netdev

On Mon, Dec 10, 2018 at 11:20 AM Florian Westphal <fw@strlen.de> wrote:

> +#ifdef CONFIG_SKB_EXTENSIONS
> +enum skb_ext_id {
> +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> +       SKB_EXT_BRIDGE_NF,
> +#endif
> +       SKB_EXT_NUM, /* must be last */
> +};

How about when proprietary extensions is desired? There are cases
where our system has to pass special metadata from input to output in
skbs.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 18:16   ` Stephen Suryaputra
@ 2018-12-12 18:38     ` Florian Westphal
  2018-12-13  0:38     ` David Miller
  1 sibling, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-12 18:38 UTC (permalink / raw)
  To: Stephen Suryaputra; +Cc: fw, netdev

Stephen Suryaputra <ssuryaextr@gmail.com> wrote:
> On Mon, Dec 10, 2018 at 11:20 AM Florian Westphal <fw@strlen.de> wrote:
> 
> > +#ifdef CONFIG_SKB_EXTENSIONS
> > +enum skb_ext_id {
> > +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> > +       SKB_EXT_BRIDGE_NF,
> > +#endif
> > +       SKB_EXT_NUM, /* must be last */
> > +};
> 
> How about when proprietary extensions is desired? There are cases
> where our system has to pass special metadata from input to output in
> skbs.

Sorry, I don't care about out-of-tree work.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 17:23   ` Eric Dumazet
@ 2018-12-12 18:44     ` Florian Westphal
  2018-12-12 20:17       ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-12 18:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netdev

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 12/10/2018 06:49 AM, Florian Westphal wrote:
> > The (out-of-tree) Multipath-TCP implementation needs a significant amount
> > of extra space in the skb control buffer.
> 
> Which skbs ? Input or output path ?

Both.

> > This work adds an extension infrastructure for sk_buff:
> > 1. extension memory is released when the sk_buff is free'd.
> > 2. data is shared after cloning an skb.
> > 
> 
> This seems additional atomic increments and decrements all over the places,
> and code bloat for a very precise reason :

No, it replaces the atomic dec/inc of nf_bridge and secpath.
If skb passes neither bridge netfilter nor ipsec, no atomic op is added.

If it passes either or, one atomic inc and dec is done, just like
current kernel (skb->sp or skb->nf_bridge refcount).

If it passes both (unlikely but possible) its now one instead of two,
or two (if one operation happens in different netns for instance, due to
skb scrubbing).

>       skb->cb[] is too small.
> 
> We do not want to increase skb->cb[] for two reasons, the first one being the killer.
> 
> 1) we clear it at skb allocation, and copy it at skb cloning.

Right.

> 2) extra memory cost.
> 
> Why can't we have another skb->cb2[] field that is not cleared/copied by skb functions at all ?
> 
> Each layer using skb->cb2[] would be responsible to fully manage it.

We could do that, its probably enough for mptcp needs.
This would keep nf_bridge and secpath pointers as-is and increase
skb truesize.

If you prefer that, ok, but I don't see why we can't unify them behind
a single layer?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 18:44     ` Florian Westphal
@ 2018-12-12 20:17       ` Eric Dumazet
  2018-12-12 20:52         ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2018-12-12 20:17 UTC (permalink / raw)
  To: Florian Westphal, Eric Dumazet; +Cc: netdev



On 12/12/2018 10:44 AM, Florian Westphal wrote:

> We could do that, its probably enough for mptcp needs.
> This would keep nf_bridge and secpath pointers as-is and increase
> skb truesize.
> 
> If you prefer that, ok, but I don't see why we can't unify them behind
> a single layer?
> 

Well, for a start we do not use nf_brifge or secpath.

XDP is all about not unifying because unifying has a cost.

Do we really want to slow down the stack just because MPTCP is coming ?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 20:17       ` Eric Dumazet
@ 2018-12-12 20:52         ` Florian Westphal
  2018-12-13  5:40           ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-12 20:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netdev

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > If you prefer that, ok, but I don't see why we can't unify them behind
> > a single layer?
> 
> Well, for a start we do not use nf_brifge or secpath.

Then the extension framework isn't built and the result
is exactly the same as before these patches: helpers
turn into empty inline stubs.

If its built, sk_buff size is reduced by up to one pointer,
and the added one isn't initialised at allocation time.

sk_buff layout with the extension patches is almost same
as with XFRM=n BRIDGE_NF=n (because nf_bridge and sp pointers
get removed), there is only one additional pointer at the end,
not inited at alloc time.  The other new member fills a hole.

> XDP is all about not unifying because unifying has a cost.
> 
> Do we really want to slow down the stack just because MPTCP is coming ?

I don't want to slow down the stack.  All places that gain a helper call
do so instead of the nf_bridge/secpath one.  They only expand into code
if at least one of secpath or nf_bridge are built in.
Both struct secpath and nf_bridge have their refcounts removed, so no
additional reference counts are added.

If you still think this is proposal is a bad idea, ok, let me know
and I will stop working on this.

MPTCP can then just add skb->cb2[], but neither nf_bridge nor secpath can
use it.

If not, I will send another iteration that just allocates the entire
extension space if first extension is added, it simplifies allocation
handling a little.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 18:16   ` Stephen Suryaputra
  2018-12-12 18:38     ` Florian Westphal
@ 2018-12-13  0:38     ` David Miller
  1 sibling, 0 replies; 45+ messages in thread
From: David Miller @ 2018-12-13  0:38 UTC (permalink / raw)
  To: ssuryaextr; +Cc: fw, netdev

From: Stephen Suryaputra <ssuryaextr@gmail.com>
Date: Wed, 12 Dec 2018 13:16:57 -0500

> On Mon, Dec 10, 2018 at 11:20 AM Florian Westphal <fw@strlen.de> wrote:
> 
>> +#ifdef CONFIG_SKB_EXTENSIONS
>> +enum skb_ext_id {
>> +#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
>> +       SKB_EXT_BRIDGE_NF,
>> +#endif
>> +       SKB_EXT_NUM, /* must be last */
>> +};
> 
> How about when proprietary extensions is desired? There are cases
> where our system has to pass special metadata from input to output in
> skbs.

Proprietary extensions will absolutely not be taken into consideration
when designing upstream features in the Linux kernel.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 0/13] sk_buff: add extension infrastructure
  2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
                   ` (12 preceding siblings ...)
  2018-12-10 14:50 ` [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure Florian Westphal
@ 2018-12-13  4:08 ` Shannon Nelson
  13 siblings, 0 replies; 45+ messages in thread
From: Shannon Nelson @ 2018-12-13  4:08 UTC (permalink / raw)
  To: fw; +Cc: netdev

On Mon, Dec 10, 2018 at 8:21 AM Florian Westphal <fw@strlen.de> wrote:
>
> The (out-of-tree) Multipath-TCP implementation needs to map logical mptcp
> sequence numbers to the tcp sequence numbers used by individual subflows.
> This DSS mapping is read/written from tcp option space on receive and
> written to tcp option space on transmitted tcp packets that are part of
> and MPTCP connection.
>
> Increasing skb->cb[] size in mainline to store the DSS mapping
> is a non-starter for memory and and performance reasons
> (f.e. increase in cb size also moves several frequently-accessed fields
>  to other cache lines).
>
> Extend skb_shared_info or adding a private data field to skb fclones
> doesn't work for incoming skb, so a different DSS propagation method
> would be required for the receive side.
>
> The current MPTCP implementation adds an additional mptcp specific
> pointer to sk_buff.
>
> This series adds an extension infrastructure for sk_buff instead:
> 1. extension memory is released when the sk_buff is free'd.
> 2. data is shared after cloning an skb.
> 3. adding extension to an skb will COW the extension buffer if needed.
>
> This is also how xfrm and bridge_nf extra data (skb->sp, skb->nf_bridge)
> are handled.
>
> MPTCP could then add a new SKB_EXT_MPTCP_DSS (or similar) to store the
> mapping for tx and rx processing.
>
> Two new members are added to sk_buff:
> 1. 'active_extensions' byte (filling a hole), telling which extensions
>    are available for this skb.
> 2. extension pointer, located at the end of the sk_buff.
>    If the active_extensions byte is 0, the pointer is undefined.
>
> Third patch converts nf_bridge to use the extension infrastructure:
> The 'nf_bridge' pointer is removed, i.e. sk_buff size remains the same.
>
> After this, there are a few preparation patches to reduce "skb->sp"
> usage by using the secpath helper functions instead.
>
> Last patch converts skb->sp, secpath information gets stored as
> new SKB_EXT_SEC_PATH, so the 'sp' pointer is removed from skbuff.
>
> Extra code added to skb clone and free paths (to deal with refcount/free
> of extension area) replace the existing code that does the same for
> skb->nf_bridge and skb->secpath.
>
> I don't see any other in-tree users that could benefit from this
> infrastructure, it doesn't make sense to add an extension just for the sake
> of a single flag bit (like skb->nf_trace).
>
> Changes since RFC:
>
> Convert secpath.
>
> Unlike nf_bridge, the secpath struct needs to hold reference on the
> xfrm state structure(s), thus handling gets more complicated when
> an existing secpath extension has to be COW'd (we need to take additional
> reference count on the xfrm states contained in the new copy).
>
> Florian Westphal (13):
>       netfilter: avoid using skb->nf_bridge directly
>       sk_buff: add skb extension infrastructure
>       net: convert bridge_nf to use skb extension infrastructure
>       xfrm: change secpath_set to return secpath struct, not error value
>       net: move secpath_exist helper to sk_buff.h
>       net: use skb_sec_path helper in more places
>       drivers: net: intel: use secpath helpers in more places
>       drivers: net: ethernet: mellanox: use skb_sec_path helper
>       drivers: net: netdevsim: use skb_sec_path helper
>       xfrm: use secpath_exist where applicable
>       drivers: chelsio: use skb_sec_path helper
>       xfrm: prefer secpath_set over secpath_dup
>       net: switch secpath to use skb extension infrastructure
>
>  Documentation/networking/xfrm_device.txt                      |    7
>  drivers/crypto/chelsio/chcr_ipsec.c                           |    4
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c                |   15
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c                 |    5
>  drivers/net/ethernet/intel/ixgbevf/ipsec.c                    |   15
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c             |    2
>  drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c |   19 -
>  drivers/net/netdevsim/ipsec.c                                 |    7
>  include/linux/netfilter_bridge.h                              |   33 +
>  include/linux/skbuff.h                                        |  160 +++++++-
>  include/net/netfilter/br_netfilter.h                          |   14
>  include/net/xfrm.h                                            |   40 --
>  net/Kconfig                                                   |    4
>  net/bridge/br_netfilter_hooks.c                               |   39 --
>  net/bridge/br_netfilter_ipv6.c                                |    4
>  net/core/skbuff.c                                             |  182 +++++++++-
>  net/ipv4/esp4.c                                               |    9
>  net/ipv4/esp4_offload.c                                       |   15
>  net/ipv4/ip_output.c                                          |    1
>  net/ipv4/netfilter/nf_reject_ipv4.c                           |    6
>  net/ipv6/esp6.c                                               |    9
>  net/ipv6/esp6_offload.c                                       |   15
>  net/ipv6/ip6_output.c                                         |    1
>  net/ipv6/netfilter/nf_reject_ipv6.c                           |   10
>  net/ipv6/xfrm6_input.c                                        |    8
>  net/netfilter/nf_log_common.c                                 |   20 -
>  net/netfilter/nf_queue.c                                      |   50 +-
>  net/netfilter/nfnetlink_queue.c                               |   23 -
>  net/netfilter/nft_meta.c                                      |    2
>  net/netfilter/nft_xfrm.c                                      |    2
>  net/netfilter/xt_physdev.c                                    |    2
>  net/netfilter/xt_policy.c                                     |    2
>  net/xfrm/Kconfig                                              |    1
>  net/xfrm/xfrm_device.c                                        |    4
>  net/xfrm/xfrm_input.c                                         |   76 +---
>  net/xfrm/xfrm_interface.c                                     |    2
>  net/xfrm/xfrm_output.c                                        |    7
>  net/xfrm/xfrm_policy.c                                        |   19 -
>  security/selinux/xfrm.c                                       |    4
>  39 files changed, 553 insertions(+), 285 deletions(-)
>

For the changes in ixgbe and netdevsim:
Acked-by: Shannon Nelson <shannon.lee.nelson@gmail.com>

-- 
==============================================
Mr. Shannon Nelson         Parents can't afford to be squeamish.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-12 20:52         ` Florian Westphal
@ 2018-12-13  5:40           ` Eric Dumazet
  2018-12-13  9:27             ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2018-12-13  5:40 UTC (permalink / raw)
  To: Florian Westphal, Eric Dumazet; +Cc: netdev



On 12/12/2018 12:52 PM, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> If you prefer that, ok, but I don't see why we can't unify them behind
>>> a single layer?
>>
>> Well, for a start we do not use nf_brifge or secpath.
> 
> Then the extension framework isn't built and the result
> is exactly the same as before these patches: helpers
> turn into empty inline stubs.
> 
> If its built, sk_buff size is reduced by up to one pointer,
> and the added one isn't initialised at allocation time.
> 
> sk_buff layout with the extension patches is almost same
> as with XFRM=n BRIDGE_NF=n (because nf_bridge and sp pointers
> get removed), there is only one additional pointer at the end,
> not inited at alloc time.  The other new member fills a hole.
> 
>> XDP is all about not unifying because unifying has a cost.
>>
>> Do we really want to slow down the stack just because MPTCP is coming ?
> 
> I don't want to slow down the stack.  All places that gain a helper call
> do so instead of the nf_bridge/secpath one.  They only expand into code
> if at least one of secpath or nf_bridge are built in.
> Both struct secpath and nf_bridge have their refcounts removed, so no
> additional reference counts are added.
> 
> If you still think this is proposal is a bad idea, ok, let me know
> and I will stop working on this.

Certainly not a bad idea.

> 
> MPTCP can then just add skb->cb2[], but neither nf_bridge nor secpath can
> use it.
> 
> If not, I will send another iteration that just allocates the entire
> extension space if first extension is added, it simplifies allocation
> handling a little.
> 

I am still unsure of the added extra costs, but for a start, TCP xmit clones
the skbs. 

Does it mean an additional pair of atomic operations if the skb comes from MPTCP enabled flow ?

The clone is used in the lower layers (IP , qdisc, device xmit), so I do not see why a clone
should share the extdata from the master.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13  5:40           ` Eric Dumazet
@ 2018-12-13  9:27             ` Florian Westphal
  2018-12-13 10:18               ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-13  9:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, netdev, cpaasch, peter.krystad, mathew.j.martineau

Eric Dumazet <eric.dumazet@gmail.com> wrote:

[ CC Christoph, Mathew, Peter ]

> > If not, I will send another iteration that just allocates the entire
> > extension space if first extension is added, it simplifies allocation
> > handling a little.
> > 
> 
> I am still unsure of the added extra costs, but for a start, TCP xmit clones
> the skbs. 

Yes.

> Does it mean an additional pair of atomic operations if the skb comes from MPTCP enabled flow ?

Depends.

If its going to be used as I expect, then the extension could be
discarded after the DSS mapping has been written to the tcp option
space, i.e. before cloning occurs.

So skb_clone would not cause a refcount increase, as the skb
does not have extension (anymore) by that time.

If its presevered because it has to be used again
after the cloning, then you are right, and we get refcont_inc() cost.

As for the freeing, this currently has an atomic op:

static inline void __skb_ext_put(struct skb_ext *ext)
{
	if (ext && refcount_dec_and_test(&ext->refcnt))
		__skb_ext_free(ext);
}

static inline void skb_ext_put(struct sk_buff *skb)
{
	if (skb->active_extensions)
		__skb_ext_put(skb->extensions);
}

I think this could be avoided by using same trick as
fclones->fclone_ref one in kfree_skbmem():

static inline void skb_ext_put(struct sk_buff *skb)
{
	if (skb->active_extensions)
		__skb_ext_put(skb->extensions);
}

void __skb_ext_put(struct skb_ext *ext)
{
	if (refcount_read(&ext->refcnt) == 1)
		goto free_now;

	if (!refcount_dec_and_test(&ext->refcnt))
		return;
free_now:
#ifdef CONFIG_XFRM
        if (__skb_ext_exist(ext, SKB_EXT_SEC_PATH))
		skb_ext_put_sp(skb_ext_get_ptr(ext, SKB_EXT_SEC_PATH));
#endif
	kfree(ext);
}

I think this could work, if refcount is 1, then there is no clone, i.e.
nothing else can inc/dec reference after the "== 1" check is true.

I will change it to above version.

> The clone is used in the lower layers (IP , qdisc, device xmit), so I do not see why a clone
> should share the extdata from the master.

For TCP, thats true.  But there are other places that could clone, e.g.
when bridge has to flood-forward.

At least in bridge case the 'preseve on clone' is needed, else required
information is missing from the cloned skb.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13  9:27             ` Florian Westphal
@ 2018-12-13 10:18               ` Eric Dumazet
  2018-12-13 10:39                 ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2018-12-13 10:18 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, cpaasch, peter.krystad, mathew.j.martineau



On 12/13/2018 01:27 AM, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> [ CC Christoph, Mathew, Peter ]
> 
>>> If not, I will send another iteration that just allocates the entire
>>> extension space if first extension is added, it simplifies allocation
>>> handling a little.
>>>
>>
>> I am still unsure of the added extra costs, but for a start, TCP xmit clones
>> the skbs. 
> 
> Yes.
> 
>> Does it mean an additional pair of atomic operations if the skb comes from MPTCP enabled flow ?
> 
> Depends.
> 
> If its going to be used as I expect, then the extension could be
> discarded after the DSS mapping has been written to the tcp option
> space, i.e. before cloning occurs.

I do not see how this would work, without also discarding on the master skb
the needed info.

> 
> So skb_clone would not cause a refcount increase, as the skb
> does not have extension (anymore) by that time.
> 
> If its presevered because it has to be used again
> after the cloning, then you are right, and we get refcont_inc() cost.
> 
> As for the freeing, this currently has an atomic op:
> 
> static inline void __skb_ext_put(struct skb_ext *ext)
> {
> 	if (ext && refcount_dec_and_test(&ext->refcnt))
> 		__skb_ext_free(ext);
> }
> 
> static inline void skb_ext_put(struct sk_buff *skb)
> {
> 	if (skb->active_extensions)
> 		__skb_ext_put(skb->extensions);
> }
> 
> I think this could be avoided by using same trick as
> fclones->fclone_ref one in kfree_skbmem():
> 
> static inline void skb_ext_put(struct sk_buff *skb)
> {
> 	if (skb->active_extensions)
> 		__skb_ext_put(skb->extensions);
> }
> 
> void __skb_ext_put(struct skb_ext *ext)
> {
> 	if (refcount_read(&ext->refcnt) == 1)
> 		goto free_now;
> 
> 	if (!refcount_dec_and_test(&ext->refcnt))
> 		return;
> free_now:
> #ifdef CONFIG_XFRM
>         if (__skb_ext_exist(ext, SKB_EXT_SEC_PATH))
> 		skb_ext_put_sp(skb_ext_get_ptr(ext, SKB_EXT_SEC_PATH));
> #endif
> 	kfree(ext);
> }
> 
> I think this could work, if refcount is 1, then there is no clone, i.e.
> nothing else can inc/dec reference after the "== 1" check is true.
> 
> I will change it to above version.
> 
>> The clone is used in the lower layers (IP , qdisc, device xmit), so I do not see why a clone
>> should share the extdata from the master.
> 
> For TCP, thats true.  But there are other places that could clone, e.g.
> when bridge has to flood-forward.
> 

So you propose a mechanism that forces a preserve on clone, base on existing needs
for bridging.

> At least in bridge case the 'preseve on clone' is needed, else required
> information is missing from the cloned skb.
> 

We need something where MPTCP info does not need to be propagated all the way to the NIC...

This skb extension is an incentive for adding more sticky things in the skbs
to violate layering of networking stacks :/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13 10:18               ` Eric Dumazet
@ 2018-12-13 10:39                 ` Florian Westphal
  2018-12-13 10:58                   ` Eric Dumazet
  2018-12-13 17:00                   ` Christoph Paasch
  0 siblings, 2 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-13 10:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, netdev, cpaasch, peter.krystad, mathew.j.martineau

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > If its going to be used as I expect, then the extension could be
> > discarded after the DSS mapping has been written to the tcp option
> > space, i.e. before cloning occurs.
> 
> I do not see how this would work, without also discarding on the master skb
> the needed info.

Ok, so lets assume this would result in one atomic_inc/dec due to clone
for now for skbs coming from mptcp socket.

But I don't see why this would have to be.

> > For TCP, thats true.  But there are other places that could clone, e.g.
> > when bridge has to flood-forward.
> > 
>
> So you propose a mechanism that forces a preserve on clone, base on existing needs
> for bridging.

secpath does the same thing:

static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
{
...
#ifdef CONFIG_XFRM
        new->sp                 = secpath_get(old->sp);
#endif
...

So I am not proposing anything new.

> > At least in bridge case the 'preseve on clone' is needed, else required
> > information is missing from the cloned skb.
> > 
> 
> We need something where MPTCP info does not need to be propagated all the way to the NIC...

Thats whats done in the MPTCP out-of-tree implementation, but I don't
think its needed.

It could just delete the extension before ->queue_xmit() AFAIU.

> This skb extension is an incentive for adding more sticky things in the skbs
> to violate layering of networking stacks :/

8-(

Where do you see "layering violations"?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13 10:39                 ` Florian Westphal
@ 2018-12-13 10:58                   ` Eric Dumazet
  2018-12-13 11:03                     ` Florian Westphal
  2018-12-13 17:00                   ` Christoph Paasch
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2018-12-13 10:58 UTC (permalink / raw)
  To: Florian Westphal, Eric Dumazet
  Cc: netdev, cpaasch, peter.krystad, mathew.j.martineau



On 12/13/2018 02:39 AM, Florian Westphal wrote:
>
> Thats whats done in the MPTCP out-of-tree implementation, but I don't
> think its needed.
> 
> It could just delete the extension before ->queue_xmit() AFAIU.

So, cloning would do an refcount_inc(), and deleting the extension would do an refcount_dec_and_test() ?

That is what I called an extra pair of atomic operations.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13 10:58                   ` Eric Dumazet
@ 2018-12-13 11:03                     ` Florian Westphal
  2018-12-13 11:16                       ` Eric Dumazet
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-13 11:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, netdev, cpaasch, peter.krystad, mathew.j.martineau

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> On 12/13/2018 02:39 AM, Florian Westphal wrote:
> >
> > Thats whats done in the MPTCP out-of-tree implementation, but I don't
> > think its needed.
> > 
> > It could just delete the extension before ->queue_xmit() AFAIU.
> 
> So, cloning would do an refcount_inc(), and deleting the extension would do an refcount_dec_and_test() ?
> 
> That is what I called an extra pair of atomic operations.

If it replaces 1:1 current mptcp skb->private out-of-tree storage, then
yes.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13 11:03                     ` Florian Westphal
@ 2018-12-13 11:16                       ` Eric Dumazet
  2018-12-13 11:44                         ` Florian Westphal
  0 siblings, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2018-12-13 11:16 UTC (permalink / raw)
  To: Florian Westphal, Eric Dumazet
  Cc: netdev, cpaasch, peter.krystad, mathew.j.martineau



On 12/13/2018 03:03 AM, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>> On 12/13/2018 02:39 AM, Florian Westphal wrote:
>>>
>>> Thats whats done in the MPTCP out-of-tree implementation, but I don't
>>> think its needed.
>>>
>>> It could just delete the extension before ->queue_xmit() AFAIU.
>>
>> So, cloning would do an refcount_inc(), and deleting the extension would do an refcount_dec_and_test() ?
>>
>> That is what I called an extra pair of atomic operations.
> 
> If it replaces 1:1 current mptcp skb->private out-of-tree storage, then
> yes.
> 

One day I will write a book on the number of atomic operations done on a TCP sendmsg() system call :/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13 11:16                       ` Eric Dumazet
@ 2018-12-13 11:44                         ` Florian Westphal
  0 siblings, 0 replies; 45+ messages in thread
From: Florian Westphal @ 2018-12-13 11:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Florian Westphal, netdev, cpaasch, peter.krystad, mathew.j.martineau

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 12/13/2018 03:03 AM, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> So, cloning would do an refcount_inc(), and deleting the extension would do an refcount_dec_and_test() ?
> >>
> >> That is what I called an extra pair of atomic operations.
> > 
> > If it replaces 1:1 current mptcp skb->private out-of-tree storage, then
> > yes.
> > 
> 
> One day I will write a book on the number of atomic operations done on a TCP sendmsg() system call :/

Just to clarify: mptcp skb->private out-of-tree storage is maintened
using atomic_ops too, so if its converted 1:1 nothing changes in the
*mptcp tree*.

If its possible to re-arrange the out-of-tree so that it doesn't need
that (only used while skb is not cloned), then the same applies to the
extension conversion.

So this, as far as *this series* is concerened, a "future problem" that
may or may not exist.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 02/13] sk_buff: add skb extension infrastructure
  2018-12-13 10:39                 ` Florian Westphal
  2018-12-13 10:58                   ` Eric Dumazet
@ 2018-12-13 17:00                   ` Christoph Paasch
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Paasch @ 2018-12-13 17:00 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Eric Dumazet, netdev, peter.krystad, mathew.j.martineau

On 13/12/18 - 11:39:18, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > If its going to be used as I expect, then the extension could be
> > > discarded after the DSS mapping has been written to the tcp option
> > > space, i.e. before cloning occurs.
> > 
> > I do not see how this would work, without also discarding on the master skb
> > the needed info.
> 
> Ok, so lets assume this would result in one atomic_inc/dec due to clone
> for now for skbs coming from mptcp socket.
> 
> But I don't see why this would have to be.
> 
> > > For TCP, thats true.  But there are other places that could clone, e.g.
> > > when bridge has to flood-forward.
> > > 
> >
> > So you propose a mechanism that forces a preserve on clone, base on existing needs
> > for bridging.
> 
> secpath does the same thing:
> 
> static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> {
> ...
> #ifdef CONFIG_XFRM
>         new->sp                 = secpath_get(old->sp);
> #endif
> ...
> 
> So I am not proposing anything new.
> 
> > > At least in bridge case the 'preseve on clone' is needed, else required
> > > information is missing from the cloned skb.
> > > 
> > 
> > We need something where MPTCP info does not need to be propagated all the way to the NIC...
> 
> Thats whats done in the MPTCP out-of-tree implementation, but I don't
> think its needed.

Yes, it indeed does not need to go all the way down to the NIC.

The info basically "just" needs to be propagated from the MPTCP-layer down
to the TCP-option space. Thus, it needs to remain on the skbs that are
sitting in the TCP-subflow's send-queue and rexmit tree as we need it when
retransmitting.

In tcp_transmit_skb, the clone is done at the beginning. Thus, we could for
example not inc the refcount on the clone and simply pass a pointer to the
original skb to tcp_established_options.

That way it the DSS option stays within the MPTCP/TCP layer and does not
make it down to the NIC.


Christoph

> 
> It could just delete the extension before ->queue_xmit() AFAIU.
> 
> > This skb extension is an incentive for adding more sticky things in the skbs
> > to violate layering of networking stacks :/
> 
> 8-(
> 
> Where do you see "layering violations"?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 0/13] sk_buff: add extension infrastructure
  2018-12-19 19:02 ` David Miller
@ 2018-12-19 19:47   ` David Miller
  0 siblings, 0 replies; 45+ messages in thread
From: David Miller @ 2018-12-19 19:47 UTC (permalink / raw)
  To: fw; +Cc: netdev

From: David Miller <davem@davemloft.net>
Date: Wed, 19 Dec 2018 11:02:22 -0800 (PST)

> I have no major issues with the approach, but I just want to understand all
> of the details before I apply this.
> 
> And honestly, if this turns out to be the wrong direction we can
> revert and try doing this another way.

Ok, I applied everything.

I have to say the patches that do the actual conversions (#3 and #13)
were so nice.  I'm really impressed with how you approached and solved
this problem Florian.

Seriously, nice work.

We'll just see what real impacts this does or does not have for
performance for anyone.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 0/13] sk_buff: add extension infrastructure
  2018-12-18 16:15 Florian Westphal
@ 2018-12-19 19:02 ` David Miller
  2018-12-19 19:47   ` David Miller
  0 siblings, 1 reply; 45+ messages in thread
From: David Miller @ 2018-12-19 19:02 UTC (permalink / raw)
  To: fw; +Cc: netdev

From: Florian Westphal <fw@strlen.de>
Date: Tue, 18 Dec 2018 17:15:14 +0100

> TL;DR:
>  - objdiff shows no change if CONFIG_XFRM=n && BR_NETFILTER=n
>  - small size reduction when one or both options are set
>  - no changes in ipsec performance
> 
>  Changes since v1:
>  - Allocate entire extension space from a kmem_cache.
>  - Avoid atomic_dec_and_test operation on skb_ext_put() for refcnt == 1 case.
>    (similar to kfree_skbmem() fclone_ref use).
> 
> This adds an optional extension infrastructure, with ispec (xfrm) and
> bridge netfilter as first users.
 ...

Hey Florian, I just wanted to let you know that I'm actively reviewing this.

I have no major issues with the approach, but I just want to understand all
of the details before I apply this.

And honestly, if this turns out to be the wrong direction we can
revert and try doing this another way.

Anyways, just FYI...

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH net-next 0/13] sk_buff: add extension infrastructure
@ 2018-12-18 16:15 Florian Westphal
  2018-12-19 19:02 ` David Miller
  0 siblings, 1 reply; 45+ messages in thread
From: Florian Westphal @ 2018-12-18 16:15 UTC (permalink / raw)
  To: netdev

TL;DR:
 - objdiff shows no change if CONFIG_XFRM=n && BR_NETFILTER=n
 - small size reduction when one or both options are set
 - no changes in ipsec performance

 Changes since v1:
 - Allocate entire extension space from a kmem_cache.
 - Avoid atomic_dec_and_test operation on skb_ext_put() for refcnt == 1 case.
   (similar to kfree_skbmem() fclone_ref use).

This adds an optional extension infrastructure, with ispec (xfrm) and
bridge netfilter as first users.

The third (future) user is Multipath TCP which is still out-of-tree.
MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
numbers used by individual subflows.

This DSS mapping is read/written from tcp option space on receive and
written to tcp option space on transmitted tcp packets that are part of
and MPTCP connection.

Extending skb_shared_info or adding a private data field to skb fclones
doesn't work for incoming skb, so a different DSS propagation method would
be required for the receive side.

mptcp has same requirements as secpath/bridge netfilter:

1. extension memory is released when the sk_buff is free'd.
2. data is shared after cloning an skb (clone inherits extension)
3. adding extension to an skb will COW the extension buffer if needed.

Two new members are added to sk_buff:
1. 'active_extensions' byte (filling a hole), telling which extensions
   are available for this skb.
   This has two purposes.
   a) avoids the need to initialize the pointer.
   b) allows to "delete" an extension by clearing its bit
   value in ->active_extensions.

   While it would be possible to store the active_extensions byte
   in the extension struct instead of sk_buff, there is one problem
   with this:
    When an extension has to be disabled, we can always clear the
    bit in skb->active_extensions.  But in case it would be stored in the
    extension buffer itself, we might have to COW it first, if
    we are dealing with a cloned skb.  On kmalloc failure we would
    be unable to turn an extension off.
2. extension pointer, located at the end of the sk_buff.
   If the active_extensions byte is 0, the pointer is undefined,
   it is not initialized on skb allocation.

This adds extra code to skb clone and free paths (to deal with
refcount/free of extension area) but this replaces similar code that
manages skb->nf_bridge and skb->sp structs in the followup patches of
the series.

It is possible to add support for extensions that are not preseved on
clones/copies:

1. define a bitmask of all extensions that need copy/cow on clone
2. change __skb_ext_copy() to check
   ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE
3. set clone->active_extensions to 0 if test is false.

This isn't done here because all extensions that get added here
need the copy/cow semantics.

Last patch converts skb->sp, secpath information gets stored as
new SKB_EXT_SEC_PATH, so the 'sp' pointer is removed from skbuff.

Extra code added to skb clone and free paths (to deal with refcount/free
of extension area) replaces the existing code that does the same for
skb->nf_bridge and skb->secpath.

I don't see any other in-tree users that could benefit from this
infrastructure, it doesn't make sense to add an extension just for the sake
of a single flag bit (like skb->nf_trace).

Adding a new extension is a good fit if all of the following are true:

1. Data is related to the skb/packet aggregate
2. Data should be freed when the skb is free'd
3. Data is not going to be relevant/needed in normal case (udp, tcp,
   forwarding workloads, ...)
4. There are no fancy action(s) needed on clone/free, such as callbacks
   into kernel modules.

Florian Westphal (13):
      netfilter: avoid using skb->nf_bridge directly
      sk_buff: add skb extension infrastructure
      net: convert bridge_nf to use skb extension infrastructure
      xfrm: change secpath_set to return secpath struct, not error value
      net: move secpath_exist helper to sk_buff.h
      net: use skb_sec_path helper in more places
      drivers: net: intel: use secpath helpers in more places
      drivers: net: ethernet: mellanox: use skb_sec_path helper
      drivers: net: netdevsim: use skb_sec_path helper
      xfrm: use secpath_exist where applicable
      drivers: chelsio: use skb_sec_path helper
      xfrm: prefer secpath_set over secpath_dup
      net: switch secpath to use skb extension infrastructure

 Documentation/networking/xfrm_device.txt                      |    7 
 drivers/crypto/chelsio/chcr_ipsec.c                           |    4 
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c                |   15 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c                 |    5 
 drivers/net/ethernet/intel/ixgbevf/ipsec.c                    |   15 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c             |    2 
 drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_rxtx.c |   19 
 drivers/net/netdevsim/ipsec.c                                 |    7 
 include/linux/netfilter_bridge.h                              |   33 +
 include/linux/skbuff.h                                        |  152 ++++++-
 include/net/netfilter/br_netfilter.h                          |   14 
 include/net/xfrm.h                                            |   41 --
 net/Kconfig                                                   |    4 
 net/bridge/br_netfilter_hooks.c                               |   39 -
 net/bridge/br_netfilter_ipv6.c                                |    4 
 net/core/skbuff.c                                             |  201 +++++++++-
 net/ipv4/esp4.c                                               |    9 
 net/ipv4/esp4_offload.c                                       |   15 
 net/ipv4/ip_output.c                                          |    1 
 net/ipv4/netfilter/nf_reject_ipv4.c                           |    6 
 net/ipv6/esp6.c                                               |    9 
 net/ipv6/esp6_offload.c                                       |   15 
 net/ipv6/ip6_output.c                                         |    1 
 net/ipv6/netfilter/nf_reject_ipv6.c                           |   10 
 net/ipv6/xfrm6_input.c                                        |    8 
 net/netfilter/nf_log_common.c                                 |   20 
 net/netfilter/nf_queue.c                                      |   50 +-
 net/netfilter/nfnetlink_queue.c                               |   23 -
 net/netfilter/nft_meta.c                                      |    2 
 net/netfilter/nft_xfrm.c                                      |    2 
 net/netfilter/xt_physdev.c                                    |    2 
 net/netfilter/xt_policy.c                                     |    2 
 net/xfrm/Kconfig                                              |    1 
 net/xfrm/xfrm_device.c                                        |    4 
 net/xfrm/xfrm_input.c                                         |   76 +--
 net/xfrm/xfrm_interface.c                                     |    2 
 net/xfrm/xfrm_output.c                                        |    7 
 net/xfrm/xfrm_policy.c                                        |   19 
 security/selinux/xfrm.c                                       |    4 
 39 files changed, 564 insertions(+), 286 deletions(-)

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2018-12-19 19:47 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-10 14:49 [PATCH net-next 0/13] sk_buff: add extension infrastructure Florian Westphal
2018-12-10 14:49 ` [PATCH net-next 01/13] netfilter: avoid using skb->nf_bridge directly Florian Westphal
2018-12-10 14:49 ` [PATCH net-next 02/13] sk_buff: add skb extension infrastructure Florian Westphal
     [not found]   ` <CAPUCuiADYwjY4kpq76-w9BKL3uiRvNjnmzKG29mCrb=b8YeesA@mail.gmail.com>
2018-12-12  0:07     ` Mat Martineau
2018-12-12  0:11       ` Florian Westphal
2018-12-12 11:59         ` Florian Westphal
2018-12-12 16:59           ` Mat Martineau
2018-12-12 14:44   ` Willem de Bruijn
2018-12-12 15:40     ` Florian Westphal
2018-12-12 15:45       ` Willem de Bruijn
2018-12-12 17:23   ` Eric Dumazet
2018-12-12 18:44     ` Florian Westphal
2018-12-12 20:17       ` Eric Dumazet
2018-12-12 20:52         ` Florian Westphal
2018-12-13  5:40           ` Eric Dumazet
2018-12-13  9:27             ` Florian Westphal
2018-12-13 10:18               ` Eric Dumazet
2018-12-13 10:39                 ` Florian Westphal
2018-12-13 10:58                   ` Eric Dumazet
2018-12-13 11:03                     ` Florian Westphal
2018-12-13 11:16                       ` Eric Dumazet
2018-12-13 11:44                         ` Florian Westphal
2018-12-13 17:00                   ` Christoph Paasch
2018-12-12 18:16   ` Stephen Suryaputra
2018-12-12 18:38     ` Florian Westphal
2018-12-13  0:38     ` David Miller
2018-12-10 14:49 ` [PATCH net-next 03/13] net: convert bridge_nf to use " Florian Westphal
2018-12-10 14:49 ` [PATCH net-next 04/13] xfrm: change secpath_set to return secpath struct, not error value Florian Westphal
2018-12-10 14:49 ` [PATCH net-next 05/13] net: move secpath_exist helper to sk_buff.h Florian Westphal
2018-12-10 14:49 ` [PATCH net-next 06/13] net: use skb_sec_path helper in more places Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 07/13] drivers: net: intel: use secpath helpers " Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 08/13] drivers: net: ethernet: mellanox: use skb_sec_path helper Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 09/13] drivers: net: netdevsim: " Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 10/13] xfrm: use secpath_exist where applicable Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 11/13] drivers: chelsio: use skb_sec_path helper Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 12/13] xfrm: prefer secpath_set over secpath_dup Florian Westphal
2018-12-10 14:50 ` [PATCH net-next 13/13] net: switch secpath to use skb extension infrastructure Florian Westphal
2018-12-11  8:06   ` Steffen Klassert
2018-12-11 10:18     ` Florian Westphal
2018-12-11 10:20       ` Steffen Klassert
2018-12-12 11:52         ` Florian Westphal
2018-12-13  4:08 ` [PATCH net-next 0/13] sk_buff: add " Shannon Nelson
2018-12-18 16:15 Florian Westphal
2018-12-19 19:02 ` David Miller
2018-12-19 19:47   ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.