* [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
@ 2015-05-23  3:55 Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 01/11] ipv6: Clean up ipv6_select_ident() and ip6_fragment() Martin KaFai Lau
                   ` (12 more replies)
  0 siblings, 13 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:55 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

v4 -> v5:
- Patch 1 is new.  It cleans up ipv6_select_ident() and ip6_fragment().

- Further simplify the newly added rt6_get_pcpu_route().  If there is a
  'prev' after cmpxchg, return prev instead of the newly created percpu
  clone.
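
A minimal userspace sketch of the "return prev after cmpxchg" pattern
described above, using C11 atomics in place of the kernel's cmpxchg().
The names (struct route, slot_install) are hypothetical and are not taken
from the patch, and freeing the losing entry is an assumption of this
sketch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a route entry; not the kernel's struct rt6_info. */
struct route {
	int id;
};

/*
 * Try to publish 'fresh' into an empty slot.  If the slot already holds
 * an entry ('prev' is non-NULL after the compare-exchange), discard
 * 'fresh' and return the entry that is already there.
 */
static struct route *slot_install(_Atomic(struct route *) *slot,
				  struct route *fresh)
{
	struct route *prev = NULL;

	assert(fresh != NULL);

	if (atomic_compare_exchange_strong(slot, &prev, fresh))
		return fresh;	/* slot was empty; our entry is now published */

	free(fresh);		/* assumption: the loser is simply freed */
	return prev;		/* the failed exchange stored the current value */
}
```

Every caller ends up holding the same published entry, whichever caller
installed it first.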

v3 -> v4:
- Patch 8 is new. It keeps track of the DST_NOCACHE routes in a list to handle
  the iface down/unregister event.

- Remove rcu from the newly added rt6i_pcpu variable.  It is not needed
  because it has already been protected by the existing reader/writer lock.

- Thanks to 'Julian Anastasov <ja@ssi.bg>' for testing the FLOWI_FLAG_KNOWN_NH
  patches.

v2 -> v3:
- Patches 5 to 7 are new.  They take care of cases where the daddr in
  the skb is not the one used to do the route look-up.  There are also
  related changes to rt6_nexthop() since v2, which are in patch 2/9.
  Thanks to 'Julian Anastasov <ja@ssi.bg>' for pointing it out.

- Fix a few problems in __ip6_rt_update_pmtu(), such as setting the expiry
  and mtu before inserting into the tree, and not calling dst_destroy()
  after a tree insertion failure.  Also update the rt6i_pmtu in
  fib6_add_rt2node().
  Thanks to 'Steffen Klassert <steffen.klassert@secunet.com>' for pointing
  it out.

- Merge ip6_pmtu_rt_cache_alloc() into ip6_rt_cache_alloc().

v1 -> v2:
- Move the /128 route bug fixes to another series (accepted).
- Create a function for checking (rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)).
- Avoid shuffling the skb network_header.  Instead, change the function
  signature to take iph instead of skb.

- Many thanks to 'Hannes Frederic Sowa <hannes@stressinduktion.org>' for
  reviewing v1 and v2 and giving advice.

--Martin

~~~ start: v1 compose message (with the out-dated parts removed) ~~~

This series is to avoid creating a RTF_CACHE route whenever we are consulting
the fib6 tree with a new destination.  Instead, only create RTF_CACHE route
when we see a pmtu exception.
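
As a rough illustration of the idea (not the kernel data structures), the
lookup path can stay allocation-free and a per-destination entry can be
created only when a smaller MTU is reported.  Everything below is a toy
model: the 32-bit destination key, the fixed-size array, and the function
names are all assumptions made for the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EXCEPTIONS 8

/* Toy per-destination exception; a 32-bit key stands in for an IPv6 address. */
struct exception_sketch {
	uint32_t daddr;
	uint32_t pmtu;
};

struct table_sketch {
	uint32_t default_mtu;	/* MTU of the shared (non-cached) route */
	struct exception_sketch exc[MAX_EXCEPTIONS];
	size_t nexc;
};

/* Lookup never creates an entry: exception MTU if present, else the default. */
static uint32_t lookup_mtu(const struct table_sketch *t, uint32_t daddr)
{
	for (size_t i = 0; i < t->nexc; i++)
		if (t->exc[i].daddr == daddr)
			return t->exc[i].pmtu;
	return t->default_mtu;
}

/*
 * Called when a smaller path MTU is reported for 'daddr'.
 * Returns 0 if an exception was created or updated, -1 if the report
 * was ignored (not smaller) or the toy table is full.
 */
static int report_pmtu(struct table_sketch *t, uint32_t daddr, uint32_t mtu)
{
	assert(t != NULL);

	if (mtu >= lookup_mtu(t, daddr))
		return -1;

	for (size_t i = 0; i < t->nexc; i++) {
		if (t->exc[i].daddr == daddr) {
			t->exc[i].pmtu = mtu;
			return 0;
		}
	}

	if (t->nexc == MAX_EXCEPTIONS)
		return -1;

	t->exc[t->nexc].daddr = daddr;
	t->exc[t->nexc].pmtu = mtu;
	t->nexc++;
	return 0;
}
```

Destinations that never see a pmtu exception never add an entry, so the
table stays small.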

Out of all ipv6 RTF_CACHE routes that are created, the percentage that has a
different mtu is very small. In one of our end-user facing proxy servers,
only 1k out of 80k RTF_CACHE routes have a smaller MTU.  For our DC
traffic, there is no mtu exception.

A large fib6 tree causes problems: for example, 'ip -6 r show' takes a long
time and gc may kick in too often.  Also, when a service has restarted and a
lot of new TCP conn requests come in, it creates pressure on the tree by
inserting a lot of RTF_CACHE routes in a short time, and it currently requires
a write lock to do that.

The first few patches are prep work to remove the assumption that the
returned rt is always RTF_CACHE.

The patch 'ipv6: Only create RTF_CACHE routes after encountering pmtu exception'
does the lazy RTF_CACHE route creation.

The following patches add a percpu rt to compensate for the performance loss
after doing the RTF_CACHE lazy creation.

Here are some numbers from the udpflood test.  The udpflood has been
slightly modified to have a time limit instead of a count limit.

A /64 via gateway route is used for the test. Each udpflood uses 10000 dst
addresses.  The dst addresses of different udpflood processes do not overlap
with each other.

# of udpflood        # of trans (patched)        # of trans (upstream)

1                    16M                          15M
10                   61M                          61M
20                   65M                          62M
40                   88M                          83M

~~~ end: v1 compose message ~~~

* [PATCH net-next v5 01/11] ipv6: Clean up ipv6_select_ident() and ip6_fragment()
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
@ 2015-05-23  3:55 ` Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 02/11] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:55 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

This patch changes the ipv6_select_ident() signature to return a
fragment id instead of taking a whole frag_hdr as a param just to
set frag_hdr->identification.

It also cleans up ip6_fragment() to obtain the fragment id at the
beginning instead of using multiple "if" checks later to see whether
the fragment id has been generated.
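
A small userspace sketch of the resulting shape: the id is obtained once,
up front, and then stamped into every fragment header.  The types and
helpers here (frag_hdr_sketch, select_ident_sketch, fragment_sketch) are
hypothetical, and the counter-based id is only a placeholder for the real
id selection:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified fragment header; not the kernel's struct frag_hdr. */
struct frag_hdr_sketch {
	uint32_t identification;
	size_t offset;
};

/* Placeholder id source: returns a value instead of filling in a header. */
static uint32_t next_id = 1000;

static uint32_t select_ident_sketch(void)
{
	return next_id++;
}

/*
 * Split 'len' bytes into chunks of at most 'chunk' bytes and describe
 * them in 'out'.  Returns the number of headers written.
 */
static size_t fragment_sketch(size_t len, size_t chunk,
			      struct frag_hdr_sketch *out, size_t max)
{
	/* Obtain the id once, before any fast-path/slow-path split. */
	uint32_t frag_id = select_ident_sketch();
	size_t n = 0;

	assert(chunk > 0);

	for (size_t off = 0; off < len && n < max; off += chunk, n++) {
		out[n].identification = frag_id;
		out[n].offset = off;
	}
	return n;
}
```

With the id known before the loop, no later branch has to ask whether an
id was already generated.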

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ipv6.h     |  3 +--
 net/ipv6/ip6_output.c  | 17 ++++++-----------
 net/ipv6/output_core.c |  5 ++---
 3 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index aab8190..8c4f881 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -671,8 +671,7 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, const struct in6_add
 	return __ipv6_addr_diff(a1, a2, sizeof(struct in6_addr));
 }
 
-void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
-		       struct rt6_info *rt);
+u32 ipv6_select_ident(struct net *net, struct rt6_info *rt);
 void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb);
 
 int ip6_dst_hoplimit(struct dst_entry *dst);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c217775..e4772ab 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -551,7 +551,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	struct frag_hdr *fh;
 	unsigned int mtu, hlen, left, len;
 	int hroom, troom;
-	__be32 frag_id = 0;
+	__be32 frag_id;
 	int ptr, offset = 0, err = 0;
 	u8 *prevhdr, nexthdr = 0;
 	struct net *net = dev_net(skb_dst(skb)->dev);
@@ -584,6 +584,8 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	}
 	mtu -= hlen + sizeof(struct frag_hdr);
 
+	frag_id = ipv6_select_ident(net, rt);
+
 	if (skb_has_frag_list(skb)) {
 		int first_len = skb_pagelen(skb);
 		struct sk_buff *frag2;
@@ -632,11 +634,10 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 		skb_reset_network_header(skb);
 		memcpy(skb_network_header(skb), tmp_hdr, hlen);
 
-		ipv6_select_ident(net, fh, rt);
 		fh->nexthdr = nexthdr;
 		fh->reserved = 0;
 		fh->frag_off = htons(IP6_MF);
-		frag_id = fh->identification;
+		fh->identification = frag_id;
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
@@ -778,11 +779,7 @@ slow_path:
 		 */
 		fh->nexthdr = nexthdr;
 		fh->reserved = 0;
-		if (!frag_id) {
-			ipv6_select_ident(net, fh, rt);
-			frag_id = fh->identification;
-		} else
-			fh->identification = frag_id;
+		fh->identification = frag_id;
 
 		/*
 		 *	Copy a block of the IP datagram.
@@ -1064,7 +1061,6 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 
 {
 	struct sk_buff *skb;
-	struct frag_hdr fhdr;
 	int err;
 
 	/* There is support for UDP large send offload by network
@@ -1106,8 +1102,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 	skb_shinfo(skb)->gso_size = (mtu - fragheaderlen -
 				     sizeof(struct frag_hdr)) & ~7;
 	skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-	ipv6_select_ident(sock_net(sk), &fhdr, rt);
-	skb_shinfo(skb)->ip6_frag_id = fhdr.identification;
+	skb_shinfo(skb)->ip6_frag_id = ipv6_select_ident(sock_net(sk), rt);
 
 append:
 	return skb_append_datato_frags(sk, skb, getfrag, from,
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 85892af..ef0e232 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -60,8 +60,7 @@ void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident);
 
-void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
-		       struct rt6_info *rt)
+u32 ipv6_select_ident(struct net *net, struct rt6_info *rt)
 {
 	static u32 ip6_idents_hashrnd __read_mostly;
 	u32 id;
@@ -70,7 +69,7 @@ void ipv6_select_ident(struct net *net, struct frag_hdr *fhdr,
 
 	id = __ipv6_select_ident(net, ip6_idents_hashrnd, &rt->rt6i_dst.addr,
 				 &rt->rt6i_src.addr);
-	fhdr->identification = htonl(id);
+	return htonl(id);
 }
 EXPORT_SYMBOL(ipv6_select_ident);
 
-- 
1.8.1

* [PATCH net-next v5 02/11] ipv6: Remove external dependency on rt6i_dst and rt6i_src
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 01/11] ipv6: Clean up ipv6_select_ident() and ip6_fragment() Martin KaFai Lau
@ 2015-05-23  3:55 ` Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 03/11] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:55 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

This patch removes the assumptions that the returned rt is always
a RTF_CACHE entry with the rt6i_dst and rt6i_src containing the
destination and source address.  The dst and src can be recovered from
the calling site.

We may consider renaming (rt6i_dst, rt6i_src) to
(rt6i_key_dst, rt6i_key_src) later.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 drivers/scsi/cxgbi/libcxgbi.c   |  2 +-
 include/net/ipv6.h              |  4 +++-
 net/ipv6/icmp.c                 |  2 +-
 net/ipv6/ip6_output.c           | 13 ++++++++-----
 net/ipv6/ndisc.c                |  2 +-
 net/ipv6/output_core.c          | 10 ++++++----
 net/ipv6/tcp_ipv6.c             |  2 +-
 net/netfilter/ipvs/ip_vs_xmit.c |  4 ++--
 net/sctp/ipv6.c                 |  3 ++-
 9 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
index eb58afc..45d3039 100644
--- a/drivers/scsi/cxgbi/libcxgbi.c
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -728,7 +728,7 @@ static struct cxgbi_sock *cxgbi_check_route6(struct sockaddr *dst_addr)
 	}
 	ndev = n->dev;
 
-	if (ipv6_addr_is_multicast(&rt->rt6i_dst.addr)) {
+	if (ipv6_addr_is_multicast(&daddr6->sin6_addr)) {
 		pr_info("multi-cast route %pI6 port %u, dev %s.\n",
 			daddr6->sin6_addr.s6_addr,
 			ntohs(daddr6->sin6_port), ndev->name);
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 8c4f881..b950a20 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -671,7 +671,9 @@ static inline int ipv6_addr_diff(const struct in6_addr *a1, const struct in6_add
 	return __ipv6_addr_diff(a1, a2, sizeof(struct in6_addr));
 }
 
-u32 ipv6_select_ident(struct net *net, struct rt6_info *rt);
+u32 ipv6_select_ident(struct net *net,
+		      const struct in6_addr *daddr,
+		      const struct in6_addr *saddr);
 void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb);
 
 int ip6_dst_hoplimit(struct dst_entry *dst);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 2c2b5d5..24b359d 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -207,7 +207,7 @@ static bool icmpv6_xrlim_allow(struct sock *sk, u8 type,
 			struct inet_peer *peer;
 
 			peer = inet_getpeer_v6(net->ipv6.peers,
-					       &rt->rt6i_dst.addr, 1);
+					       &fl6->daddr, 1);
 			res = inet_peer_xrlim_allow(peer, tmo);
 			if (peer)
 				inet_putpeer(peer);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index e4772ab..1b0e3ad 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -459,7 +459,7 @@ int ip6_forward(struct sk_buff *skb)
 		else
 			target = &hdr->daddr;
 
-		peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, 1);
+		peer = inet_getpeer_v6(net->ipv6.peers, &hdr->daddr, 1);
 
 		/* Limit redirects both by destination (here)
 		   and by source (inside ndisc_send_redirect)
@@ -584,7 +584,8 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	}
 	mtu -= hlen + sizeof(struct frag_hdr);
 
-	frag_id = ipv6_select_ident(net, rt);
+	frag_id = ipv6_select_ident(net, &ipv6_hdr(skb)->daddr,
+				    &ipv6_hdr(skb)->saddr);
 
 	if (skb_has_frag_list(skb)) {
 		int first_len = skb_pagelen(skb);
@@ -1057,7 +1058,7 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 			int odd, struct sk_buff *skb),
 			void *from, int length, int hh_len, int fragheaderlen,
 			int transhdrlen, int mtu, unsigned int flags,
-			struct rt6_info *rt)
+			const struct flowi6 *fl6)
 
 {
 	struct sk_buff *skb;
@@ -1102,7 +1103,9 @@ static inline int ip6_ufo_append_data(struct sock *sk,
 	skb_shinfo(skb)->gso_size = (mtu - fragheaderlen -
 				     sizeof(struct frag_hdr)) & ~7;
 	skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
-	skb_shinfo(skb)->ip6_frag_id = ipv6_select_ident(sock_net(sk), rt);
+	skb_shinfo(skb)->ip6_frag_id = ipv6_select_ident(sock_net(sk),
+							 &fl6->daddr,
+							 &fl6->saddr);
 
 append:
 	return skb_append_datato_frags(sk, skb, getfrag, from,
@@ -1325,7 +1328,7 @@ emsgsize:
 	    (sk->sk_type == SOCK_DGRAM)) {
 		err = ip6_ufo_append_data(sk, queue, getfrag, from, length,
 					  hh_len, fragheaderlen,
-					  transhdrlen, mtu, flags, rt);
+					  transhdrlen, mtu, flags, fl6);
 		if (err)
 			goto error;
 		return 0;
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 96f153c..0a05b35 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1506,7 +1506,7 @@ void ndisc_send_redirect(struct sk_buff *skb, const struct in6_addr *target)
 			  "Redirect: destination is not a neighbour\n");
 		goto release;
 	}
-	peer = inet_getpeer_v6(net->ipv6.peers, &rt->rt6i_dst.addr, 1);
+	peer = inet_getpeer_v6(net->ipv6.peers, &ipv6_hdr(skb)->saddr, 1);
 	ret = inet_peer_xrlim_allow(peer, 1*HZ);
 	if (peer)
 		inet_putpeer(peer);
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index ef0e232..055e85c 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -10,7 +10,8 @@
 #include <net/secure_seq.h>
 
 static u32 __ipv6_select_ident(struct net *net, u32 hashrnd,
-			       struct in6_addr *dst, struct in6_addr *src)
+			       const struct in6_addr *dst,
+			       const struct in6_addr *src)
 {
 	u32 hash, id;
 
@@ -60,15 +61,16 @@ void ipv6_proxy_select_ident(struct net *net, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(ipv6_proxy_select_ident);
 
-u32 ipv6_select_ident(struct net *net, struct rt6_info *rt)
+u32 ipv6_select_ident(struct net *net,
+		      const struct in6_addr *daddr,
+		      const struct in6_addr *saddr)
 {
 	static u32 ip6_idents_hashrnd __read_mostly;
 	u32 id;
 
 	net_get_random_once(&ip6_idents_hashrnd, sizeof(ip6_idents_hashrnd));
 
-	id = __ipv6_select_ident(net, ip6_idents_hashrnd, &rt->rt6i_dst.addr,
-				 &rt->rt6i_src.addr);
+	id = __ipv6_select_ident(net, ip6_idents_hashrnd, daddr, saddr);
 	return htonl(id);
 }
 EXPORT_SYMBOL(ipv6_select_ident);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index beac6bf..2275999 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -262,7 +262,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	rt = (struct rt6_info *) dst;
 	if (tcp_death_row.sysctl_tw_recycle &&
 	    !tp->rx_opt.ts_recent_stamp &&
-	    ipv6_addr_equal(&rt->rt6i_dst.addr, &sk->sk_v6_daddr))
+	    ipv6_addr_equal(&fl6.daddr, &sk->sk_v6_daddr))
 		tcp_fetch_timewait_stamp(sk, dst);
 
 	icsk->icsk_ext_hdr_len = 0;
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 19986ec..38f8627 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -781,7 +781,7 @@ ip_vs_nat_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	/* From world but DNAT to loopback address? */
 	if (local && skb->dev && !(skb->dev->flags & IFF_LOOPBACK) &&
-	    ipv6_addr_type(&rt->rt6i_dst.addr) & IPV6_ADDR_LOOPBACK) {
+	    ipv6_addr_type(&cp->daddr.in6) & IPV6_ADDR_LOOPBACK) {
 		IP_VS_DBG_RL_PKT(1, AF_INET6, pp, skb, 0,
 				 "ip_vs_nat_xmit_v6(): "
 				 "stopping DNAT to loopback address");
@@ -1346,7 +1346,7 @@ ip_vs_icmp_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 
 	/* From world but DNAT to loopback address? */
 	if (local && skb->dev && !(skb->dev->flags & IFF_LOOPBACK) &&
-	    ipv6_addr_type(&rt->rt6i_dst.addr) & IPV6_ADDR_LOOPBACK) {
+	    ipv6_addr_type(&cp->daddr.in6) & IPV6_ADDR_LOOPBACK) {
 		IP_VS_DBG(1, "%s(): "
 			  "stopping DNAT to loopback %pI6\n",
 			  __func__, &cp->daddr.in6);
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index e703ff7..17a0120 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -332,7 +332,8 @@ out:
 		rt = (struct rt6_info *)dst;
 		t->dst = dst;
 		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
-		pr_debug("rt6_dst:%pI6 rt6_src:%pI6\n", &rt->rt6i_dst.addr,
+		pr_debug("rt6_dst:%pI6/%d rt6_src:%pI6\n",
+			 &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			 &fl6->saddr);
 	} else {
 		t->dst = NULL;
-- 
1.8.1

* [PATCH net-next v5 03/11] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 01/11] ipv6: Clean up ipv6_select_ident() and ip6_fragment() Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 02/11] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
@ 2015-05-23  3:55 ` Martin KaFai Lau
  2015-05-23  3:55 ` [PATCH net-next v5 04/11] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:55 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

When creating a RTF_CACHE route, RTF_ANYCAST is set based on rt6i_dst.
Also, rt6i_gateway is always set to the nexthop while the nexthop
could be a gateway or the rt6i_dst.addr.

After removing the rt6i_dst and rt6i_src dependency in the last patch,
we also need to stop the caller from depending on rt6i_gateway and
RTF_ANYCAST.
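
The two helpers changed below can be modelled in userspace roughly as
follows.  This is a sketch only: the struct layout and the SK_* flag bits
are hypothetical stand-ins, not the kernel's definitions, and memcmp()
takes the place of ipv6_addr_equal():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits for illustration; not the kernel's RTF_* values. */
enum {
	SK_GATEWAY = 1u << 0,
	SK_CACHE   = 1u << 1,
	SK_ANYCAST = 1u << 2,
};

struct addr6 {
	uint8_t b[16];
};

/* Simplified route: only the fields the two helpers look at. */
struct route_sketch {
	uint32_t flags;
	struct addr6 gateway;
	struct addr6 dst;
	int dst_plen;
};

/* Nexthop: the gateway if there is one, the cached host address for a
 * cache entry, otherwise the destination supplied by the caller. */
static const struct addr6 *nexthop_sketch(const struct route_sketch *rt,
					  const struct addr6 *daddr)
{
	assert(rt != NULL && daddr != NULL);

	if (rt->flags & SK_GATEWAY)
		return &rt->gateway;
	else if (rt->flags & SK_CACHE)
		return &rt->dst;
	else
		return daddr;
}

/* Anycast: flagged as such, or daddr equals the address of a non-/128 prefix. */
static bool anycast_sketch(const struct route_sketch *rt,
			   const struct addr6 *daddr)
{
	return (rt->flags & SK_ANYCAST) ||
	       (rt->dst_plen != 128 &&
		memcmp(&rt->dst, daddr, sizeof(*daddr)) == 0);
}
```

In both helpers the caller passes in the destination address rather than
relying on the route entry to carry it.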

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip6_route.h                | 19 ++++++++++++++-----
 net/bluetooth/6lowpan.c                |  2 +-
 net/ipv6/icmp.c                        |  4 ++--
 net/ipv6/ip6_output.c                  |  5 +++--
 net/ipv6/route.c                       |  6 +-----
 net/netfilter/nf_conntrack_h323_main.c |  4 ++--
 net/netfilter/xt_addrtype.c            |  2 +-
 7 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 5e19206..4caf7d6 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -163,11 +163,14 @@ static inline bool ipv6_unicast_destination(const struct sk_buff *skb)
 	return rt->rt6i_flags & RTF_LOCAL;
 }
 
-static inline bool ipv6_anycast_destination(const struct sk_buff *skb)
+static inline bool ipv6_anycast_destination(const struct dst_entry *dst,
+					    const struct in6_addr *daddr)
 {
-	struct rt6_info *rt = (struct rt6_info *) skb_dst(skb);
+	struct rt6_info *rt = (struct rt6_info *)dst;
 
-	return rt->rt6i_flags & RTF_ANYCAST;
+	return rt->rt6i_flags & RTF_ANYCAST ||
+		(rt->rt6i_dst.plen != 128 &&
+		 ipv6_addr_equal(&rt->rt6i_dst.addr, daddr));
 }
 
 int ip6_fragment(struct sock *sk, struct sk_buff *skb,
@@ -194,9 +197,15 @@ static inline bool ip6_sk_ignore_df(const struct sock *sk)
 	       inet6_sk(sk)->pmtudisc == IPV6_PMTUDISC_OMIT;
 }
 
-static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt)
+static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt,
+					   struct in6_addr *daddr)
 {
-	return &rt->rt6i_gateway;
+	if (rt->rt6i_flags & RTF_GATEWAY)
+		return &rt->rt6i_gateway;
+	else if (rt->rt6i_flags & RTF_CACHE)
+		return &rt->rt6i_dst.addr;
+	else
+		return daddr;
 }
 
 #endif
diff --git a/net/bluetooth/6lowpan.c b/net/bluetooth/6lowpan.c
index 1742b84..f3d6046 100644
--- a/net/bluetooth/6lowpan.c
+++ b/net/bluetooth/6lowpan.c
@@ -192,7 +192,7 @@ static inline struct lowpan_peer *peer_lookup_dst(struct lowpan_dev *dev,
 		if (ipv6_addr_any(nexthop))
 			return NULL;
 	} else {
-		nexthop = rt6_nexthop(rt);
+		nexthop = rt6_nexthop(rt, daddr);
 
 		/* We need to remember the address because it is needed
 		 * by bt_xmit() when sending the packet. In bt_xmit(), the
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 24b359d..713d743 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -337,7 +337,7 @@ static struct dst_entry *icmpv6_route_lookup(struct net *net,
 	 * We won't send icmp if the destination is known
 	 * anycast.
 	 */
-	if (((struct rt6_info *)dst)->rt6i_flags & RTF_ANYCAST) {
+	if (ipv6_anycast_destination(dst, &fl6->daddr)) {
 		net_dbg_ratelimited("icmp6_send: acast source\n");
 		dst_release(dst);
 		return ERR_PTR(-EINVAL);
@@ -564,7 +564,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 
 	if (!ipv6_unicast_destination(skb) &&
 	    !(net->ipv6.sysctl.anycast_src_echo_reply &&
-	      ipv6_anycast_destination(skb)))
+	      ipv6_anycast_destination(skb_dst(skb), saddr)))
 		saddr = NULL;
 
 	memcpy(&tmp_hdr, icmph, sizeof(tmp_hdr));
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 1b0e3ad..d823abf 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -105,7 +105,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	}
 
 	rcu_read_lock_bh();
-	nexthop = rt6_nexthop((struct rt6_info *)dst);
+	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
 	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);
 	if (unlikely(!neigh))
 		neigh = __neigh_create(&nd_tbl, nexthop, dst->dev, false);
@@ -934,7 +934,8 @@ static int ip6_dst_lookup_tail(struct sock *sk,
 	 */
 	rt = (struct rt6_info *) *dst;
 	rcu_read_lock_bh();
-	n = __ipv6_neigh_lookup_noref(rt->dst.dev, rt6_nexthop(rt));
+	n = __ipv6_neigh_lookup_noref(rt->dst.dev,
+				      rt6_nexthop(rt, &fl6->daddr));
 	err = n && !(n->nud_state & NUD_VALID) ? -EINVAL : 0;
 	rcu_read_unlock_bh();
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 98fce6f..34180f1 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1945,11 +1945,7 @@ static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
 		if (rt->rt6i_idev)
 			in6_dev_hold(rt->rt6i_idev);
 		rt->dst.lastuse = jiffies;
-
-		if (ort->rt6i_flags & RTF_GATEWAY)
-			rt->rt6i_gateway = ort->rt6i_gateway;
-		else
-			rt->rt6i_gateway = *dest;
+		rt->rt6i_gateway = ort->rt6i_gateway;
 		rt->rt6i_flags = ort->rt6i_flags;
 		rt6_set_from(rt, ort);
 		rt->rt6i_metric = 0;
diff --git a/net/netfilter/nf_conntrack_h323_main.c b/net/netfilter/nf_conntrack_h323_main.c
index 1d69f5b..9511af0 100644
--- a/net/netfilter/nf_conntrack_h323_main.c
+++ b/net/netfilter/nf_conntrack_h323_main.c
@@ -779,8 +779,8 @@ static int callforward_do_filter(struct net *net,
 				   flowi6_to_flowi(&fl1), false)) {
 			if (!afinfo->route(net, (struct dst_entry **)&rt2,
 					   flowi6_to_flowi(&fl2), false)) {
-				if (ipv6_addr_equal(rt6_nexthop(rt1),
-						    rt6_nexthop(rt2)) &&
+				if (ipv6_addr_equal(rt6_nexthop(rt1, &fl1.daddr),
+						    rt6_nexthop(rt2, &fl2.daddr)) &&
 				    rt1->dst.dev == rt2->dst.dev)
 					ret = 1;
 				dst_release(&rt2->dst);
diff --git a/net/netfilter/xt_addrtype.c b/net/netfilter/xt_addrtype.c
index fab6eea..5b4743c 100644
--- a/net/netfilter/xt_addrtype.c
+++ b/net/netfilter/xt_addrtype.c
@@ -73,7 +73,7 @@ static u32 match_lookup_rt6(struct net *net, const struct net_device *dev,
 
 	if (dev == NULL && rt->rt6i_flags & RTF_LOCAL)
 		ret |= XT_ADDRTYPE_LOCAL;
-	if (rt->rt6i_flags & RTF_ANYCAST)
+	if (ipv6_anycast_destination((struct dst_entry *)rt, addr))
 		ret |= XT_ADDRTYPE_ANYCAST;
 
 	dst_release(&rt->dst);
-- 
1.8.1

* [PATCH net-next v5 04/11] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2015-05-23  3:55 ` [PATCH net-next v5 03/11] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
@ 2015-05-23  3:55 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 05/11] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:55 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

Prep work for creating RTF_CACHE on exception only.  After this
patch, the same condition (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY))
is checked twice. This redundancy will be removed in a later patch.
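
To check that folding the two branches into one condition does not change
when a clone is allocated, the decision can be modelled in userspace and
compared over every flag combination.  This is a sketch under assumptions:
the SK_* bits are hypothetical, and DST_HOST is folded into the same flag
word here although the kernel keeps it in dst.flags:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for illustration only; not the kernel's values. */
enum {
	SK_NONEXTHOP = 1u << 0,
	SK_GATEWAY   = 1u << 1,
	SK_LOCAL     = 1u << 2,
	SK_DST_HOST  = 1u << 3,
};

static bool is_gw_or_nonexthop(unsigned int flags)
{
	return (flags & (SK_NONEXTHOP | SK_GATEWAY)) != 0;
}

/* Shape of the old code: two branches that each allocated a clone. */
static bool old_would_allocate(unsigned int flags)
{
	if (!(flags & (SK_NONEXTHOP | SK_GATEWAY)))
		return true;	/* rt6_alloc_cow() path */
	else if (!(flags & SK_DST_HOST) || !(flags & SK_LOCAL))
		return true;	/* rt6_alloc_clone() path */
	return false;		/* goto out2 */
}

/* Shape of the new code: one combined condition. */
static bool new_would_allocate(unsigned int flags)
{
	return !is_gw_or_nonexthop(flags) ||
	       !(flags & SK_DST_HOST) || !(flags & SK_LOCAL);
}

/* Returns true when both forms agree for every flag combination. */
static bool forms_agree(void)
{
	for (unsigned int flags = 0; flags < 16; flags++)
		if (old_would_allocate(flags) != new_would_allocate(flags))
			return false;
	return true;
}
```

The only case that skips allocation in either form is a gateway or
nonexthop route that is also both DST_HOST and local.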

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 net/ipv6/route.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 34180f1..575d112 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -655,6 +655,11 @@ static struct rt6_info *rt6_select(struct fib6_node *fn, int oif, int strict)
 	return match ? match : net->ipv6.ip6_null_entry;
 }
 
+static bool rt6_is_gw_or_nonexthop(const struct rt6_info *rt)
+{
+	return (rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY));
+}
+
 #ifdef CONFIG_IPV6_ROUTE_INFO
 int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
 		  const struct in6_addr *gwaddr)
@@ -833,9 +838,9 @@ int ip6_ins_rt(struct rt6_info *rt)
 	return __ip6_ins_rt(rt, &info, &mxc);
 }
 
-static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
-				      const struct in6_addr *daddr,
-				      const struct in6_addr *saddr)
+static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
+					   const struct in6_addr *daddr,
+					   const struct in6_addr *saddr)
 {
 	struct rt6_info *rt;
 
@@ -846,33 +851,24 @@ static struct rt6_info *rt6_alloc_cow(struct rt6_info *ort,
 	rt = ip6_rt_copy(ort, daddr);
 
 	if (rt) {
-		if (ort->rt6i_dst.plen != 128 &&
-		    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
-			rt->rt6i_flags |= RTF_ANYCAST;
-
 		rt->rt6i_flags |= RTF_CACHE;
 
+		if (!rt6_is_gw_or_nonexthop(ort)) {
+			if (ort->rt6i_dst.plen != 128 &&
+			    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
+				rt->rt6i_flags |= RTF_ANYCAST;
 #ifdef CONFIG_IPV6_SUBTREES
-		if (rt->rt6i_src.plen && saddr) {
-			rt->rt6i_src.addr = *saddr;
-			rt->rt6i_src.plen = 128;
-		}
+			if (rt->rt6i_src.plen && saddr) {
+				rt->rt6i_src.addr = *saddr;
+				rt->rt6i_src.plen = 128;
+			}
 #endif
+		}
 	}
 
 	return rt;
 }
 
-static struct rt6_info *rt6_alloc_clone(struct rt6_info *ort,
-					const struct in6_addr *daddr)
-{
-	struct rt6_info *rt = ip6_rt_copy(ort, daddr);
-
-	if (rt)
-		rt->rt6i_flags |= RTF_CACHE;
-	return rt;
-}
-
 static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int oif,
 				      struct flowi6 *fl6, int flags)
 {
@@ -918,10 +914,9 @@ redo_rt6_select:
 	if (rt->rt6i_flags & RTF_CACHE)
 		goto out2;
 
-	if (!(rt->rt6i_flags & (RTF_NONEXTHOP | RTF_GATEWAY)))
-		nrt = rt6_alloc_cow(rt, &fl6->daddr, &fl6->saddr);
-	else if (!(rt->dst.flags & DST_HOST) || !(rt->rt6i_flags & RTF_LOCAL))
-		nrt = rt6_alloc_clone(rt, &fl6->daddr);
+	if (!rt6_is_gw_or_nonexthop(rt) ||
+	    !(rt->dst.flags & DST_HOST) || !(rt->rt6i_flags & RTF_LOCAL))
+		nrt = ip6_rt_cache_alloc(rt, &fl6->daddr, &fl6->saddr);
 	else
 		goto out2;
 
-- 
1.8.1

* [PATCH net-next v5 05/11] ipv6: Only create RTF_CACHE routes after encountering pmtu exception
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2015-05-23  3:55 ` [PATCH net-next v5 04/11] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 06/11] ipv6: Add rt6_get_cookie() function Martin KaFai Lau
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

This patch creates RTF_CACHE routes only after encountering a pmtu
exception.

After ip6_rt_update_pmtu() has inserted the RTF_CACHE route into the fib6
tree, rt->rt6i_node->fn_sernum is bumped, which causes ip6_dst_check()
to fail and triggers a relookup.
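
The serial-number check can be pictured with a toy model: a holder of a
cached route remembers the node's serial number as a cookie, and any later
insertion into that node bumps the serial so the cookie no longer matches.
The types and function names below are hypothetical and only mirror the
idea, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy tree node carrying a serial number, in the spirit of fn_sernum. */
struct node_sketch {
	uint32_t sernum;
};

/* What a socket-like holder keeps: the node and the serial it saw. */
struct cached_dst_sketch {
	const struct node_sketch *node;
	uint32_t cookie;
};

/* Remember the route together with the node's current serial number. */
static void cache_route(struct cached_dst_sketch *c,
			const struct node_sketch *n)
{
	assert(c != NULL && n != NULL);
	c->node = n;
	c->cookie = n->sernum;
}

/* The cached route is usable only while the serial number is unchanged. */
static bool dst_still_valid(const struct cached_dst_sketch *c)
{
	return c->node && c->node->sernum == c->cookie;
}

/* Inserting an exception route into the node bumps its serial number. */
static void insert_exception(struct node_sketch *n)
{
	n->sernum++;
}
```

A holder whose check fails looks the route up again and, in this model,
would then pick up the newly inserted exception entry.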

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip6_route.h |   2 +-
 net/ipv6/ip6_fib.c      |   1 +
 net/ipv6/route.c        | 100 ++++++++++++++++++++++++------------------------
 3 files changed, 53 insertions(+), 50 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4caf7d6..784ee3d 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -202,7 +202,7 @@ static inline struct in6_addr *rt6_nexthop(struct rt6_info *rt,
 {
 	if (rt->rt6i_flags & RTF_GATEWAY)
 		return &rt->rt6i_gateway;
-	else if (rt->rt6i_flags & RTF_CACHE)
+	else if (unlikely(rt->rt6i_flags & RTF_CACHE))
 		return &rt->rt6i_dst.addr;
 	else
 		return daddr;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 96dbfff..7d66490 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -732,6 +732,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 					rt6_clean_expires(iter);
 				else
 					rt6_set_expires(iter, rt->dst.expires);
+				iter->rt6i_pmtu = rt->rt6i_pmtu;
 				return -EEXIST;
 			}
 			/* If we have the same destination and the same metric,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 575d112..77a2c1a 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -873,16 +873,13 @@ static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 				      struct flowi6 *fl6, int flags)
 {
 	struct fib6_node *fn, *saved_fn;
-	struct rt6_info *rt, *nrt;
+	struct rt6_info *rt;
 	int strict = 0;
-	int attempts = 3;
-	int err;
 
 	strict |= flags & RT6_LOOKUP_F_IFACE;
 	if (net->ipv6.devconf_all->forwarding == 0)
 		strict |= RT6_LOOKUP_F_REACHABLE;
 
-redo_fib6_lookup_lock:
 	read_lock_bh(&table->tb6_lock);
 
 	fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
@@ -901,46 +898,12 @@ redo_rt6_select:
 			strict &= ~RT6_LOOKUP_F_REACHABLE;
 			fn = saved_fn;
 			goto redo_rt6_select;
-		} else {
-			dst_hold(&rt->dst);
-			read_unlock_bh(&table->tb6_lock);
-			goto out2;
 		}
 	}
 
 	dst_hold(&rt->dst);
 	read_unlock_bh(&table->tb6_lock);
 
-	if (rt->rt6i_flags & RTF_CACHE)
-		goto out2;
-
-	if (!rt6_is_gw_or_nonexthop(rt) ||
-	    !(rt->dst.flags & DST_HOST) || !(rt->rt6i_flags & RTF_LOCAL))
-		nrt = ip6_rt_cache_alloc(rt, &fl6->daddr, &fl6->saddr);
-	else
-		goto out2;
-
-	ip6_rt_put(rt);
-	rt = nrt ? : net->ipv6.ip6_null_entry;
-
-	dst_hold(&rt->dst);
-	if (nrt) {
-		err = ip6_ins_rt(nrt);
-		if (!err)
-			goto out2;
-	}
-
-	if (--attempts <= 0)
-		goto out2;
-
-	/*
-	 * Race condition! In the gap, when table->tb6_lock was
-	 * released someone could insert this route.  Relookup.
-	 */
-	ip6_rt_put(rt);
-	goto redo_fib6_lookup_lock;
-
-out2:
 	rt6_dst_from_metrics_check(rt);
 	rt->dst.lastuse = jiffies;
 	rt->dst.__use++;
@@ -1113,24 +1076,63 @@ static void ip6_link_failure(struct sk_buff *skb)
 	}
 }
 
-static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
-			       struct sk_buff *skb, u32 mtu)
+static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
+{
+	struct net *net = dev_net(rt->dst.dev);
+
+	rt->rt6i_flags |= RTF_MODIFIED;
+	rt->rt6i_pmtu = mtu;
+	rt6_update_expires(rt, net->ipv6.sysctl.ip6_rt_mtu_expires);
+}
+
+static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
+				 const struct ipv6hdr *iph, u32 mtu)
 {
 	struct rt6_info *rt6 = (struct rt6_info *)dst;
 
-	dst_confirm(dst);
-	if (mtu < dst_mtu(dst) && (rt6->rt6i_flags & RTF_CACHE)) {
-		struct net *net = dev_net(dst->dev);
+	if (rt6->rt6i_flags & RTF_LOCAL)
+		return;
 
-		rt6->rt6i_flags |= RTF_MODIFIED;
-		if (mtu < IPV6_MIN_MTU)
-			mtu = IPV6_MIN_MTU;
+	dst_confirm(dst);
+	mtu = max_t(u32, mtu, IPV6_MIN_MTU);
+	if (mtu >= dst_mtu(dst))
+		return;
 
-		rt6->rt6i_pmtu = mtu;
-		rt6_update_expires(rt6, net->ipv6.sysctl.ip6_rt_mtu_expires);
+	if (rt6->rt6i_flags & RTF_CACHE) {
+		rt6_do_update_pmtu(rt6, mtu);
+	} else {
+		const struct in6_addr *daddr, *saddr;
+		struct rt6_info *nrt6;
+
+		if (iph) {
+			daddr = &iph->daddr;
+			saddr = &iph->saddr;
+		} else if (sk) {
+			daddr = &sk->sk_v6_daddr;
+			saddr = &inet6_sk(sk)->saddr;
+		} else {
+			return;
+		}
+		nrt6 = ip6_rt_cache_alloc(rt6, daddr, saddr);
+		if (nrt6) {
+			rt6_do_update_pmtu(nrt6, mtu);
+
+			/* ip6_ins_rt(nrt6) will bump the
+			 * rt6->rt6i_node->fn_sernum
+			 * which will fail the next rt6_check() and
+			 * invalidate the sk->sk_dst_cache.
+			 */
+			ip6_ins_rt(nrt6);
+		}
 	}
 }
 
+static void ip6_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
+			       struct sk_buff *skb, u32 mtu)
+{
+	__ip6_rt_update_pmtu(dst, sk, skb ? ipv6_hdr(skb) : NULL, mtu);
+}
+
 void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
 		     int oif, u32 mark)
 {
@@ -1147,7 +1149,7 @@ void ip6_update_pmtu(struct sk_buff *skb, struct net *net, __be32 mtu,
 
 	dst = ip6_route_output(net, NULL, &fl6);
 	if (!dst->error)
-		ip6_rt_update_pmtu(dst, NULL, skb, ntohl(mtu));
+		__ip6_rt_update_pmtu(dst, NULL, iph, ntohl(mtu));
 	dst_release(dst);
 }
 EXPORT_SYMBOL_GPL(ip6_update_pmtu);
-- 
1.8.1

* [PATCH net-next v5 06/11] ipv6: Add rt6_get_cookie() function
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (4 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 05/11] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 07/11] ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags Martin KaFai Lau
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

Instead of open-coding the rt6->rt6i_node check whenever we need
to get the route's cookie, refactor it into rt6_get_cookie().
This is prep work for handling FLOWI_FLAG_KNOWN_NH and the
percpu rt6_info later.
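
For readers unfamiliar with the cookie scheme, the following is a minimal
user-space sketch, not kernel code.  The names node_model, route_model,
route_get_cookie() and route_cookie_valid() are hypothetical stand-ins; only
the "node ? node->fn_sernum : 0" expression is taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct fib6_node. */
struct node_model {
	uint32_t fn_sernum;	/* serial number, changes when the tree changes */
};

/* Hypothetical stand-in for struct rt6_info. */
struct route_model {
	struct node_model *node;	/* NULL when not linked into the tree */
};

/* Same shape as rt6_get_cookie(): 0 when the route has no node. */
static uint32_t route_get_cookie(const struct route_model *rt)
{
	assert(rt != NULL);
	return rt->node ? rt->node->fn_sernum : 0;
}

/* A stored cookie is only valid while the serial number is unchanged. */
static int route_cookie_valid(const struct route_model *rt, uint32_t cookie)
{
	assert(rt != NULL);
	return rt->node != NULL && rt->node->fn_sernum == cookie;
}
```

A caller stores the value of route_get_cookie() next to its cached route and
later passes it back to the validity check; once the serial number has been
bumped, the check fails and the caller has to redo the route look-up.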

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip6_fib.h           | 5 +++++
 include/net/ip6_route.h         | 2 +-
 net/ipv6/ip6_tunnel.c           | 2 +-
 net/ipv6/tcp_ipv6.c             | 3 +--
 net/ipv6/xfrm6_policy.c         | 6 ++----
 net/netfilter/ipvs/ip_vs_xmit.c | 2 +-
 net/sctp/ipv6.c                 | 2 +-
 7 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index e000180..a4bece6 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -159,6 +159,11 @@ static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
 	rt0->rt6i_flags |= RTF_EXPIRES;
 }
 
+static inline u32 rt6_get_cookie(const struct rt6_info *rt)
+{
+	return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+}
+
 static inline void ip6_rt_put(struct rt6_info *rt)
 {
 	/* dst_release() accepts a NULL parameter.
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 784ee3d..297629a 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -145,7 +145,7 @@ static inline void __ip6_dst_store(struct sock *sk, struct dst_entry *dst,
 #ifdef CONFIG_IPV6_SUBTREES
 	np->saddr_cache = saddr;
 #endif
-	np->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+	np->dst_cookie = rt6_get_cookie(rt);
 }
 
 static inline void ip6_dst_store(struct sock *sk, struct dst_entry *dst,
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 5cafd92..2e67b66 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -151,7 +151,7 @@ EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *) dst;
-	t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+	t->dst_cookie = rt6_get_cookie(rt);
 	dst_release(t->dst_cache);
 	t->dst_cache = dst;
 }
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2275999..c656c03 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -99,8 +99,7 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 		dst_hold(dst);
 		sk->sk_rx_dst = dst;
 		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
-		if (rt->rt6i_node)
-			inet6_sk(sk)->rx_dst_cookie = rt->rt6i_node->fn_sernum;
+		inet6_sk(sk)->rx_dst_cookie = rt6_get_cookie(rt);
 	}
 }
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 6ae256b..ed0583c 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -76,8 +76,7 @@ static int xfrm6_init_path(struct xfrm_dst *path, struct dst_entry *dst,
 {
 	if (dst->ops->family == AF_INET6) {
 		struct rt6_info *rt = (struct rt6_info *)dst;
-		if (rt->rt6i_node)
-			path->path_cookie = rt->rt6i_node->fn_sernum;
+		path->path_cookie = rt6_get_cookie(rt);
 	}
 
 	path->u.rt6.rt6i_nfheader_len = nfheader_len;
@@ -105,8 +104,7 @@ static int xfrm6_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
 						   RTF_LOCAL);
 	xdst->u.rt6.rt6i_metric = rt->rt6i_metric;
 	xdst->u.rt6.rt6i_node = rt->rt6i_node;
-	if (rt->rt6i_node)
-		xdst->route_cookie = rt->rt6i_node->fn_sernum;
+	xdst->route_cookie = rt6_get_cookie(rt);
 	xdst->u.rt6.rt6i_gateway = rt->rt6i_gateway;
 	xdst->u.rt6.rt6i_dst = rt->rt6i_dst;
 	xdst->u.rt6.rt6i_src = rt->rt6i_src;
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 38f8627..5eff9f6 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -435,7 +435,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 				goto err_unreach;
 			}
 			rt = (struct rt6_info *) dst;
-			cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+			cookie = rt6_get_cookie(rt);
 			__ip_vs_dst_set(dest, dest_dst, &rt->dst, cookie);
 			spin_unlock_bh(&dest->dst_lock);
 			IP_VS_DBG(10, "new dst %pI6, src %pI6, refcnt=%d\n",
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 17a0120..e917d27 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -331,7 +331,7 @@ out:
 
 		rt = (struct rt6_info *)dst;
 		t->dst = dst;
-		t->dst_cookie = rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
+		t->dst_cookie = rt6_get_cookie(rt);
 		pr_debug("rt6_dst:%pI6/%d rt6_src:%pI6\n",
 			 &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			 &fl6->saddr);
-- 
1.8.1

* [PATCH net-next v5 07/11] ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (5 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 06/11] ipv6: Add rt6_get_cookie() function Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 08/11] ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set Martin KaFai Lau
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

The neighbor look-up used to depend on the rt6i_gateway (if
there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone)
as the nexthop address.  Note that rt6i_dst is set to fl6->daddr
for the RTF_CACHE clone where fl6->daddr is the one used to do
the route look-up.

Now, we only create a RTF_CACHE clone after encountering an exception.
When doing the neighbor look-up with a route that is neither a gateway
nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop.

In some cases, the daddr in skb is not the one used to do
the route look-up.  One example is in ip_vs_dr_xmit_v6() where the
real nexthop server address is different from the one in the skb.

This patch follows the IPv4 approach and asks the
ip6_pol_route() callers to set FLOWI_FLAG_KNOWN_NH properly.

In the next patch, ip6_pol_route() will honor FLOWI_FLAG_KNOWN_NH
and create a RTF_CACHE clone.
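
The nexthop choice described above can be modelled in user space roughly as
follows.  This is only a sketch: route_model, M_GATEWAY, M_CACHE and
model_nexthop() are hypothetical names that mirror the order of the checks in
rt6_nexthop(), and addresses are plain strings for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define M_GATEWAY 0x1u	/* hypothetical stand-in for RTF_GATEWAY */
#define M_CACHE   0x2u	/* hypothetical stand-in for RTF_CACHE */

/* Hypothetical stand-in for struct rt6_info. */
struct route_model {
	unsigned int flags;
	const char *gateway;	/* models rt6i_gateway */
	const char *dst;	/* models rt6i_dst.addr */
};

/* Gateway first, then the clone's own dst, else the daddr from the skb. */
static const char *model_nexthop(const struct route_model *rt,
				 const char *skb_daddr)
{
	assert(rt != NULL);
	if (rt->flags & M_GATEWAY)
		return rt->gateway;
	if (rt->flags & M_CACHE)
		return rt->dst;
	return skb_daddr;
}
```

With an on-link route that carries neither flag, the function returns the
daddr from the skb, which is the wrong neighbor whenever the route look-up
used a different address.  A clone whose dst holds the look-up address gives
the intended nexthop.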

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
---
 net/ipv6/raw.c                  |  3 +++
 net/netfilter/ipvs/ip_vs_xmit.c | 13 +++++++++----
 net/netfilter/xt_TEE.c          |  1 +
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 8072bd4..484a5c1 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -865,6 +865,9 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		fl6.flowi6_oif = np->ucast_oif;
 	security_sk_classify_flow(sk, flowi6_to_flowi(&fl6));
 
+	if (inet->hdrincl)
+		fl6.flowi6_flags |= FLOWI_FLAG_KNOWN_NH;
+
 	dst = ip6_dst_lookup_flow(sk, &fl6, final_p);
 	if (IS_ERR(dst)) {
 		err = PTR_ERR(dst);
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 5eff9f6..bf66a86 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -364,13 +364,16 @@ err_unreach:
 #ifdef CONFIG_IP_VS_IPV6
 static struct dst_entry *
 __ip_vs_route_output_v6(struct net *net, struct in6_addr *daddr,
-			struct in6_addr *ret_saddr, int do_xfrm)
+			struct in6_addr *ret_saddr, int do_xfrm, int rt_mode)
 {
 	struct dst_entry *dst;
 	struct flowi6 fl6 = {
 		.daddr = *daddr,
 	};
 
+	if (rt_mode & IP_VS_RT_MODE_KNOWN_NH)
+		fl6.flowi6_flags = FLOWI_FLAG_KNOWN_NH;
+
 	dst = ip6_route_output(net, NULL, &fl6);
 	if (dst->error)
 		goto out_err;
@@ -427,7 +430,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 			}
 			dst = __ip_vs_route_output_v6(net, &dest->addr.in6,
 						      &dest_dst->dst_saddr.in6,
-						      do_xfrm);
+						      do_xfrm, rt_mode);
 			if (!dst) {
 				__ip_vs_dst_set(dest, NULL, NULL, 0);
 				spin_unlock_bh(&dest->dst_lock);
@@ -446,7 +449,8 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 			*ret_saddr = dest_dst->dst_saddr.in6;
 	} else {
 		noref = 0;
-		dst = __ip_vs_route_output_v6(net, daddr, ret_saddr, do_xfrm);
+		dst = __ip_vs_route_output_v6(net, daddr, ret_saddr, do_xfrm,
+					      rt_mode);
 		if (!dst)
 			goto err_unreach;
 		rt = (struct rt6_info *) dst;
@@ -1164,7 +1168,8 @@ ip_vs_dr_xmit_v6(struct sk_buff *skb, struct ip_vs_conn *cp,
 	local = __ip_vs_get_out_rt_v6(cp->af, skb, cp->dest, &cp->daddr.in6,
 				      NULL, ipvsh, 0,
 				      IP_VS_RT_MODE_LOCAL |
-				      IP_VS_RT_MODE_NON_LOCAL);
+				      IP_VS_RT_MODE_NON_LOCAL |
+				      IP_VS_RT_MODE_KNOWN_NH);
 	if (local < 0)
 		goto tx_error;
 	if (local) {
diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 292934d..a747eb4 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -152,6 +152,7 @@ tee_tg_route6(struct sk_buff *skb, const struct xt_tee_tginfo *info)
 	fl6.daddr = info->gw.in6;
 	fl6.flowlabel = ((iph->flow_lbl[0] & 0xF) << 16) |
 			   (iph->flow_lbl[1] << 8) | iph->flow_lbl[2];
+	fl6.flowi6_flags = FLOWI_FLAG_KNOWN_NH;
 	dst = ip6_route_output(net, NULL, &fl6);
 	if (dst->error) {
 		dst_release(dst);
-- 
1.8.1

* [PATCH net-next v5 08/11] ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (6 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 07/11] ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 09/11] ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister Martin KaFai Lau
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

When FLOWI_FLAG_KNOWN_NH is set and the looked-up route is neither a
gateway route nor already a RTF_CACHE clone, this patch creates a
RTF_CACHE clone with DST_NOCACHE so that rt6i_dst is set to fl6->daddr.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
---
 include/net/ip6_fib.h |  3 +++
 net/ipv6/route.c      | 59 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index a4bece6..5556111 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -161,6 +161,9 @@ static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
 
 static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 {
+	if (unlikely(rt->dst.flags & DST_NOCACHE))
+		rt = (struct rt6_info *)(rt->dst.from);
+
 	return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
 }
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 77a2c1a..6880378 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -901,13 +901,34 @@ redo_rt6_select:
 		}
 	}
 
-	dst_hold(&rt->dst);
+	dst_use(&rt->dst, jiffies);
 	read_unlock_bh(&table->tb6_lock);
 
-	rt6_dst_from_metrics_check(rt);
-	rt->dst.lastuse = jiffies;
-	rt->dst.__use++;
+	if (rt == net->ipv6.ip6_null_entry || (rt->rt6i_flags & RTF_CACHE)) {
+		goto done;
+	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
+			    !(rt->rt6i_flags & RTF_GATEWAY))) {
+		/* Create a RTF_CACHE clone which will not be
+		 * owned by the fib6 tree.  It is for the special case where
+		 * the daddr in the skb during the neighbor look-up is different
+		 * from the fl6->daddr used to look-up route here.
+		 */
+
+		struct rt6_info *uncached_rt;
+
+		uncached_rt = ip6_rt_cache_alloc(rt, &fl6->daddr, NULL);
+		dst_release(&rt->dst);
+
+		if (uncached_rt)
+			uncached_rt->dst.flags |= DST_NOCACHE;
+		else
+			uncached_rt = net->ipv6.ip6_null_entry;
+		dst_hold(&uncached_rt->dst);
+		return uncached_rt;
+	}
 
+done:
+	rt6_dst_from_metrics_check(rt);
 	return rt;
 }
 
@@ -1019,6 +1040,26 @@ static void rt6_dst_from_metrics_check(struct rt6_info *rt)
 		dst_init_metrics(&rt->dst, dst_metrics_ptr(rt->dst.from), true);
 }
 
+static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
+{
+	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
+		return NULL;
+
+	if (rt6_check_expired(rt))
+		return NULL;
+
+	return &rt->dst;
+}
+
+static struct dst_entry *rt6_dst_from_check(struct rt6_info *rt, u32 cookie)
+{
+	if (rt->dst.obsolete == DST_OBSOLETE_FORCE_CHK &&
+	    rt6_check((struct rt6_info *)(rt->dst.from), cookie))
+		return &rt->dst;
+	else
+		return NULL;
+}
+
 static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 {
 	struct rt6_info *rt;
@@ -1029,15 +1070,13 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 	 * DST_OBSOLETE_FORCE_CHK which forces validation calls down
 	 * into this function always.
 	 */
-	if (!rt->rt6i_node || (rt->rt6i_node->fn_sernum != cookie))
-		return NULL;
-
-	if (rt6_check_expired(rt))
-		return NULL;
 
 	rt6_dst_from_metrics_check(rt);
 
-	return dst;
+	if (unlikely(dst->flags & DST_NOCACHE))
+		return rt6_dst_from_check(rt, cookie);
+	else
+		return rt6_check(rt, cookie);
 }
 
 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
-- 
1.8.1

* [PATCH net-next v5 09/11] ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (7 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 08/11] ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 10/11] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

This patch keeps track of the DST_NOCACHE routes in a percpu list and
replaces their dev with loopback during the iface down/unregister event.
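
The flush step can be illustrated with a small user-space model.  It is a
sketch under simplifying assumptions: a single singly-linked list without
locking stands in for the percpu lists, and dev_model, route_model and
model_flush_dev() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device with a bare refcount. */
struct dev_model {
	int refcnt;
};

/* Hypothetical stand-in for an uncached route on the list. */
struct route_model {
	struct dev_model *dev;
	struct route_model *next;
};

/*
 * Repoint every route that uses 'dev' (or any device when 'dev' is NULL)
 * to 'loopback', moving the device reference along with it.
 * Returns the number of routes that were changed.
 */
static int model_flush_dev(struct route_model *head, struct dev_model *dev,
			   struct dev_model *loopback)
{
	int moved = 0;

	assert(loopback != NULL);
	for (struct route_model *rt = head; rt; rt = rt->next) {
		struct dev_model *old = rt->dev;

		if (old && (old == dev || !dev) && old != loopback) {
			rt->dev = loopback;
			loopback->refcnt++;	/* models dev_hold() */
			old->refcnt--;		/* models dev_put() */
			moved++;
		}
	}
	return moved;
}
```

After the flush, no route on the list holds a reference to the device that is
going away, which is what the device needs in order to finish unregistering.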

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip6_fib.h |  3 ++
 net/ipv6/route.c      | 78 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 5556111..cc8f03c 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -120,6 +120,9 @@ struct rt6_info {
 	struct rt6key			rt6i_src;
 	struct rt6key			rt6i_prefsrc;
 
+	struct list_head		rt6i_uncached;
+	struct uncached_list		*rt6i_uncached_list;
+
 	struct inet6_dev		*rt6i_idev;
 
 	u32				rt6i_metric;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 6880378..3efc147 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -105,6 +105,67 @@ static struct rt6_info *rt6_get_route_info(struct net *net,
 					   const struct in6_addr *gwaddr, int ifindex);
 #endif
 
+struct uncached_list {
+	spinlock_t		lock;
+	struct list_head	head;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct uncached_list, rt6_uncached_list);
+
+static void rt6_uncached_list_add(struct rt6_info *rt)
+{
+	struct uncached_list *ul = raw_cpu_ptr(&rt6_uncached_list);
+
+	rt->dst.flags |= DST_NOCACHE;
+	rt->rt6i_uncached_list = ul;
+
+	spin_lock_bh(&ul->lock);
+	list_add_tail(&rt->rt6i_uncached, &ul->head);
+	spin_unlock_bh(&ul->lock);
+}
+
+static void rt6_uncached_list_del(struct rt6_info *rt)
+{
+	if (!list_empty(&rt->rt6i_uncached)) {
+		struct uncached_list *ul = rt->rt6i_uncached_list;
+
+		spin_lock_bh(&ul->lock);
+		list_del(&rt->rt6i_uncached);
+		spin_unlock_bh(&ul->lock);
+	}
+}
+
+static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev)
+{
+	struct net_device *loopback_dev = net->loopback_dev;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
+		struct rt6_info *rt;
+
+		spin_lock_bh(&ul->lock);
+		list_for_each_entry(rt, &ul->head, rt6i_uncached) {
+			struct inet6_dev *rt_idev = rt->rt6i_idev;
+			struct net_device *rt_dev = rt->dst.dev;
+
+			if (rt_idev && (rt_idev->dev == dev || !dev) &&
+			    rt_idev->dev != loopback_dev) {
+				rt->rt6i_idev = in6_dev_get(loopback_dev);
+				in6_dev_put(rt_idev);
+			}
+
+			if (rt_dev && (rt_dev == dev || !dev) &&
+			    rt_dev != loopback_dev) {
+				rt->dst.dev = loopback_dev;
+				dev_hold(rt->dst.dev);
+				dev_put(rt_dev);
+			}
+		}
+		spin_unlock_bh(&ul->lock);
+	}
+}
+
 static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
@@ -262,6 +323,7 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 
 		memset(dst + 1, 0, sizeof(*rt) - sizeof(*dst));
 		INIT_LIST_HEAD(&rt->rt6i_siblings);
+		INIT_LIST_HEAD(&rt->rt6i_uncached);
 	}
 	return rt;
 }
@@ -269,11 +331,14 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
-	struct inet6_dev *idev = rt->rt6i_idev;
 	struct dst_entry *from = dst->from;
+	struct inet6_dev *idev;
 
 	dst_destroy_metrics_generic(dst);
 
+	rt6_uncached_list_del(rt);
+
+	idev = rt->rt6i_idev;
 	if (idev) {
 		rt->rt6i_idev = NULL;
 		in6_dev_put(idev);
@@ -920,7 +985,7 @@ redo_rt6_select:
 		dst_release(&rt->dst);
 
 		if (uncached_rt)
-			uncached_rt->dst.flags |= DST_NOCACHE;
+			rt6_uncached_list_add(uncached_rt);
 		else
 			uncached_rt = net->ipv6.ip6_null_entry;
 		dst_hold(&uncached_rt->dst);
@@ -2367,6 +2432,7 @@ void rt6_ifdown(struct net *net, struct net_device *dev)
 
 	fib6_clean_all(net, fib6_ifdown, &adn);
 	icmp6_clean_all(fib6_ifdown, &adn);
+	rt6_uncached_list_flush_dev(net, dev);
 }
 
 struct rt6_mtu_change_arg {
@@ -3259,6 +3325,7 @@ static struct notifier_block ip6_route_dev_notifier = {
 int __init ip6_route_init(void)
 {
 	int ret;
+	int cpu;
 
 	ret = -ENOMEM;
 	ip6_dst_ops_template.kmem_cachep =
@@ -3318,6 +3385,13 @@ int __init ip6_route_init(void)
 	if (ret)
 		goto out_register_late_subsys;
 
+	for_each_possible_cpu(cpu) {
+		struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
+
+		INIT_LIST_HEAD(&ul->head);
+		spin_lock_init(&ul->lock);
+	}
+
 out:
 	return ret;
 
-- 
1.8.1

* [PATCH net-next v5 10/11] ipv6: Break up ip6_rt_copy()
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (8 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 09/11] ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-23  3:56 ` [PATCH net-next v5 11/11] ipv6: Create percpu rt6_info Martin KaFai Lau
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

This patch breaks up ip6_rt_copy() into ip6_rt_copy_init() and
ip6_rt_cache_alloc().

In a later patch, we need to create a percpu rt6_info copy. Hence,
refactor the common rt6_info init code into ip6_rt_copy_init().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 net/ipv6/route.c | 90 ++++++++++++++++++++++++++------------------------------
 1 file changed, 41 insertions(+), 49 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3efc147..3e33ddb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -72,8 +72,7 @@ enum rt6_nud_state {
 	RT6_NUD_SUCCEED = 1
 };
 
-static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
-				    const struct in6_addr *dest);
+static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort);
 static struct dst_entry	*ip6_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int	 ip6_default_advmss(const struct dst_entry *dst);
 static unsigned int	 ip6_mtu(const struct dst_entry *dst);
@@ -913,22 +912,32 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 	 *	Clone the route.
 	 */
 
-	rt = ip6_rt_copy(ort, daddr);
+	if (ort->rt6i_flags & RTF_CACHE)
+		ort = (struct rt6_info *)ort->dst.from;
 
-	if (rt) {
-		rt->rt6i_flags |= RTF_CACHE;
+	rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
+			   0, ort->rt6i_table);
+
+	if (!rt)
+		return NULL;
+
+	ip6_rt_copy_init(rt, ort);
+	rt->rt6i_flags |= RTF_CACHE;
+	rt->rt6i_metric = 0;
+	rt->dst.flags |= DST_HOST;
+	rt->rt6i_dst.addr = *daddr;
+	rt->rt6i_dst.plen = 128;
 
-		if (!rt6_is_gw_or_nonexthop(ort)) {
-			if (ort->rt6i_dst.plen != 128 &&
-			    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
-				rt->rt6i_flags |= RTF_ANYCAST;
+	if (!rt6_is_gw_or_nonexthop(ort)) {
+		if (ort->rt6i_dst.plen != 128 &&
+		    ipv6_addr_equal(&ort->rt6i_dst.addr, daddr))
+			rt->rt6i_flags |= RTF_ANYCAST;
 #ifdef CONFIG_IPV6_SUBTREES
-			if (rt->rt6i_src.plen && saddr) {
-				rt->rt6i_src.addr = *saddr;
-				rt->rt6i_src.plen = 128;
-			}
-#endif
+		if (rt->rt6i_src.plen && saddr) {
+			rt->rt6i_src.addr = *saddr;
+			rt->rt6i_src.plen = 128;
 		}
+#endif
 	}
 
 	return rt;
@@ -1980,7 +1989,7 @@ static void rt6_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_bu
 				     NEIGH_UPDATE_F_ISROUTER))
 		     );
 
-	nrt = ip6_rt_copy(rt, &msg->dest);
+	nrt = ip6_rt_cache_alloc(rt, &msg->dest, NULL);
 	if (!nrt)
 		goto out;
 
@@ -2022,42 +2031,25 @@ static void rt6_set_from(struct rt6_info *rt, struct rt6_info *from)
 	dst_init_metrics(&rt->dst, dst_metrics_ptr(&from->dst), true);
 }
 
-static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
-				    const struct in6_addr *dest)
-{
-	struct net *net = dev_net(ort->dst.dev);
-	struct rt6_info *rt;
-
-	if (ort->rt6i_flags & RTF_CACHE)
-		ort = (struct rt6_info *)ort->dst.from;
-
-	rt = ip6_dst_alloc(net, ort->dst.dev, 0,
-			   ort->rt6i_table);
-
-	if (rt) {
-		rt->dst.input = ort->dst.input;
-		rt->dst.output = ort->dst.output;
-		rt->dst.flags |= DST_HOST;
-
-		rt->rt6i_dst.addr = *dest;
-		rt->rt6i_dst.plen = 128;
-		rt->dst.error = ort->dst.error;
-		rt->rt6i_idev = ort->rt6i_idev;
-		if (rt->rt6i_idev)
-			in6_dev_hold(rt->rt6i_idev);
-		rt->dst.lastuse = jiffies;
-		rt->rt6i_gateway = ort->rt6i_gateway;
-		rt->rt6i_flags = ort->rt6i_flags;
-		rt6_set_from(rt, ort);
-		rt->rt6i_metric = 0;
-
+static void ip6_rt_copy_init(struct rt6_info *rt, struct rt6_info *ort)
+{
+	rt->dst.input = ort->dst.input;
+	rt->dst.output = ort->dst.output;
+	rt->rt6i_dst = ort->rt6i_dst;
+	rt->dst.error = ort->dst.error;
+	rt->rt6i_idev = ort->rt6i_idev;
+	if (rt->rt6i_idev)
+		in6_dev_hold(rt->rt6i_idev);
+	rt->dst.lastuse = jiffies;
+	rt->rt6i_gateway = ort->rt6i_gateway;
+	rt->rt6i_flags = ort->rt6i_flags;
+	rt6_set_from(rt, ort);
+	rt->rt6i_metric = ort->rt6i_metric;
 #ifdef CONFIG_IPV6_SUBTREES
-		memcpy(&rt->rt6i_src, &ort->rt6i_src, sizeof(struct rt6key));
+	rt->rt6i_src = ort->rt6i_src;
 #endif
-		memcpy(&rt->rt6i_prefsrc, &ort->rt6i_prefsrc, sizeof(struct rt6key));
-		rt->rt6i_table = ort->rt6i_table;
-	}
-	return rt;
+	rt->rt6i_prefsrc = ort->rt6i_prefsrc;
+	rt->rt6i_table = ort->rt6i_table;
 }
 
 #ifdef CONFIG_IPV6_ROUTE_INFO
-- 
1.8.1

* [PATCH net-next v5 11/11] ipv6: Create percpu rt6_info
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (9 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 10/11] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
@ 2015-05-23  3:56 ` Martin KaFai Lau
  2015-05-25 17:34 ` [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
  2015-07-29  9:25 ` Alexander Holler
  12 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-23  3:56 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

After the patch
'ipv6: Only create RTF_CACHE routes after encountering pmtu exception',
we need to compensate for the performance hit (bouncing dst->__refcnt).
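
The general idea of a lazily filled percpu slot can be sketched in user space
with C11 atomics.  This is an assumption-laden model, not the kernel
implementation: route_model, MODEL_NR_CPUS and model_get_pcpu_route() are
hypothetical, and atomic_compare_exchange_strong() stands in for the kernel's
cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4	/* hypothetical fixed CPU count */

/* Hypothetical stand-in for struct rt6_info with one slot per CPU. */
struct route_model {
	struct route_model *from;	/* parent route for a clone */
	_Atomic(struct route_model *) pcpu[MODEL_NR_CPUS];
};

/*
 * Return the clone cached in this CPU's slot, creating it on first use.
 * If another context filled the slot first, drop our clone and return
 * the one that is already there.
 */
static struct route_model *model_get_pcpu_route(struct route_model *rt,
						int cpu)
{
	struct route_model *clone, *prev = NULL;

	assert(rt != NULL && cpu >= 0 && cpu < MODEL_NR_CPUS);

	clone = atomic_load(&rt->pcpu[cpu]);
	if (clone)
		return clone;

	clone = calloc(1, sizeof(*clone));
	if (!clone)
		return NULL;
	clone->from = rt;

	if (!atomic_compare_exchange_strong(&rt->pcpu[cpu], &prev, clone)) {
		/* Lost the race: 'prev' now holds the existing clone. */
		free(clone);
		return prev;
	}
	return clone;
}
```

Because each CPU gets its own clone, reference counting in the fast path
touches a per-CPU object instead of a counter shared by all CPUs.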

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip6_fib.h           |   3 +-
 include/uapi/linux/ipv6_route.h |   1 +
 net/ipv6/ip6_fib.c              |  24 +++++++-
 net/ipv6/route.c                | 132 +++++++++++++++++++++++++++++++++++-----
 4 files changed, 142 insertions(+), 18 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index cc8f03c..3b76849 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -124,6 +124,7 @@ struct rt6_info {
 	struct uncached_list		*rt6i_uncached_list;
 
 	struct inet6_dev		*rt6i_idev;
+	struct rt6_info * __percpu	*rt6i_pcpu;
 
 	u32				rt6i_metric;
 	u32				rt6i_pmtu;
@@ -164,7 +165,7 @@ static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
 
 static inline u32 rt6_get_cookie(const struct rt6_info *rt)
 {
-	if (unlikely(rt->dst.flags & DST_NOCACHE))
+	if (rt->rt6i_flags & RTF_PCPU || unlikely(rt->dst.flags & DST_NOCACHE))
 		rt = (struct rt6_info *)(rt->dst.from);
 
 	return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
index 2be7bd1..f6598d1 100644
--- a/include/uapi/linux/ipv6_route.h
+++ b/include/uapi/linux/ipv6_route.h
@@ -34,6 +34,7 @@
 #define RTF_PREF(pref)	((pref) << 27)
 #define RTF_PREF_MASK	0x18000000
 
+#define RTF_PCPU	0x40000000
 #define RTF_LOCAL	0x80000000
 
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 7d66490..bab2490 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -154,10 +154,32 @@ static void node_free(struct fib6_node *fn)
 	kmem_cache_free(fib6_node_kmem, fn);
 }
 
+static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
+{
+	int cpu;
+
+	if (!non_pcpu_rt->rt6i_pcpu)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct rt6_info **ppcpu_rt;
+		struct rt6_info *pcpu_rt;
+
+		ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
+		pcpu_rt = *ppcpu_rt;
+		if (pcpu_rt) {
+			dst_free(&pcpu_rt->dst);
+			*ppcpu_rt = NULL;
+		}
+	}
+}
+
 static void rt6_release(struct rt6_info *rt)
 {
-	if (atomic_dec_and_test(&rt->rt6i_ref))
+	if (atomic_dec_and_test(&rt->rt6i_ref)) {
+		rt6_free_pcpu(rt);
 		dst_free(&rt->dst);
+	}
 }
 
 static void fib6_link_table(struct net *net, struct fib6_table *tb)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3e33ddb..b3fbef8 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -165,11 +165,18 @@ static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev)
 	}
 }
 
+static u32 *rt6_pcpu_cow_metrics(struct rt6_info *rt)
+{
+	return dst_metrics_write_ptr(rt->dst.from);
+}
+
 static u32 *ipv6_cow_metrics(struct dst_entry *dst, unsigned long old)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
 
-	if (rt->rt6i_flags & RTF_CACHE)
+	if (rt->rt6i_flags & RTF_PCPU)
+		return rt6_pcpu_cow_metrics(rt);
+	else if (rt->rt6i_flags & RTF_CACHE)
 		return NULL;
 	else
 		return dst_cow_metrics_generic(dst, old);
@@ -309,10 +316,10 @@ static const struct rt6_info ip6_blk_hole_entry_template = {
 #endif
 
 /* allocate dst with ip6_dst_ops */
-static inline struct rt6_info *ip6_dst_alloc(struct net *net,
-					     struct net_device *dev,
-					     int flags,
-					     struct fib6_table *table)
+static struct rt6_info *__ip6_dst_alloc(struct net *net,
+					struct net_device *dev,
+					int flags,
+					struct fib6_table *table)
 {
 	struct rt6_info *rt = dst_alloc(&net->ipv6.ip6_dst_ops, dev,
 					0, DST_OBSOLETE_FORCE_CHK, flags);
@@ -327,6 +334,34 @@ static inline struct rt6_info *ip6_dst_alloc(struct net *net,
 	return rt;
 }
 
+static struct rt6_info *ip6_dst_alloc(struct net *net,
+				      struct net_device *dev,
+				      int flags,
+				      struct fib6_table *table)
+{
+	struct rt6_info *rt = __ip6_dst_alloc(net, dev, flags, table);
+
+	if (rt) {
+		rt->rt6i_pcpu = alloc_percpu_gfp(struct rt6_info *, GFP_ATOMIC);
+		if (rt->rt6i_pcpu) {
+			int cpu;
+
+			for_each_possible_cpu(cpu) {
+				struct rt6_info **p;
+
+				p = per_cpu_ptr(rt->rt6i_pcpu, cpu);
+				/* no one shares rt */
+				*p =  NULL;
+			}
+		} else {
+			dst_destroy((struct dst_entry *)rt);
+			return NULL;
+		}
+	}
+
+	return rt;
+}
+
 static void ip6_dst_destroy(struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
@@ -335,6 +370,9 @@ static void ip6_dst_destroy(struct dst_entry *dst)
 
 	dst_destroy_metrics_generic(dst);
 
+	if (rt->rt6i_pcpu)
+		free_percpu(rt->rt6i_pcpu);
+
 	rt6_uncached_list_del(rt);
 
 	idev = rt->rt6i_idev;
@@ -912,11 +950,11 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 	 *	Clone the route.
 	 */
 
-	if (ort->rt6i_flags & RTF_CACHE)
+	if (ort->rt6i_flags & (RTF_CACHE | RTF_PCPU))
 		ort = (struct rt6_info *)ort->dst.from;
 
-	rt = ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
-			   0, ort->rt6i_table);
+	rt = __ip6_dst_alloc(dev_net(ort->dst.dev), ort->dst.dev,
+			     0, ort->rt6i_table);
 
 	if (!rt)
 		return NULL;
@@ -943,6 +981,54 @@ static struct rt6_info *ip6_rt_cache_alloc(struct rt6_info *ort,
 	return rt;
 }
 
+static struct rt6_info *ip6_rt_pcpu_alloc(struct rt6_info *rt)
+{
+	struct rt6_info *pcpu_rt;
+
+	pcpu_rt = __ip6_dst_alloc(dev_net(rt->dst.dev),
+				  rt->dst.dev, rt->dst.flags,
+				  rt->rt6i_table);
+
+	if (!pcpu_rt)
+		return NULL;
+	ip6_rt_copy_init(pcpu_rt, rt);
+	pcpu_rt->rt6i_protocol = rt->rt6i_protocol;
+	pcpu_rt->rt6i_flags |= RTF_PCPU;
+	return pcpu_rt;
+}
+
+/* It should be called with read_lock_bh(&tb6_lock) acquired */
+static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt)
+{
+	struct rt6_info *pcpu_rt, *prev, **p;
+
+	p = this_cpu_ptr(rt->rt6i_pcpu);
+	pcpu_rt = *p;
+
+	if (pcpu_rt)
+		goto done;
+
+	pcpu_rt = ip6_rt_pcpu_alloc(rt);
+	if (!pcpu_rt) {
+		struct net *net = dev_net(rt->dst.dev);
+
+		pcpu_rt = net->ipv6.ip6_null_entry;
+		goto done;
+	}
+
+	prev = cmpxchg(p, NULL, pcpu_rt);
+	if (prev) {
+		/* If someone did it before us, return prev instead */
+		dst_destroy(&pcpu_rt->dst);
+		pcpu_rt = prev;
+	}
+
+done:
+	dst_hold(&pcpu_rt->dst);
+	rt6_dst_from_metrics_check(pcpu_rt);
+	return pcpu_rt;
+}
+
 static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, int oif,
 				      struct flowi6 *fl6, int flags)
 {
@@ -975,11 +1061,13 @@ redo_rt6_select:
 		}
 	}
 
-	dst_use(&rt->dst, jiffies);
-	read_unlock_bh(&table->tb6_lock);
 
 	if (rt == net->ipv6.ip6_null_entry || (rt->rt6i_flags & RTF_CACHE)) {
-		goto done;
+		dst_use(&rt->dst, jiffies);
+		read_unlock_bh(&table->tb6_lock);
+
+		rt6_dst_from_metrics_check(rt);
+		return rt;
 	} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
 			    !(rt->rt6i_flags & RTF_GATEWAY))) {
 		/* Create a RTF_CACHE clone which will not be
@@ -990,6 +1078,9 @@ redo_rt6_select:
 
 		struct rt6_info *uncached_rt;
 
+		dst_use(&rt->dst, jiffies);
+		read_unlock_bh(&table->tb6_lock);
+
 		uncached_rt = ip6_rt_cache_alloc(rt, &fl6->daddr, NULL);
 		dst_release(&rt->dst);
 
@@ -997,13 +1088,22 @@ redo_rt6_select:
 			rt6_uncached_list_add(uncached_rt);
 		else
 			uncached_rt = net->ipv6.ip6_null_entry;
+
 		dst_hold(&uncached_rt->dst);
 		return uncached_rt;
-	}
 
-done:
-	rt6_dst_from_metrics_check(rt);
-	return rt;
+	} else {
+		/* Get a percpu copy */
+
+		struct rt6_info *pcpu_rt;
+
+		rt->dst.lastuse = jiffies;
+		rt->dst.__use++;
+		pcpu_rt = rt6_get_pcpu_route(rt);
+		read_unlock_bh(&table->tb6_lock);
+
+		return pcpu_rt;
+	}
 }
 
 static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table *table,
@@ -1147,7 +1247,7 @@ static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie)
 
 	rt6_dst_from_metrics_check(rt);
 
-	if (unlikely(dst->flags & DST_NOCACHE))
+	if ((rt->rt6i_flags & RTF_PCPU) || unlikely(dst->flags & DST_NOCACHE))
 		return rt6_dst_from_check(rt, cookie);
 	else
 		return rt6_check(rt, cookie);
-- 
1.8.1
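
The lock-free installation step in rt6_get_pcpu_route() can be illustrated
outside the kernel. What follows is a minimal userspace sketch, not the
kernel code: it assumes C11 atomic_compare_exchange_strong() in place of the
kernel's cmpxchg(), a single slot in place of the per-cpu array, and
hypothetical names (struct route, slot, new_route, get_or_install). It shows
only the "publish my clone unless someone was first, otherwise free mine and
return theirs" logic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rt6_info; not the kernel type. */
struct route {
	int id;
};

/* The kernel keeps one slot per CPU; one slot is enough to show the idea. */
static _Atomic(struct route *) slot;

/* Allocate a candidate clone; returns NULL on allocation failure. */
static struct route *new_route(int id)
{
	struct route *r = malloc(sizeof(*r));

	if (r)
		r->id = id;
	return r;
}

/*
 * Return the route published in the slot, installing `candidate` if the
 * slot is still empty.  The loser of the race frees its own copy and
 * returns the winner's, which is the shape of the cmpxchg() branch in
 * the patch.
 */
static struct route *get_or_install(struct route *candidate)
{
	struct route *expected = NULL;

	assert(candidate != NULL);

	if (atomic_compare_exchange_strong(&slot, &expected, candidate))
		return candidate;	/* slot was empty: ours is published */

	free(candidate);		/* someone was first: drop our copy */
	return expected;		/* on failure, holds the current value */
}
```

Calling get_or_install(new_route(1)) and then get_or_install(new_route(2))
should hand back the same pointer both times, with the second candidate
freed. The sketch leaves out the reference counting (dst_hold()) and the
fallback to ip6_null_entry that the real function performs.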

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (10 preceding siblings ...)
  2015-05-23  3:56 ` [PATCH net-next v5 11/11] ipv6: Create percpu rt6_info Martin KaFai Lau
@ 2015-05-25 17:34 ` David Miller
  2015-05-26 21:20   ` Hannes Frederic Sowa
  2015-07-29  9:25 ` Alexander Holler
  12 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2015-05-25 17:34 UTC (permalink / raw)
  To: kafai; +Cc: netdev, hannes, ja, steffen.klassert, Kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 22 May 2015 20:55:55 -0700

> This series is to avoid creating a RTF_CACHE route whenever we are consulting
> the fib6 tree with a new destination.  Instead, only create RTF_CACHE route
> when we see a pmtu exception.

Looks great, nice work.

Series applied to net-next, thanks!

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-05-25 17:34 ` [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
@ 2015-05-26 21:20   ` Hannes Frederic Sowa
  2015-05-26 21:34     ` Martin KaFai Lau
  0 siblings, 1 reply; 23+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-26 21:20 UTC (permalink / raw)
  To: David Miller, kafai
  Cc: netdev, Julian Anastasov, Steffen Klassert, Kernel-team

On Mon, May 25, 2015, at 19:34, David Miller wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> Date: Fri, 22 May 2015 20:55:55 -0700
> 
> > This series is to avoid creating a RTF_CACHE route whenever we are consulting
> > the fib6 tree with a new destination.  Instead, only create RTF_CACHE route
> > when we see a pmtu exception.
> 
> Looks great, nice work.
> 
> Series applied to net-next, thanks!

I also went over the changes to the last version and such, albeit a bit
late:
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks!

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-05-26 21:20   ` Hannes Frederic Sowa
@ 2015-05-26 21:34     ` Martin KaFai Lau
  0 siblings, 0 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-05-26 21:34 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, netdev, Julian Anastasov, Steffen Klassert, Kernel-team

On Tue, May 26, 2015 at 11:20:53PM +0200, Hannes Frederic Sowa wrote:
> I also went over the changes to the last version and such, albeit a bit
> late:
> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Thanks for your help and review, Hannes!

--Martin

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
                   ` (11 preceding siblings ...)
  2015-05-25 17:34 ` [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
@ 2015-07-29  9:25 ` Alexander Holler
  2015-07-30 11:57   ` Alexander Holler
  12 siblings, 1 reply; 23+ messages in thread
From: Alexander Holler @ 2015-07-29  9:25 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

Am 23.05.2015 um 05:55 schrieb Martin KaFai Lau:

> This series is to avoid creating a RTF_CACHE route whenever we are consulting
> the fib6 tree with a new destination.  Instead, only create RTF_CACHE route
> when we see a pmtu exception.

That even helps on systems without an IPv6 connection to the world,
because it avoids the IPv6 route add/delete pairs that used to occur
whenever an IPv6 connection was attempted (e.g. by Happy Eyeballs
algorithms).

I think that's worse a laud. thanks.

Alexander Holler

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-07-29  9:25 ` Alexander Holler
@ 2015-07-30 11:57   ` Alexander Holler
  2015-08-15  7:48     ` Alexander Holler
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Holler @ 2015-07-30 11:57 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team

Am 29.07.2015 um 11:25 schrieb Alexander Holler:
> Am 23.05.2015 um 05:55 schrieb Martin KaFai Lau:
>
>> This series is to avoid creating a RTF_CACHE route whenever we are
>> consulting
>> the fib6 tree with a new destination.  Instead, only create RTF_CACHE
>> route
>> when we see a pmtu exception.
>
> That even helps on systems without an IPv6 connection to the world,
> because it avoids the IPv6 route add/delete pairs that used to occur
> whenever an IPv6 connection was attempted (e.g. by Happy Eyeballs
> algorithms).
>
> I think that's worse a laud. thanks.

Of course, I meant worth. Sorry, but the left part of my brain sometimes
seems to be in a (maybe forced) power save mode. ;)

Also, I wonder how the previous algorithm got into the kernel at all, or
why it wasn't fixed earlier. Anyway, it's great that someone took the
time to fix that annoying behaviour (I've had it on my radar for quite
some time).

Thanks again

> Alexander Holler

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-07-30 11:57   ` Alexander Holler
@ 2015-08-15  7:48     ` Alexander Holler
  2015-08-17  9:43       ` Alexander Holler
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Holler @ 2015-08-15  7:48 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team, stable, linux-kernel

Am 30.07.2015 um 13:57 schrieb Alexander Holler:
> Am 29.07.2015 um 11:25 schrieb Alexander Holler:
>> Am 23.05.2015 um 05:55 schrieb Martin KaFai Lau:
>>
>>> This series is to avoid creating a RTF_CACHE route whenever we are
>>> consulting
>>> the fib6 tree with a new destination.  Instead, only create RTF_CACHE
>>> route
>>> when we see a pmtu exception.
>>
>> That even helps on systems without an IPv6 connection to the world,
>> because it avoids the IPv6 route add/delete pairs that used to occur
>> whenever an IPv6 connection was attempted (e.g. by Happy Eyeballs
>> algorithms).
>>
>> I think that's worse a laud. thanks.
>
> Of course, I meant worth. Sorry, but the left part of my brain sometimes
> seems to be in a (maybe forced) power save mode. ;)
>
> Also, I wonder how the previous algorithm got into the kernel at all, or
> why it wasn't fixed earlier. Anyway, it's great that someone took the
> time to fix that annoying behaviour (I've had it on my radar for quite
> some time).

To complete the discussion, that "annoying behaviour" is also a big 
information leak.

Because routes aren't considered confidential and aren't subject to any
privacy protection, that broken behaviour enabled *everyone* on the same
system to see *all* the remote IPv6 systems to which connection attempts
have been made.

E.g. I can see the following on a system when browsing to facebook.com 
and google.com:

--------
[aholler@krabat snetmanmon.git]$ ./snetmanmon snetmanmon.conf.simple_example

snetmanmon V1.3-5-g9f06

(C) 2015 Alexander Holler

(...)
New route 2a00:1450:4001:80c::100a (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a03:2880:2130:cf05:face:b00c:0:1 (gateway 
fe80::21f:7bff:feb4:d13, type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1007 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:400f:803::101f (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1008 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1017 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4016:804::200d (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1000 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1016 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:400f:803::1013 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1006 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1018 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4016:804::2009 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
New route 2a00:1450:4001:80c::1005 (gateway fe80::21f:7bff:feb4:d13, 
type v6, scope universe) on interface 'virbr0'
Route 2a00:1450:4001:80c::100a (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a03:2880:2130:cf05:face:b00c:0:1 (gateway 
fe80::21f:7bff:feb4:d13, type v6, scope universe) on interface 'virbr0' 
was deleted
Route 2a00:1450:4001:80c::1000 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1005 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1006 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1007 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1008 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1016 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1017 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4001:80c::1018 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:400f:803::1013 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:400f:803::101f (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4016:804::2009 (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
Route 2a00:1450:4016:804::200d (gateway fe80::21f:7bff:feb4:d13, type 
v6, scope universe) on interface 'virbr0' was deleted
--------
(those deletes happen because I've no IPv6 connection to the outside 
world on that system)

Although this doesn't give me the URLs used (or the user), it gives me
quite a good idea of what happens on a system.
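
To make the visibility concrete: rtnetlink route notifications can be
received by ordinary processes. The following is a minimal sketch under
stated assumptions, not the implementation of the tool shown above. The
names open_route_monitor(), count_route_events() and struct route_events are
hypothetical; the socket calls, the RTMGRP_IPV6_ROUTE group and the NLMSG_*
macros come from the standard Linux netlink headers.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical result type: how many IPv6 routes came and went. */
struct route_events {
	int added;
	int deleted;
};

/*
 * Open a netlink socket subscribed to IPv6 route notifications.
 * Returns the file descriptor, or -1 on error.
 */
static int open_route_monitor(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = RTMGRP_IPV6_ROUTE;	/* multicast group bitmask */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Count IPv6 route add/delete messages in one received datagram. */
static struct route_events count_route_events(void *buf, int len)
{
	struct route_events ev = { 0, 0 };
	struct nlmsghdr *nh;

	assert(buf != NULL);

	for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		struct rtmsg *rtm;

		if (nh->nlmsg_type != RTM_NEWROUTE &&
		    nh->nlmsg_type != RTM_DELROUTE)
			continue;
		/* Skip messages too short to carry a struct rtmsg. */
		if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(*rtm)))
			continue;

		rtm = NLMSG_DATA(nh);
		if (rtm->rtm_family != AF_INET6)
			continue;

		if (nh->nlmsg_type == RTM_NEWROUTE)
			ev.added++;
		else
			ev.deleted++;
	}
	return ev;
}
```

A caller would loop on recv(fd, buf, sizeof(buf), 0) and pass each buffer
and its length to count_route_events(). Decoding the destination address
from the RTA_DST attribute is left out to keep the sketch short.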

Therefore I think it's worth considering backporting this patch series
at least to the current long-term stable kernel (4.1) too.

Regards,

Alexander Holler

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-08-15  7:48     ` Alexander Holler
@ 2015-08-17  9:43       ` Alexander Holler
  2015-08-28  7:36         ` Martin KaFai Lau
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Holler @ 2015-08-17  9:43 UTC (permalink / raw)
  To: Martin KaFai Lau, netdev
  Cc: David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team, stable, linux-kernel

Am 15.08.2015 um 09:48 schrieb Alexander Holler:
> Am 30.07.2015 um 13:57 schrieb Alexander Holler:
>> Am 29.07.2015 um 11:25 schrieb Alexander Holler:
>>> Am 23.05.2015 um 05:55 schrieb Martin KaFai Lau:

> To complete the discussion, that "annoying behaviour" is also a big
> information leak.
>
> Because routes aren't considered confidential and aren't subject to any
> privacy protection, that broken behaviour enabled *everyone* on the same
> system to see *all* the remote IPv6 systems to which connection attempts
> have been made.

Just in case I haven't described the problem I see clearly enough:

"Everyone" means everything (other SW) too, and if "Happy_Eyeballs" 
algorithms are used (see RFC 6555), this also affects systems which only 
have an IPv4 connection to the world, as long as IPv6 is enabled.

That means it affects more than just multiuser systems: the current
behaviour of kernels < 4.2 also renders e.g. the private mode of most
browsers somewhat useless (with regard to protection against other SW
and/or users running on the same system).

That's why I vote to check out if it's possible/reasonable to backport 
this series to the stable kernels.

Regards,

Alexander Holler

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-08-17  9:43       ` Alexander Holler
@ 2015-08-28  7:36         ` Martin KaFai Lau
  2015-08-28  9:27           ` Alexander Holler
  2015-08-28 18:27           ` David Miller
  0 siblings, 2 replies; 23+ messages in thread
From: Martin KaFai Lau @ 2015-08-28  7:36 UTC (permalink / raw)
  To: Alexander Holler
  Cc: netdev, David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team, stable, linux-kernel

On Mon, Aug 17, 2015 at 11:43:20AM +0200, Alexander Holler wrote:
> That's why I vote to check out if it's possible/reasonable to backport this
> series to the stable kernels.
I have backported it to 4.0.y without major issues, so it is possible.

I did try on 3.1x and gave up.

It is a lot of changes, so I don't think it is a good idea for -stable.

Thanks,
Martin

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-08-28  7:36         ` Martin KaFai Lau
@ 2015-08-28  9:27           ` Alexander Holler
  2015-08-28  9:34             ` Alexander Holler
  2015-08-28 18:27           ` David Miller
  1 sibling, 1 reply; 23+ messages in thread
From: Alexander Holler @ 2015-08-28  9:27 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team, stable, linux-kernel

Am 28.08.2015 um 09:36 schrieb Martin KaFai Lau:
> On Mon, Aug 17, 2015 at 11:43:20AM +0200, Alexander Holler wrote:
>> That's why I vote to check out if it's possible/reasonable to backport this
>> series to the stable kernels.
> I have backported to 4.0.y without major issue, so possible.

Sure, as this was likely one of the versions they've used to create the 
patch.

> I did try on 3.1x and gave up.
>
> It is a lot of changes,  so I don't think it is a good idea for -stable.

Depends on what you're expecting from a (stable) kernel.

The patch description mentions what happens when a system deals with a 
lot of other ipv6-systems and that problem is easy to exercise and to value.

Rating the information leak is harder; some people won't even understand
that this might be a problem.

And now look at which kernel-versions are now used in new devices 
(likely something <= 3.10, which is more than two years old), how long 
they will be used, and make a guess about IPv6 usage in 5 years.

Anyway, I have no insight into all the politics happening in the
background (e.g. stuff like the LTSI tree); I just wanted to raise
awareness of that (imho important) patch series.

Regards,

Alexander Holler

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-08-28  9:27           ` Alexander Holler
@ 2015-08-28  9:34             ` Alexander Holler
  0 siblings, 0 replies; 23+ messages in thread
From: Alexander Holler @ 2015-08-28  9:34 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, David Miller, Hannes Frederic Sowa, Julian Anastasov,
	Steffen Klassert, Kernel Team, stable, linux-kernel

Am 28.08.2015 um 11:27 schrieb Alexander Holler:
> Am 28.08.2015 um 09:36 schrieb Martin KaFai Lau:
>> On Mon, Aug 17, 2015 at 11:43:20AM +0200, Alexander Holler wrote:
>>> That's why I vote to check out if it's possible/reasonable to
>>> backport this
>>> series to the stable kernels.
>> I have backported to 4.0.y without major issue, so possible.
>
> Sure, as this was likely one of the versions they've used to create the
> patch.
>
>> I did try on 3.1x and gave up.
>>
>> It is a lot of changes,  so I don't think it is a good idea for -stable.
>
> Depends on what you're expecting from a (stable) kernel.
>
> The patch description mentions what happens when a system deals with a
> lot of other ipv6-systems and that problem is easy to exercise and to
> value.
>
> Rating the information leak is harder, some people even won't understand
> that this might be a problem.
>
> And now look at which kernel-versions are now used in new devices
> (likely something <= 3.10, which is more than two years old), how long
> they will be used, and make a guess about IPv6 usage in 5 years.
>
> Anyway, I have no insight into all the politics happening in the
> background (e.g. stuff like the LTSI tree); I just wanted to raise
> awareness of that (imho important) patch series.

Not to speak about phones, but those are most likely a problem of one 
specific company  ;)

Regards,

Alexander Holler

* Re: [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception
  2015-08-28  7:36         ` Martin KaFai Lau
  2015-08-28  9:27           ` Alexander Holler
@ 2015-08-28 18:27           ` David Miller
  1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2015-08-28 18:27 UTC (permalink / raw)
  To: kafai
  Cc: holler, netdev, hannes, ja, steffen.klassert, Kernel-team,
	stable, linux-kernel

From: Martin KaFai Lau <kafai@fb.com>
Date: Fri, 28 Aug 2015 00:36:38 -0700

> On Mon, Aug 17, 2015 at 11:43:20AM +0200, Alexander Holler wrote:
>> That's why I vote to check out if it's possible/reasonable to backport this
>> series to the stable kernels.
> I have backported to 4.0.y without major issue, so possible.
> 
> I did try on 3.1x and gave up.
> 
> It is a lot of changes,  so I don't think it is a good idea for -stable.

I am absolutely, firmly, against any of this work going into -stable.

It is completely inappropriate, the potential for regressions is
enormous.

Thread overview: 23+ messages
2015-05-23  3:55 [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception Martin KaFai Lau
2015-05-23  3:55 ` [PATCH net-next v5 01/11] ipv6: Clean up ipv6_select_ident() and ip6_fragment() Martin KaFai Lau
2015-05-23  3:55 ` [PATCH net-next v5 02/11] ipv6: Remove external dependency on rt6i_dst and rt6i_src Martin KaFai Lau
2015-05-23  3:55 ` [PATCH net-next v5 03/11] ipv6: Remove external dependency on rt6i_gateway and RTF_ANYCAST Martin KaFai Lau
2015-05-23  3:55 ` [PATCH net-next v5 04/11] ipv6: Combine rt6_alloc_cow and rt6_alloc_clone Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 05/11] ipv6: Only create RTF_CACHE routes after encountering pmtu exception Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 06/11] ipv6: Add rt6_get_cookie() function Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 07/11] ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 08/11] ipv6: Create RTF_CACHE clone when FLOWI_FLAG_KNOWN_NH is set Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 09/11] ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 10/11] ipv6: Break up ip6_rt_copy() Martin KaFai Lau
2015-05-23  3:56 ` [PATCH net-next v5 11/11] ipv6: Create percpu rt6_info Martin KaFai Lau
2015-05-25 17:34 ` [PATCH net-next v5 00/11] ipv6: Only create RTF_CACHE route after encountering pmtu exception David Miller
2015-05-26 21:20   ` Hannes Frederic Sowa
2015-05-26 21:34     ` Martin KaFai Lau
2015-07-29  9:25 ` Alexander Holler
2015-07-30 11:57   ` Alexander Holler
2015-08-15  7:48     ` Alexander Holler
2015-08-17  9:43       ` Alexander Holler
2015-08-28  7:36         ` Martin KaFai Lau
2015-08-28  9:27           ` Alexander Holler
2015-08-28  9:34             ` Alexander Holler
2015-08-28 18:27           ` David Miller
