netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
@ 2015-09-04 23:12 Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:12 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

v2:
- Add patch 4 and 5 to remove the spinlock

v1:
This patch series is to fix the dst refcnt bugs in ip6_tunnel.

Patch 1 and 2 are the prep works.  Patch 3 is the fix.

I can reproduce the bug by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test.  I get the following
trace by adding WARN_ON_ONCE(newrefcnt < 0) to dst_release():

[  312.760432] ------------[ cut here ]------------
[  312.774664] WARNING: CPU: 2 PID: 10263 at net/core/dst.c:288 dst_release+0xf3/0x100()
[  312.776041] Modules linked in: k10temp coretemp hwmon ip6_gre ip6_tunnel tunnel6 ipmi_devintf ipmi_msgh\
andler ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_filter \
ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd grace mptctl netconsole autofs4 rpcsec_gss_krb5 auth_rp\
cgss oid_registry sunrpc ipv6 dm_mod loop iTCO_wdt iTCO_vendor_support serio_raw rtc_cmos pcspkr i2c_i801 \
i2c_core lpc_ich mfd_core ehci_pci ehci_hcd e1000e mlx4_en ptp pps_core vxlan udp_tunnel ip6_udp_tunnel ml\
x4_core sg button ext3 jbd mpt2sas raid_class
[  312.785302] CPU: 2 PID: 10263 Comm: netperf Not tainted 4.2.0-rc8-00046-g4db9b63-dirty #15
[  312.791695] Hardware name: Quanta Freedom /Windmill-EP, BIOS F03_3B04 09/12/2013
[  312.792965]  ffffffff819dca2c ffff8811dfbdf6f8 ffffffff816537de ffff88123788fdb8
[  312.794263]  0000000000000000 ffff8811dfbdf738 ffffffff81052646 ffff8811dfbdf768
[  312.795593]  ffff881203a98180 00000000ffffffff ffff88242927a000 ffff88120a2532e0
[  312.796946] Call Trace:
[  312.797380]  [<ffffffff816537de>] dump_stack+0x45/0x57
[  312.798288]  [<ffffffff81052646>] warn_slowpath_common+0x86/0xc0
[  312.799699]  [<ffffffff8105273a>] warn_slowpath_null+0x1a/0x20
[  312.800852]  [<ffffffff8159f9b3>] dst_release+0xf3/0x100
[  312.801834]  [<ffffffffa03f1308>] ip6_tnl_dst_store+0x48/0x70 [ip6_tunnel]
[  312.803738]  [<ffffffffa03fd0b6>] ip6gre_xmit2+0x536/0x720 [ip6_gre]
[  312.804774]  [<ffffffffa03fd40a>] ip6gre_tunnel_xmit+0x16a/0x410 [ip6_gre]
[  312.805986]  [<ffffffff8159934b>] dev_hard_start_xmit+0x23b/0x390
[  312.808810]  [<ffffffff815a2f5f>] ? neigh_destroy+0xef/0x140
[  312.809843]  [<ffffffff81599a6c>] __dev_queue_xmit+0x48c/0x4f0
[  312.813931]  [<ffffffff81599ae3>] dev_queue_xmit_sk+0x13/0x20
[  312.814993]  [<ffffffff815a0832>] neigh_direct_output+0x12/0x20
[  312.817448]  [<ffffffffa021d633>] ip6_finish_output2+0x183/0x460 [ipv6]
[  312.818762]  [<ffffffff81306fc5>] ? find_next_bit+0x15/0x20
[  312.819671]  [<ffffffffa021fd79>] ip6_finish_output+0x89/0xe0 [ipv6]
[  312.820720]  [<ffffffffa021fe14>] ip6_output+0x44/0xe0 [ipv6]
[  312.821762]  [<ffffffff815c8809>] ? nf_hook_slow+0x69/0xc0
[  312.823123]  [<ffffffffa021d232>] ip6_xmit+0x242/0x4c0 [ipv6]
[  312.824073]  [<ffffffffa021c9f0>] ? ac6_proc_exit+0x20/0x20 [ipv6]
[  312.825116]  [<ffffffffa024c751>] inet6_csk_xmit+0x61/0xa0 [ipv6]
[  312.826127]  [<ffffffff815eb590>] tcp_transmit_skb+0x4f0/0x9b0
[  312.827441]  [<ffffffff815ed267>] tcp_connect+0x637/0x7a0
[  312.828327]  [<ffffffffa0245906>] tcp_v6_connect+0x2d6/0x550 [ipv6]
[  312.829581]  [<ffffffff81606f05>] __inet_stream_connect+0x95/0x2f0
[  312.830600]  [<ffffffff810ae13a>] ? hrtimer_try_to_cancel+0x1a/0xf0
[  312.833456]  [<ffffffff812fba19>] ? timerqueue_add+0x59/0xb0
[  312.834407]  [<ffffffff81607198>] inet_stream_connect+0x38/0x50
[  312.835886]  [<ffffffff8157cb17>] SYSC_connect+0xb7/0xf0
[  312.840035]  [<ffffffff810af6d3>] ? do_setitimer+0x1b3/0x200
[  312.840983]  [<ffffffff810af75a>] ? alarm_setitimer+0x3a/0x70
[  312.841941]  [<ffffffff8157d7ae>] SyS_connect+0xe/0x10
[  312.842818]  [<ffffffff81659297>] entry_SYSCALL_64_fastpath+0x12/0x6a
[  312.844206] ---[ end trace 43f3ecd86c3b1313 ]---

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH RFC v2 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes
  2015-09-04 23:12 [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
@ 2015-09-04 23:12 ` Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:12 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel.

This patch refactors some common init codes used by both
ip6gre_tunnel_init and ip6gre_tap_init.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv6/ip6_gre.c | 42 +++++++++++++++++++++++++-----------------
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 7b5758b..be6b0b9 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1244,7 +1244,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
 	netif_keep_dst(dev);
 }
 
-static int ip6gre_tunnel_init(struct net_device *dev)
+static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
 
@@ -1253,6 +1253,25 @@ static int ip6gre_tunnel_init(struct net_device *dev)
 	tunnel->dev = dev;
 	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
+	dev->iflink = tunnel->parms.link;
+
+	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int ip6gre_tunnel_init(struct net_device *dev)
+{
+	struct ip6_tnl *tunnel;
+	int ret;
+
+	ret = ip6gre_tunnel_init_common(dev);
+	if (ret)
+		return ret;
+
+	tunnel = netdev_priv(dev);
 
 	memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr));
 	memcpy(dev->broadcast, &tunnel->parms.raddr, sizeof(struct in6_addr));
@@ -1260,12 +1279,6 @@ static int ip6gre_tunnel_init(struct net_device *dev)
 	if (ipv6_addr_any(&tunnel->parms.raddr))
 		dev->header_ops = &ip6gre_header_ops;
 
-	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-	if (!dev->tstats)
-		return -ENOMEM;
-
-	dev->iflink = tunnel->parms.link;
-
 	return 0;
 }
 
@@ -1461,21 +1474,16 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 static int ip6gre_tap_init(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
+	int ret;
 
-	tunnel = netdev_priv(dev);
+	ret = ip6gre_tunnel_init_common(dev);
+	if (ret)
+		return ret;
 
-	tunnel->dev = dev;
-	tunnel->net = dev_net(dev);
-	strcpy(tunnel->parms.name, dev->name);
+	tunnel = netdev_priv(dev);
 
 	ip6gre_tnl_link_config(tunnel, 1);
 
-	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-	if (!dev->tstats)
-		return -ENOMEM;
-
-	dev->iflink = tunnel->parms.link;
-
 	return 0;
 }
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH RFC v2 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel
  2015-09-04 23:12 [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
@ 2015-09-04 23:12 ` Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 3/5] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:12 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

It is a prep work to fix the dst_entry refcnt bugs in
ip6_tunnel.

This patch rename:
1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
   reflect that it will take a dst refcnt in the next patch.
2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
   conventional name matching with ip6_tnl_dst_get().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ip6_tunnel.h |  4 ++--
 net/ipv6/ip6_gre.c       |  4 ++--
 net/ipv6/ip6_tunnel.c    | 12 ++++++------
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 76c091b..6ebccce 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -60,9 +60,9 @@ struct ipv6_tlv_tnl_enc_lim {
 	__u8 encap_limit;	/* tunnel encapsulation limit   */
 } __packed;
 
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t);
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
 void ip6_tnl_dst_reset(struct ip6_tnl *t);
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst);
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
 int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
 		const struct in6_addr *raddr);
 int ip6_tnl_xmit_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index be6b0b9..4903c9a 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -633,7 +633,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 	}
 
 	if (!fl6->flowi6_mark)
-		dst = ip6_tnl_dst_check(tunnel);
+		dst = ip6_tnl_dst_get(tunnel);
 
 	if (!dst) {
 		ndst = ip6_route_output(net, NULL, fl6);
@@ -762,7 +762,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 
 	ip6tunnel_xmit(skb, dev);
 	if (ndst)
-		ip6_tnl_dst_store(tunnel, ndst);
+		ip6_tnl_dst_set(tunnel, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 96abdd0..2c5a1e3 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -132,7 +132,7 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
 {
 	struct dst_entry *dst = t->dst_cache;
 
@@ -145,7 +145,7 @@ struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
 
 	return dst;
 }
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_check);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
 
 void ip6_tnl_dst_reset(struct ip6_tnl *t)
 {
@@ -154,14 +154,14 @@ void ip6_tnl_dst_reset(struct ip6_tnl *t)
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *) dst;
 	t->dst_cookie = rt6_get_cookie(rt);
 	dst_release(t->dst_cache);
 	t->dst_cache = dst;
 }
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_store);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
 
 /**
  * ip6_tnl_lookup - fetch tunnel matching the end-point addresses
@@ -1016,7 +1016,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
 		neigh_release(neigh);
 	} else if (!fl6->flowi6_mark)
-		dst = ip6_tnl_dst_check(t);
+		dst = ip6_tnl_dst_get(t);
 
 	if (!ip6_tnl_xmit_ctl(t, &fl6->saddr, &fl6->daddr))
 		goto tx_err_link_failure;
@@ -1108,7 +1108,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 	ipv6h->daddr = fl6->daddr;
 	ip6tunnel_xmit(skb, dev);
 	if (ndst)
-		ip6_tnl_dst_store(t, ndst);
+		ip6_tnl_dst_set(t, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH RFC v2 net 3/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-04 23:12 [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
@ 2015-09-04 23:12 ` Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel Martin KaFai Lau
  4 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:12 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

Problems in the current dst_entry cache in the ip6_tunnel:

1. ip6_tnl_dst_set is racy.  There is no lock to protect it:
   - One major problem is that the dst refcnt gets messed up. F.e.
     the same dst_cache can be released multiple times and then
     triggering the infamous dst refcnt < 0 warning message.
   - Another issue is the inconsistency between dst_cache and
     dst_cookie.

   It can be reproduced by adding and removing the ip6gre tunnel
   while running a super_netperf TCP_CRR test.

2. ip6_tnl_dst_get does not take the dst refcnt before returning
   the dst.

This patch:
1. Create a percpu dst_entry cache in ip6_tnl
2. Use a spinlock to protect the dst_cache operations
3. ip6_tnl_dst_get always takes the dst refcnt before returning

Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Conflicts:
	net/ipv6/ip6_gre.c
	net/ipv6/ip6_tunnel.c
---
 include/net/ip6_tunnel.h |  11 ++++-
 net/ipv6/ip6_gre.c       |  38 ++++++++-------
 net/ipv6/ip6_tunnel.c    | 122 +++++++++++++++++++++++++++++++++++------------
 3 files changed, 123 insertions(+), 48 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 6ebccce..39830c5 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -32,6 +32,12 @@ struct __ip6_tnl_parm {
 	__be32			o_key;
 };
 
+struct ip6_tnl_dst {
+	spinlock_t lock;
+	struct dst_entry *dst;
+	u32 cookie;
+};
+
 /* IPv6 tunnel */
 struct ip6_tnl {
 	struct ip6_tnl __rcu *next;	/* next tunnel in list */
@@ -39,8 +45,7 @@ struct ip6_tnl {
 	struct net *net;	/* netns for packet i/o */
 	struct __ip6_tnl_parm parms;	/* tunnel configuration parameters */
 	struct flowi fl;	/* flowi template for xmit */
-	struct dst_entry *dst_cache;    /* cached dst */
-	u32 dst_cookie;
+	struct ip6_tnl_dst __percpu *dst_cache;	/* cached dst */
 
 	int err_count;
 	unsigned long err_time;
@@ -61,6 +66,8 @@ struct ipv6_tlv_tnl_enc_lim {
 } __packed;
 
 struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
+int ip6_tnl_dst_init(struct ip6_tnl *t);
+void ip6_tnl_dst_destroy(struct ip6_tnl *t);
 void ip6_tnl_dst_reset(struct ip6_tnl *t);
 void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
 int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4903c9a..1425b21 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -636,17 +636,17 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 		dst = ip6_tnl_dst_get(tunnel);
 
 	if (!dst) {
-		ndst = ip6_route_output(net, NULL, fl6);
+		dst = ip6_route_output(net, NULL, fl6);
 
-		if (ndst->error)
+		if (dst->error)
 			goto tx_err_link_failure;
-		ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
-		if (IS_ERR(ndst)) {
-			err = PTR_ERR(ndst);
-			ndst = NULL;
+		dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+		if (IS_ERR(dst)) {
+			err = PTR_ERR(dst);
+			dst = NULL;
 			goto tx_err_link_failure;
 		}
-		dst = ndst;
+		ndst = dst;
 	}
 
 	tdev = dst->dev;
@@ -701,12 +701,9 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 		skb = new_skb;
 	}
 
-	if (fl6->flowi6_mark) {
-		skb_dst_set(skb, dst);
-		ndst = NULL;
-	} else {
-		skb_dst_set_noref(skb, dst);
-	}
+	if (!fl6->flowi6_mark && ndst)
+		ip6_tnl_dst_set(tunnel, ndst);
+	skb_dst_set(skb, dst);
 
 	proto = NEXTHDR_GRE;
 	if (encap_limit >= 0) {
@@ -761,14 +758,12 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 	skb_set_inner_protocol(skb, protocol);
 
 	ip6tunnel_xmit(skb, dev);
-	if (ndst)
-		ip6_tnl_dst_set(tunnel, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
 	dst_link_failure(skb);
 tx_err_dst_release:
-	dst_release(ndst);
+	dst_release(dst);
 	return err;
 }
 
@@ -1221,6 +1216,9 @@ static const struct net_device_ops ip6gre_netdev_ops = {
 
 static void ip6gre_dev_free(struct net_device *dev)
 {
+	struct ip6_tnl *t = netdev_priv(dev);
+
+	ip6_tnl_dst_destroy(t);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -1247,6 +1245,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
+	int ret;
 
 	tunnel = netdev_priv(dev);
 
@@ -1259,6 +1258,13 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
 	if (!dev->tstats)
 		return -ENOMEM;
 
+	ret = ip6_tnl_dst_init(tunnel);
+	if (ret) {
+		free_percpu(dev->tstats);
+		dev->tstats = NULL;
+		return ret;
+	}
+
 	return 0;
 }
 
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 2c5a1e3..37d6874 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -132,37 +132,90 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
+static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+				      struct dst_entry *dst)
 {
-	struct dst_entry *dst = t->dst_cache;
-
-	if (dst && dst->obsolete &&
-	    dst->ops->check(dst, t->dst_cookie) == NULL) {
-		t->dst_cache = NULL;
-		dst_release(dst);
-		return NULL;
+	dst_release(idst->dst);
+	if (dst) {
+		dst_hold(dst);
+		idst->cookie = rt6_get_cookie((struct rt6_info *)dst);
+	} else {
+		idst->cookie = 0;
 	}
+	idst->dst = dst;
+}
+
+static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+				    struct dst_entry *dst)
+{
 
+	spin_lock_bh(&idst->lock);
+	__ip6_tnl_per_cpu_dst_set(idst, dst);
+	spin_unlock_bh(&idst->lock);
+}
+
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
+{
+	struct ip6_tnl_dst *idst;
+	struct dst_entry *dst;
+
+	idst = raw_cpu_ptr(t->dst_cache);
+	spin_lock_bh(&idst->lock);
+	dst = idst->dst;
+	if (dst) {
+		if (!dst->obsolete || dst->ops->check(dst, idst->cookie)) {
+			dst_hold(idst->dst);
+		} else {
+			__ip6_tnl_per_cpu_dst_set(idst, NULL);
+			dst = NULL;
+		}
+	}
+	spin_unlock_bh(&idst->lock);
 	return dst;
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
 
 void ip6_tnl_dst_reset(struct ip6_tnl *t)
 {
-	dst_release(t->dst_cache);
-	t->dst_cache = NULL;
+	int i;
+
+	for_each_possible_cpu(i)
+		ip6_tnl_per_cpu_dst_set(raw_cpu_ptr(t->dst_cache), NULL);
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 
 void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
 {
-	struct rt6_info *rt = (struct rt6_info *) dst;
-	t->dst_cookie = rt6_get_cookie(rt);
-	dst_release(t->dst_cache);
-	t->dst_cache = dst;
+	ip6_tnl_per_cpu_dst_set(raw_cpu_ptr(t->dst_cache), dst);
+
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
 
+void ip6_tnl_dst_destroy(struct ip6_tnl *t)
+{
+	if (!t->dst_cache)
+		return;
+
+	ip6_tnl_dst_reset(t);
+	free_percpu(t->dst_cache);
+}
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_destroy);
+
+int ip6_tnl_dst_init(struct ip6_tnl *t)
+{
+	int i;
+
+	t->dst_cache = alloc_percpu(struct ip6_tnl_dst);
+	if (!t->dst_cache)
+		return -ENOMEM;
+
+	for_each_possible_cpu(i)
+		spin_lock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_init);
+
 /**
  * ip6_tnl_lookup - fetch tunnel matching the end-point addresses
  *   @remote: the address of the tunnel exit-point
@@ -277,6 +330,9 @@ ip6_tnl_unlink(struct ip6_tnl_net *ip6n, struct ip6_tnl *t)
 
 static void ip6_dev_free(struct net_device *dev)
 {
+	struct ip6_tnl *t = netdev_priv(dev);
+
+	ip6_tnl_dst_destroy(t);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -1022,17 +1078,17 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		goto tx_err_link_failure;
 
 	if (!dst) {
-		ndst = ip6_route_output(net, NULL, fl6);
+		dst = ip6_route_output(net, NULL, fl6);
 
-		if (ndst->error)
+		if (dst->error)
 			goto tx_err_link_failure;
-		ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
-		if (IS_ERR(ndst)) {
-			err = PTR_ERR(ndst);
-			ndst = NULL;
+		dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+		if (IS_ERR(dst)) {
+			err = PTR_ERR(dst);
+			dst = NULL;
 			goto tx_err_link_failure;
 		}
-		dst = ndst;
+		ndst = dst;
 	}
 
 	tdev = dst->dev;
@@ -1078,12 +1134,11 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		consume_skb(skb);
 		skb = new_skb;
 	}
-	if (fl6->flowi6_mark) {
-		skb_dst_set(skb, dst);
-		ndst = NULL;
-	} else {
-		skb_dst_set_noref(skb, dst);
-	}
+
+	if (!fl6->flowi6_mark && ndst)
+		ip6_tnl_dst_set(t, ndst);
+	skb_dst_set(skb, dst);
+
 	skb->transport_header = skb->network_header;
 
 	proto = fl6->flowi6_proto;
@@ -1107,14 +1162,12 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 	ipv6h->saddr = fl6->saddr;
 	ipv6h->daddr = fl6->daddr;
 	ip6tunnel_xmit(skb, dev);
-	if (ndst)
-		ip6_tnl_dst_set(t, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
 	dst_link_failure(skb);
 tx_err_dst_release:
-	dst_release(ndst);
+	dst_release(dst);
 	return err;
 }
 
@@ -1573,12 +1626,21 @@ static inline int
 ip6_tnl_dev_init_gen(struct net_device *dev)
 {
 	struct ip6_tnl *t = netdev_priv(dev);
+	int ret;
 
 	t->dev = dev;
 	t->net = dev_net(dev);
 	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
 	if (!dev->tstats)
 		return -ENOMEM;
+
+	ret = ip6_tnl_dst_init(t);
+	if (ret) {
+		free_percpu(dev->tstats);
+		dev->tstats = NULL;
+		return ret;
+	}
+
 	return 0;
 }
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH RFC v2 net 4/5] ipv6: Avoid double dst_free
  2015-09-04 23:12 [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
                   ` (2 preceding siblings ...)
  2015-09-04 23:12 ` [PATCH RFC v2 net 3/5] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
@ 2015-09-04 23:12 ` Martin KaFai Lau
  2015-09-04 23:52   ` Martin KaFai Lau
  2015-09-04 23:12 ` [PATCH RFC v2 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel Martin KaFai Lau
  4 siblings, 1 reply; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:12 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

It is a prep work to get dst freeing from fib tree undergo
a rcu grace period.

The following is a common paradigm:
if (ip6_del_rt(rt))
	dst_free(rt)

which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
1. We don't know the ip6_del_rt(rt) failure is because it
   was not managed by fib tree (DST_NOCACHE) or it had already been
   removed from the fib tree.
2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
   dst_free(rt) has already been called already.  A second
   dst_free(rt) is not always obviously safe.  The rt may have
   been destroyed already.
3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
4. It is a stopper to make dst freeing from fib tree undergo a
   rcu grace period.

This patch is to use a DST_NOCACHE flag to indicate a rt is
managed by the fib tree or not.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv6/addrconf.c |  7 +++----
 net/ipv6/ip6_fib.c  | 11 +++++++++--
 net/ipv6/route.c    |  7 +++++--
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index ba006d9..b6225aa 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4855,13 +4855,12 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 
 			rt = addrconf_get_prefix_route(&ifp->peer_addr, 128,
 						       ifp->idev->dev, 0, 0);
-			if (rt && ip6_del_rt(rt))
-				dst_free(&rt->dst);
+			if (rt)
+				ip6_del_rt(rt);
 		}
 		dst_hold(&ifp->rt->dst);
 
-		if (ip6_del_rt(ifp->rt))
-			dst_free(&ifp->rt->dst);
+		ip6_del_rt(ifp->rt);
 
 		rt_genid_bump_ipv6(net);
 		break;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 6492115..346aa4a 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -897,6 +897,10 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 	int replace_required = 0;
 	int sernum = fib6_new_sernum(info->nl_net);
 
+	if (WARN_ON_ONCE((rt->dst.flags & DST_NOCACHE) &&
+			 !atomic_read(&rt->dst.__refcnt)))
+		return -EINVAL;
+
 	if (info->nlh) {
 		if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
 			allow_create = 0;
@@ -989,6 +993,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
 		fib6_start_gc(info->nl_net, rt);
 		if (!(rt->rt6i_flags & RTF_CACHE))
 			fib6_prune_clones(info->nl_net, pn);
+		rt->dst.flags &= ~DST_NOCACHE;
 	}
 
 out:
@@ -1013,7 +1018,8 @@ out:
 			atomic_inc(&pn->leaf->rt6i_ref);
 		}
 #endif
-		dst_free(&rt->dst);
+		if (!(rt->dst.flags & DST_NOCACHE))
+			dst_free(&rt->dst);
 	}
 	return err;
 
@@ -1024,7 +1030,8 @@ out:
 st_failure:
 	if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
 		fib6_repair_tree(info->nl_net, fn);
-	dst_free(&rt->dst);
+	if (!(rt->dst.flags & DST_NOCACHE))
+		dst_free(&rt->dst);
 	return err;
 #endif
 }
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index bc9ff77..ecc63eb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1313,8 +1313,7 @@ static void ip6_link_failure(struct sk_buff *skb)
 	if (rt) {
 		if (rt->rt6i_flags & RTF_CACHE) {
 			dst_hold(&rt->dst);
-			if (ip6_del_rt(rt))
-				dst_free(&rt->dst);
+			ip6_del_rt(rt);
 		} else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT)) {
 			rt->rt6i_node->fn_sernum = -1;
 		}
@@ -1962,6 +1961,9 @@ static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info)
 	if (rt == net->ipv6.ip6_null_entry) {
 		err = -ENOENT;
 		goto out;
+	} else if (rt->dst.flags & DST_NOCACHE) {
+		err = -ENOENT;
+		goto out;
 	}
 
 	table = rt->rt6i_table;
@@ -2445,6 +2447,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 	rt->rt6i_dst.addr = *addr;
 	rt->rt6i_dst.plen = 128;
 	rt->rt6i_table = fib6_get_table(net, RT6_TABLE_LOCAL);
+	rt->dst.flags |= DST_NOCACHE;
 
 	atomic_set(&rt->dst.__refcnt, 1);
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH RFC v2 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel
  2015-09-04 23:12 [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
                   ` (3 preceding siblings ...)
  2015-09-04 23:12 ` [PATCH RFC v2 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
@ 2015-09-04 23:12 ` Martin KaFai Lau
  4 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:12 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

This patch uses a seqlock to ensure consistency between idst->dst and
idst->cookie.  It also makes dst freeing from fib tree to undergo a
rcu grace period.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ip6_tunnel.h |  4 ++--
 net/ipv6/ip6_fib.c       |  9 +++++++--
 net/ipv6/ip6_tunnel.c    | 47 ++++++++++++++++++++++++-----------------------
 3 files changed, 33 insertions(+), 27 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 39830c5..ff17487 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -33,8 +33,8 @@ struct __ip6_tnl_parm {
 };
 
 struct ip6_tnl_dst {
-	spinlock_t lock;
-	struct dst_entry *dst;
+	seqlock_t lock;
+	struct dst_entry __rcu *dst;
 	u32 cookie;
 };
 
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 346aa4a..6d99460 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -154,6 +154,11 @@ static void node_free(struct fib6_node *fn)
 	kmem_cache_free(fib6_node_kmem, fn);
 }
 
+static void rt6_rcu_free(struct rt6_info *rt)
+{
+	call_rcu(&rt->dst.rcu_head, dst_rcu_free);
+}
+
 static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
 {
 	int cpu;
@@ -168,7 +173,7 @@ static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
 		ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
 		pcpu_rt = *ppcpu_rt;
 		if (pcpu_rt) {
-			dst_free(&pcpu_rt->dst);
+			rt6_rcu_free(pcpu_rt);
 			*ppcpu_rt = NULL;
 		}
 	}
@@ -180,7 +185,7 @@ static void rt6_release(struct rt6_info *rt)
 {
 	if (atomic_dec_and_test(&rt->rt6i_ref)) {
 		rt6_free_pcpu(rt);
-		dst_free(&rt->dst);
+		rt6_rcu_free(rt);
 	}
 }
 
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 37d6874..e3cd6dbb 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -132,9 +132,10 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
-				      struct dst_entry *dst)
+static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+				    struct dst_entry *dst)
 {
+	write_seqlock_bh(&idst->lock);
 	dst_release(idst->dst);
 	if (dst) {
 		dst_hold(dst);
@@ -142,35 +143,35 @@ static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
 	} else {
 		idst->cookie = 0;
 	}
-	idst->dst = dst;
-}
-
-static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
-				    struct dst_entry *dst)
-{
-
-	spin_lock_bh(&idst->lock);
-	__ip6_tnl_per_cpu_dst_set(idst, dst);
-	spin_unlock_bh(&idst->lock);
+	rcu_assign_pointer(idst->dst, dst);
+	write_sequnlock_bh(&idst->lock);
 }
 
 struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
 {
 	struct ip6_tnl_dst *idst;
 	struct dst_entry *dst;
+	unsigned int seq;
+	u32 cookie;
 
 	idst = raw_cpu_ptr(t->dst_cache);
-	spin_lock_bh(&idst->lock);
-	dst = idst->dst;
-	if (dst) {
-		if (!dst->obsolete || dst->ops->check(dst, idst->cookie)) {
-			dst_hold(idst->dst);
-		} else {
-			__ip6_tnl_per_cpu_dst_set(idst, NULL);
-			dst = NULL;
-		}
+
+	rcu_read_lock();
+	do {
+		seq = read_seqbegin(&idst->lock);
+		dst = rcu_dereference(idst->dst);
+		cookie = idst->cookie;
+	} while (read_seqretry(&idst->lock, seq));
+
+	if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+		dst = NULL;
+	rcu_read_unlock();
+
+	if (dst && dst->obsolete && !dst->ops->check(dst, cookie)) {
+		ip6_tnl_per_cpu_dst_set(idst, NULL);
+		dst_release(dst);
+		dst = NULL;
 	}
-	spin_unlock_bh(&idst->lock);
 	return dst;
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
@@ -210,7 +211,7 @@ int ip6_tnl_dst_init(struct ip6_tnl *t)
 		return -ENOMEM;
 
 	for_each_possible_cpu(i)
-		spin_lock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
+		seqlock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
 
 	return 0;
 }
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH RFC v2 net 4/5] ipv6: Avoid double dst_free
  2015-09-04 23:12 ` [PATCH RFC v2 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
@ 2015-09-04 23:52   ` Martin KaFai Lau
  0 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-04 23:52 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team

On Fri, Sep 04, 2015 at 04:12:41PM -0700, Martin KaFai Lau wrote:
> @@ -1962,6 +1961,9 @@ static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info)
>  	if (rt == net->ipv6.ip6_null_entry) {
>  		err = -ENOENT;
>  		goto out;
> +	} else if (rt->dst.flags & DST_NOCACHE) {
> +		err = -ENOENT;
> +		goto out;
>  	}
These two cases should be merged.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-09-04 23:52 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-04 23:12 [PATCH RFC v2 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
2015-09-04 23:12 ` [PATCH RFC v2 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
2015-09-04 23:12 ` [PATCH RFC v2 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
2015-09-04 23:12 ` [PATCH RFC v2 net 3/5] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
2015-09-04 23:12 ` [PATCH RFC v2 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
2015-09-04 23:52   ` Martin KaFai Lau
2015-09-04 23:12 ` [PATCH RFC v2 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel Martin KaFai Lau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).