netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net 0/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
@ 2015-09-01 18:55 Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 1/3] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-01 18:55 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Kernel Team

This patch series is to fix the dst refcnt bugs in ip6_tunnel.

Patch 1 and 2 are the prep works.  Patch 3 is the fix.

I can reproduce the bug by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test.  I get the following
trace by adding WARN_ON_ONCE(newrefcnt < 0) to dst_release():

[  312.760432] ------------[ cut here ]------------
[  312.774664] WARNING: CPU: 2 PID: 10263 at net/core/dst.c:288 dst_release+0xf3/0x100()
[  312.776041] Modules linked in: k10temp coretemp hwmon ip6_gre ip6_tunnel tunnel6 ipmi_devintf ipmi_ms\
ghandler ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_fil\
ter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd grace mptctl netconsole autofs4 rpcsec_gss_krb5 a\
uth_rpcgss oid_registry sunrpc ipv6 dm_mod loop iTCO_wdt iTCO_vendor_support serio_raw rtc_cmos pcspkr i\
2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd e1000e mlx4_en ptp pps_core vxlan udp_tunnel ip6_udp\
_tunnel mlx4_core sg button ext3 jbd mpt2sas raid_class
[  312.785302] CPU: 2 PID: 10263 Comm: netperf Not tainted 4.2.0-rc8-00046-g4db9b63-dirty #15
[  312.791695] Hardware name: Quanta Freedom /Windmill-EP, BIOS F03_3B04 09/12/2013
[  312.792965]  ffffffff819dca2c ffff8811dfbdf6f8 ffffffff816537de ffff88123788fdb8
[  312.794263]  0000000000000000 ffff8811dfbdf738 ffffffff81052646 ffff8811dfbdf768
[  312.795593]  ffff881203a98180 00000000ffffffff ffff88242927a000 ffff88120a2532e0
[  312.796946] Call Trace:
[  312.797380]  [<ffffffff816537de>] dump_stack+0x45/0x57
[  312.798288]  [<ffffffff81052646>] warn_slowpath_common+0x86/0xc0
[  312.799699]  [<ffffffff8105273a>] warn_slowpath_null+0x1a/0x20
[  312.800852]  [<ffffffff8159f9b3>] dst_release+0xf3/0x100
[  312.801834]  [<ffffffffa03f1308>] ip6_tnl_dst_store+0x48/0x70 [ip6_tunnel]
[  312.803738]  [<ffffffffa03fd0b6>] ip6gre_xmit2+0x536/0x720 [ip6_gre]
[  312.804774]  [<ffffffffa03fd40a>] ip6gre_tunnel_xmit+0x16a/0x410 [ip6_gre]
[  312.805986]  [<ffffffff8159934b>] dev_hard_start_xmit+0x23b/0x390
[  312.808810]  [<ffffffff815a2f5f>] ? neigh_destroy+0xef/0x140
[  312.809843]  [<ffffffff81599a6c>] __dev_queue_xmit+0x48c/0x4f0
[  312.813931]  [<ffffffff81599ae3>] dev_queue_xmit_sk+0x13/0x20
[  312.814993]  [<ffffffff815a0832>] neigh_direct_output+0x12/0x20
[  312.817448]  [<ffffffffa021d633>] ip6_finish_output2+0x183/0x460 [ipv6]
[  312.818762]  [<ffffffff81306fc5>] ? find_next_bit+0x15/0x20
[  312.819671]  [<ffffffffa021fd79>] ip6_finish_output+0x89/0xe0 [ipv6]
[  312.820720]  [<ffffffffa021fe14>] ip6_output+0x44/0xe0 [ipv6]
[  312.821762]  [<ffffffff815c8809>] ? nf_hook_slow+0x69/0xc0
[  312.823123]  [<ffffffffa021d232>] ip6_xmit+0x242/0x4c0 [ipv6]
[  312.824073]  [<ffffffffa021c9f0>] ? ac6_proc_exit+0x20/0x20 [ipv6]
[  312.825116]  [<ffffffffa024c751>] inet6_csk_xmit+0x61/0xa0 [ipv6]
[  312.826127]  [<ffffffff815eb590>] tcp_transmit_skb+0x4f0/0x9b0
[  312.827441]  [<ffffffff815ed267>] tcp_connect+0x637/0x7a0
[  312.828327]  [<ffffffffa0245906>] tcp_v6_connect+0x2d6/0x550 [ipv6]
[  312.829581]  [<ffffffff81606f05>] __inet_stream_connect+0x95/0x2f0
[  312.830600]  [<ffffffff810ae13a>] ? hrtimer_try_to_cancel+0x1a/0xf0
[  312.833456]  [<ffffffff812fba19>] ? timerqueue_add+0x59/0xb0
[  312.834407]  [<ffffffff81607198>] inet_stream_connect+0x38/0x50
[  312.835886]  [<ffffffff8157cb17>] SYSC_connect+0xb7/0xf0
[  312.840035]  [<ffffffff810af6d3>] ? do_setitimer+0x1b3/0x200
[  312.840983]  [<ffffffff810af75a>] ? alarm_setitimer+0x3a/0x70
[  312.841941]  [<ffffffff8157d7ae>] SyS_connect+0xe/0x10
[  312.842818]  [<ffffffff81659297>] entry_SYSCALL_64_fastpath+0x12/0x6a
[  312.844206] ---[ end trace 43f3ecd86c3b1313 ]---

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net 1/3] ipv6: Refactor common ip6gre_tunnel_init codes
  2015-09-01 18:55 [PATCH net 0/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
@ 2015-09-01 18:55 ` Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 2/3] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
  2 siblings, 0 replies; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-01 18:55 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Kernel Team

It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel.

This patch refactors some common init codes used by both
ip6gre_tunnel_init and ip6gre_tap_init.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv6/ip6_gre.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 69f4f68..495eab3 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1244,7 +1244,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
 	netif_keep_dst(dev);
 }
 
-static int ip6gre_tunnel_init(struct net_device *dev)
+static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
 
@@ -1254,16 +1254,30 @@ static int ip6gre_tunnel_init(struct net_device *dev)
 	tunnel->net = dev_net(dev);
 	strcpy(tunnel->parms.name, dev->name);
 
+	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+	if (!dev->tstats)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int ip6gre_tunnel_init(struct net_device *dev)
+{
+	struct ip6_tnl *tunnel;
+	int ret;
+
+	ret = ip6gre_tunnel_init_common(dev);
+	if (ret)
+		return ret;
+
+	tunnel = netdev_priv(dev);
+
 	memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr));
 	memcpy(dev->broadcast, &tunnel->parms.raddr, sizeof(struct in6_addr));
 
 	if (ipv6_addr_any(&tunnel->parms.raddr))
 		dev->header_ops = &ip6gre_header_ops;
 
-	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-	if (!dev->tstats)
-		return -ENOMEM;
-
 	return 0;
 }
 
@@ -1459,19 +1473,16 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
 static int ip6gre_tap_init(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
+	int ret;
 
-	tunnel = netdev_priv(dev);
+	ret = ip6gre_tunnel_init_common(dev);
+	if (ret)
+		return ret;
 
-	tunnel->dev = dev;
-	tunnel->net = dev_net(dev);
-	strcpy(tunnel->parms.name, dev->name);
+	tunnel = netdev_priv(dev);
 
 	ip6gre_tnl_link_config(tunnel, 1);
 
-	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
-	if (!dev->tstats)
-		return -ENOMEM;
-
 	return 0;
 }
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net 2/3] ipv6: Rename the dst_cache helper functions in ip6_tunnel
  2015-09-01 18:55 [PATCH net 0/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 1/3] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
@ 2015-09-01 18:55 ` Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
  2 siblings, 0 replies; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-01 18:55 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Kernel Team

It is a prep work to fix the dst_entry refcnt bugs in
ip6_tunnel.

This patch rename:
1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
   reflect that it will take a dst refcnt in the next patch.
2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
   conventional name matching with ip6_tnl_dst_get().

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ip6_tunnel.h |  4 ++--
 net/ipv6/ip6_gre.c       |  4 ++--
 net/ipv6/ip6_tunnel.c    | 12 ++++++------
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index b8529aa..979b081 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -60,9 +60,9 @@ struct ipv6_tlv_tnl_enc_lim {
 	__u8 encap_limit;	/* tunnel encapsulation limit   */
 } __packed;
 
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t);
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
 void ip6_tnl_dst_reset(struct ip6_tnl *t);
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst);
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
 int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
 		const struct in6_addr *raddr);
 int ip6_tnl_xmit_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 495eab3..2f2cff0 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -634,7 +634,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 	}
 
 	if (!fl6->flowi6_mark)
-		dst = ip6_tnl_dst_check(tunnel);
+		dst = ip6_tnl_dst_get(tunnel);
 
 	if (!dst) {
 		ndst = ip6_route_output(net, NULL, fl6);
@@ -763,7 +763,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 
 	ip6tunnel_xmit(NULL, skb, dev);
 	if (ndst)
-		ip6_tnl_dst_store(tunnel, ndst);
+		ip6_tnl_dst_set(tunnel, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 2e67b66..d2dc9e7 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,7 +126,7 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
 {
 	struct dst_entry *dst = t->dst_cache;
 
@@ -139,7 +139,7 @@ struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
 
 	return dst;
 }
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_check);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
 
 void ip6_tnl_dst_reset(struct ip6_tnl *t)
 {
@@ -148,14 +148,14 @@ void ip6_tnl_dst_reset(struct ip6_tnl *t)
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
 {
 	struct rt6_info *rt = (struct rt6_info *) dst;
 	t->dst_cookie = rt6_get_cookie(rt);
 	dst_release(t->dst_cache);
 	t->dst_cache = dst;
 }
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_store);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
 
 /**
  * ip6_tnl_lookup - fetch tunnel matching the end-point addresses
@@ -1010,7 +1010,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
 		neigh_release(neigh);
 	} else if (!fl6->flowi6_mark)
-		dst = ip6_tnl_dst_check(t);
+		dst = ip6_tnl_dst_get(t);
 
 	if (!ip6_tnl_xmit_ctl(t, &fl6->saddr, &fl6->daddr))
 		goto tx_err_link_failure;
@@ -1102,7 +1102,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 	ipv6h->daddr = fl6->daddr;
 	ip6tunnel_xmit(NULL, skb, dev);
 	if (ndst)
-		ip6_tnl_dst_store(t, ndst);
+		ip6_tnl_dst_set(t, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 18:55 [PATCH net 0/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 1/3] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
  2015-09-01 18:55 ` [PATCH net 2/3] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
@ 2015-09-01 18:55 ` Martin KaFai Lau
  2015-09-01 20:14   ` Eric Dumazet
  2 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-01 18:55 UTC (permalink / raw)
  To: netdev, David S. Miller; +Cc: Kernel Team

Problems in the current dst_entry cache in the ip6_tunnel:

1. ip6_tnl_dst_set is racy.  There is no lock to protect it:
   - One major problem is that the dst refcnt gets messed up. F.e.
     the same dst_cache can be released multiple times and then
     triggering the infamous dst refcnt < 0 warning message.
   - Another issue is the inconsistency between dst_cache and
     dst_cookie.

   It can be reproduced by adding and removing the ip6gre tunnel
   while running a super_netperf TCP_CRR test.

2. In ip6_tnl_xmit2() and ip6gre_xmit2(), the outgoing skb does
   not hold a dst_entry's refcnt.

This patch:
1. Create a percpu dst_entry cache in ip6_tnl
2. Use a spinlock to protect the dst_cache operations
3. The outgoing skb always holds the dst_entry's refcnt

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/net/ip6_tunnel.h |  11 ++++-
 net/ipv6/ip6_gre.c       |  38 ++++++++-------
 net/ipv6/ip6_tunnel.c    | 122 +++++++++++++++++++++++++++++++++++------------
 3 files changed, 123 insertions(+), 48 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 979b081..60b4f40 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -32,6 +32,12 @@ struct __ip6_tnl_parm {
 	__be32			o_key;
 };
 
+struct ip6_tnl_dst {
+	spinlock_t lock;
+	struct dst_entry *dst;
+	u32 cookie;
+};
+
 /* IPv6 tunnel */
 struct ip6_tnl {
 	struct ip6_tnl __rcu *next;	/* next tunnel in list */
@@ -39,8 +45,7 @@ struct ip6_tnl {
 	struct net *net;	/* netns for packet i/o */
 	struct __ip6_tnl_parm parms;	/* tunnel configuration parameters */
 	struct flowi fl;	/* flowi template for xmit */
-	struct dst_entry *dst_cache;    /* cached dst */
-	u32 dst_cookie;
+	struct ip6_tnl_dst __percpu *dst_cache;	/* cached dst */
 
 	int err_count;
 	unsigned long err_time;
@@ -61,6 +66,8 @@ struct ipv6_tlv_tnl_enc_lim {
 } __packed;
 
 struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
+int ip6_tnl_dst_init(struct ip6_tnl *t);
+void ip6_tnl_dst_destroy(struct ip6_tnl *t);
 void ip6_tnl_dst_reset(struct ip6_tnl *t);
 void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
 int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 2f2cff0..76fb4ac 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -637,17 +637,17 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 		dst = ip6_tnl_dst_get(tunnel);
 
 	if (!dst) {
-		ndst = ip6_route_output(net, NULL, fl6);
+		dst = ip6_route_output(net, NULL, fl6);
 
-		if (ndst->error)
+		if (dst->error)
 			goto tx_err_link_failure;
-		ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
-		if (IS_ERR(ndst)) {
-			err = PTR_ERR(ndst);
-			ndst = NULL;
+		dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+		if (IS_ERR(dst)) {
+			err = PTR_ERR(dst);
+			dst = NULL;
 			goto tx_err_link_failure;
 		}
-		dst = ndst;
+		ndst = dst;
 	}
 
 	tdev = dst->dev;
@@ -702,12 +702,9 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 		skb = new_skb;
 	}
 
-	if (fl6->flowi6_mark) {
-		skb_dst_set(skb, dst);
-		ndst = NULL;
-	} else {
-		skb_dst_set_noref(skb, dst);
-	}
+	if (!fl6->flowi6_mark && ndst)
+		ip6_tnl_dst_set(tunnel, ndst);
+	skb_dst_set(skb, dst);
 
 	proto = NEXTHDR_GRE;
 	if (encap_limit >= 0) {
@@ -762,14 +759,12 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
 	skb_set_inner_protocol(skb, protocol);
 
 	ip6tunnel_xmit(NULL, skb, dev);
-	if (ndst)
-		ip6_tnl_dst_set(tunnel, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
 	dst_link_failure(skb);
 tx_err_dst_release:
-	dst_release(ndst);
+	dst_release(dst);
 	return err;
 }
 
@@ -1222,6 +1217,9 @@ static const struct net_device_ops ip6gre_netdev_ops = {
 
 static void ip6gre_dev_free(struct net_device *dev)
 {
+	struct ip6_tnl *t = netdev_priv(dev);
+
+	ip6_tnl_dst_destroy(t);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -1247,6 +1245,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
 static int ip6gre_tunnel_init_common(struct net_device *dev)
 {
 	struct ip6_tnl *tunnel;
+	int ret;
 
 	tunnel = netdev_priv(dev);
 
@@ -1258,6 +1257,13 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
 	if (!dev->tstats)
 		return -ENOMEM;
 
+	ret = ip6_tnl_dst_init(tunnel);
+	if (ret) {
+		free_percpu(dev->tstats);
+		dev->tstats = NULL;
+		return ret;
+	}
+
 	return 0;
 }
 
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index d2dc9e7..97b3e781 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,37 +126,90 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
  * Locking : hash tables are protected by RCU and RTNL
  */
 
-struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
+static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+				      struct dst_entry *dst)
 {
-	struct dst_entry *dst = t->dst_cache;
-
-	if (dst && dst->obsolete &&
-	    !dst->ops->check(dst, t->dst_cookie)) {
-		t->dst_cache = NULL;
-		dst_release(dst);
-		return NULL;
+	dst_release(idst->dst);
+	idst->dst = dst;
+	if (dst) {
+		dst_hold(dst);
+		idst->cookie = rt6_get_cookie((struct rt6_info *)dst);
+	} else {
+		idst->cookie = 0;
 	}
+}
+
+static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+				    struct dst_entry *dst)
+{
 
+	spin_lock_bh(&idst->lock);
+	__ip6_tnl_per_cpu_dst_set(idst, dst);
+	spin_unlock_bh(&idst->lock);
+}
+
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
+{
+	struct dst_entry *dst;
+	struct ip6_tnl_dst *idst;
+
+	idst = raw_cpu_ptr(t->dst_cache);
+	spin_lock_bh(&idst->lock);
+	dst = idst->dst;
+	if (dst) {
+		if (!dst->obsolete || dst->ops->check(dst, idst->cookie)) {
+			dst_hold(idst->dst);
+		} else {
+			__ip6_tnl_per_cpu_dst_set(idst, NULL);
+			dst = NULL;
+		}
+	}
+	spin_unlock_bh(&idst->lock);
 	return dst;
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
 
 void ip6_tnl_dst_reset(struct ip6_tnl *t)
 {
-	dst_release(t->dst_cache);
-	t->dst_cache = NULL;
+	int i;
+
+	for_each_possible_cpu(i)
+		ip6_tnl_per_cpu_dst_set(raw_cpu_ptr(t->dst_cache), NULL);
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
 
 void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
 {
-	struct rt6_info *rt = (struct rt6_info *) dst;
-	t->dst_cookie = rt6_get_cookie(rt);
-	dst_release(t->dst_cache);
-	t->dst_cache = dst;
+	ip6_tnl_per_cpu_dst_set(raw_cpu_ptr(t->dst_cache), dst);
+
 }
 EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
 
+void ip6_tnl_dst_destroy(struct ip6_tnl *t)
+{
+	if (!t->dst_cache)
+		return;
+
+	ip6_tnl_dst_reset(t);
+	free_percpu(t->dst_cache);
+}
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_destroy);
+
+int ip6_tnl_dst_init(struct ip6_tnl *t)
+{
+	int i;
+
+	t->dst_cache = alloc_percpu(struct ip6_tnl_dst);
+	if (!t->dst_cache)
+		return -ENOMEM;
+
+	for_each_possible_cpu(i)
+		spin_lock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_init);
+
 /**
  * ip6_tnl_lookup - fetch tunnel matching the end-point addresses
  *   @remote: the address of the tunnel exit-point
@@ -271,6 +324,9 @@ ip6_tnl_unlink(struct ip6_tnl_net *ip6n, struct ip6_tnl *t)
 
 static void ip6_dev_free(struct net_device *dev)
 {
+	struct ip6_tnl *t = netdev_priv(dev);
+
+	ip6_tnl_dst_destroy(t);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -1016,17 +1072,17 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		goto tx_err_link_failure;
 
 	if (!dst) {
-		ndst = ip6_route_output(net, NULL, fl6);
+		dst = ip6_route_output(net, NULL, fl6);
 
-		if (ndst->error)
+		if (dst->error)
 			goto tx_err_link_failure;
-		ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
-		if (IS_ERR(ndst)) {
-			err = PTR_ERR(ndst);
-			ndst = NULL;
+		dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+		if (IS_ERR(dst)) {
+			err = PTR_ERR(dst);
+			dst = NULL;
 			goto tx_err_link_failure;
 		}
-		dst = ndst;
+		ndst = dst;
 	}
 
 	tdev = dst->dev;
@@ -1072,12 +1128,11 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 		consume_skb(skb);
 		skb = new_skb;
 	}
-	if (fl6->flowi6_mark) {
-		skb_dst_set(skb, dst);
-		ndst = NULL;
-	} else {
-		skb_dst_set_noref(skb, dst);
-	}
+
+	if (!fl6->flowi6_mark && ndst)
+		ip6_tnl_dst_set(t, ndst);
+	skb_dst_set(skb, dst);
+
 	skb->transport_header = skb->network_header;
 
 	proto = fl6->flowi6_proto;
@@ -1101,14 +1156,12 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
 	ipv6h->saddr = fl6->saddr;
 	ipv6h->daddr = fl6->daddr;
 	ip6tunnel_xmit(NULL, skb, dev);
-	if (ndst)
-		ip6_tnl_dst_set(t, ndst);
 	return 0;
 tx_err_link_failure:
 	stats->tx_carrier_errors++;
 	dst_link_failure(skb);
 tx_err_dst_release:
-	dst_release(ndst);
+	dst_release(dst);
 	return err;
 }
 
@@ -1573,12 +1626,21 @@ static inline int
 ip6_tnl_dev_init_gen(struct net_device *dev)
 {
 	struct ip6_tnl *t = netdev_priv(dev);
+	int ret;
 
 	t->dev = dev;
 	t->net = dev_net(dev);
 	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
 	if (!dev->tstats)
 		return -ENOMEM;
+
+	ret = ip6_tnl_dst_init(t);
+	if (ret) {
+		free_percpu(dev->tstats);
+		dev->tstats = NULL;
+		return ret;
+	}
+
 	return 0;
 }
 
-- 
1.8.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 18:55 ` [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
@ 2015-09-01 20:14   ` Eric Dumazet
  2015-09-01 20:55     ` Martin KaFai Lau
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Eric Dumazet @ 2015-09-01 20:14 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, David S. Miller, Kernel Team

On Tue, 2015-09-01 at 11:55 -0700, Martin KaFai Lau wrote:
> Problems in the current dst_entry cache in the ip6_tunnel:
> 
> 1. ip6_tnl_dst_set is racy.  There is no lock to protect it:
>    - One major problem is that the dst refcnt gets messed up. F.e.
>      the same dst_cache can be released multiple times and then
>      triggering the infamous dst refcnt < 0 warning message.
>    - Another issue is the inconsistency between dst_cache and
>      dst_cookie.
> 
>    It can be reproduced by adding and removing the ip6gre tunnel
>    while running a super_netperf TCP_CRR test.
> 
> 2. In ip6_tnl_xmit2() and ip6gre_xmit2(), the outgoing skb does
>    not hold a dst_entry's refcnt.

It should not be a problem. refcnt is taken when/if necessary (skb
queued on a qdisc for example)

We have other uses of skb_dst_set_noref()

Please describe the problem ?

> 
> This patch:
> 1. Create a percpu dst_entry cache in ip6_tnl
OK
> 2. Use a spinlock to protect the dst_cache operations

Well, a seqlock would be better : No need for an atomic operation in
fast path.

> 3. The outgoing skb always holds the dst_entry's refcnt
> 

Not clear why it is needed. Inexpensive thanks to per cpu dst, but
still...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 20:14   ` Eric Dumazet
@ 2015-09-01 20:55     ` Martin KaFai Lau
  2015-09-01 21:26       ` Eric Dumazet
  2015-09-02  1:55     ` Martin KaFai Lau
  2015-09-02 20:58     ` Martin KaFai Lau
  2 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-01 20:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> It should not be a problem. refcnt is taken when/if necessary (skb
> queued on a qdisc for example)
>
> We have other uses of skb_dst_set_noref()
>
> Please describe the problem ?
The current ip6_tnl_dst_get() does not take the dst refcnt.

If the dst is released after ip6_tnl_dst_get() and before
skb_dst_set_noref(), would it cause an issue?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 20:55     ` Martin KaFai Lau
@ 2015-09-01 21:26       ` Eric Dumazet
  2015-09-01 22:25         ` Martin KaFai Lau
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-09-01 21:26 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, David S. Miller, Kernel Team

On Tue, 2015-09-01 at 13:55 -0700, Martin KaFai Lau wrote:
> On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > It should not be a problem. refcnt is taken when/if necessary (skb
> > queued on a qdisc for example)
> >
> > We have other uses of skb_dst_set_noref()
> >
> > Please describe the problem ?
> The current ip6_tnl_dst_get() does not take the dst refcnt.
> 
> If the dst is released after ip6_tnl_dst_get() and before
> skb_dst_set_noref(), would it cause an issue?

We are under rcu here, and a dst in a cache is protected by RCU by
definition.

skb_dst_set_noref() has following debugging clause, does it trigger for
you ?

WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 21:26       ` Eric Dumazet
@ 2015-09-01 22:25         ` Martin KaFai Lau
  2015-09-01 22:38           ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-01 22:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 02:26:58PM -0700, Eric Dumazet wrote:
> On Tue, 2015-09-01 at 13:55 -0700, Martin KaFai Lau wrote:
> > On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > It should not be a problem. refcnt is taken when/if necessary (skb
> > > queued on a qdisc for example)
> > >
> > > We have other uses of skb_dst_set_noref()
> > >
> > > Please describe the problem ?
> > The current ip6_tnl_dst_get() does not take the dst refcnt.
> >
> > If the dst is released after ip6_tnl_dst_get() and before
> > skb_dst_set_noref(), would it cause an issue?
>
> We are under rcu here, and a dst in a cache is protected by RCU by
> definition.
>
> skb_dst_set_noref() has following debugging clause, does it trigger for
> you ?
>
> WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
No. I did not see this.

I am probably missing something.  Do you mean the rcu can
protect the followings:


ip6_tnl_dst_get()
			dst_release()
						dst_free() /* refcnt is 0 */
skb_dst_set_noref()

--------

or for DST_NOCACHE

ip6_tnl_dst_get()
			dst_release() /* refcnt is 0 */
skb_dst_set_noref()
/* ip6tunnel_xmit()
 * or its callees
 * take a dst refcnt
 */

--------

or the above situation does not matter/will not happen?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 22:25         ` Martin KaFai Lau
@ 2015-09-01 22:38           ` Eric Dumazet
  2015-09-02  0:31             ` Martin KaFai Lau
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-09-01 22:38 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, David S. Miller, Kernel Team

On Tue, 2015-09-01 at 15:25 -0700, Martin KaFai Lau wrote:
> On Tue, Sep 01, 2015 at 02:26:58PM -0700, Eric Dumazet wrote:
> > On Tue, 2015-09-01 at 13:55 -0700, Martin KaFai Lau wrote:
> > > On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > > It should not be a problem. refcnt is taken when/if necessary (skb
> > > > queued on a qdisc for example)
> > > >
> > > > We have other uses of skb_dst_set_noref()
> > > >
> > > > Please describe the problem ?
> > > The current ip6_tnl_dst_get() does not take the dst refcnt.
> > >
> > > If the dst is released after ip6_tnl_dst_get() and before
> > > skb_dst_set_noref(), would it cause an issue?
> >
> > We are under rcu here, and a dst in a cache is protected by RCU by
> > definition.
> >
> > skb_dst_set_noref() has following debugging clause, does it trigger for
> > you ?
> >
> > WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
> No. I did not see this.
> 
> I am probably missing something.  Do you mean the rcu can
> protect the followings:
> 
> 
> ip6_tnl_dst_get()
> 			dst_release()
> 						dst_free() /* refcnt is 0 */
> skb_dst_set_noref()
> 

Yes, this is protected by normal rcu rules.

dst wont be freed until all cpus exit their rcu read sections.

> --------
> 
> or for DST_NOCACHE
> 
> ip6_tnl_dst_get()
> 			dst_release() /* refcnt is 0 */
> skb_dst_set_noref()
> /* ip6tunnel_xmit()
>  * or its callees
>  * take a dst refcnt
>  */
> 



> --------
> 
> or the above situation does not matter/will not happen?
> --

You should take a look at following commits for a bit of history

10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a
dbfc4fb7d578d4f224faa6b60deb40804dfdc1b1
f88649721268999bdff09777847080a52004f691
6c7e7610ff6888ea15a901fbcb30c5d461816b34

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 22:38           ` Eric Dumazet
@ 2015-09-02  0:31             ` Martin KaFai Lau
  2015-09-02  0:42               ` Martin KaFai Lau
  0 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02  0:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 03:38:36PM -0700, Eric Dumazet wrote:
> On Tue, 2015-09-01 at 15:25 -0700, Martin KaFai Lau wrote:
> > On Tue, Sep 01, 2015 at 02:26:58PM -0700, Eric Dumazet wrote:
> > > On Tue, 2015-09-01 at 13:55 -0700, Martin KaFai Lau wrote:
> > > > On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > > > It should not be a problem. refcnt is taken when/if necessary (skb
> > > > > queued on a qdisc for example)
> > > > >
> > > > > We have other uses of skb_dst_set_noref()
> > > > >
> > > > > Please describe the problem ?
> > > > The current ip6_tnl_dst_get() does not take the dst refcnt.
> > > >
> > > > If the dst is released after ip6_tnl_dst_get() and before
> > > > skb_dst_set_noref(), would it cause an issue?
> > >
> > > We are under rcu here, and a dst in a cache is protected by RCU by
> > > definition.
> > >
> > > skb_dst_set_noref() has following debugging clause, does it trigger for
> > > you ?
> > >
> > > WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
> > No. I did not see this.
> >
> > I am probably missing something.  Do you mean the rcu can
> > protect the followings:
> >
> >
> > ip6_tnl_dst_get()
> > 			dst_release()
> > 						dst_free() /* refcnt is 0 */
> > skb_dst_set_noref()
> >
>
> Yes, this is protected by normal rcu rules.
>
> dst wont be freed until all cpus exit their rcu read sections.
I can see skb_dst_set_noref() is safe at this point
but what happen after rcu_read_unlock()?

hmmm... Also, I don't see dst_free() is under the rcu contract, like
call_rcu() or synchronize_rcu().

Even it is (like dst_release for DST_NOCACHE dst_entry),  what happen
after the rcu_read_unlock()? Would someone (like qdisc) holds a dst refcnt
to an already/to-be destroyed dst? For DST_NOCACHE, like:

rcu_read_lock()

ip6_tnl_dst_get()
 			dst_release()	/* refcnt is 0 */
			=>call_rcu(dst_destroy)
skb_dst_set_noref()
__dev_queue_xmit()
=>skb_dst_force()
=>__dev_xmit_skb()
=>q->enqueue()

rcu_read_unlock()
/* Here, I am holding a dst refcnt but
 * the dst is already in the next
 * rcu destroy cycle?
 */

> You should take a look at following commits for a bit of history
>
> 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a
> dbfc4fb7d578d4f224faa6b60deb40804dfdc1b1
> f88649721268999bdff09777847080a52004f691
> 6c7e7610ff6888ea15a901fbcb30c5d461816b34
>
Thanks for the pointers.  It seems the ip_tunnel.c and ip_tunnel_core.c is
always doing skb_dst_set() also.  It is something that can also be optimized
in IPv4 or the situation is different in IPv4?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02  0:31             ` Martin KaFai Lau
@ 2015-09-02  0:42               ` Martin KaFai Lau
  2015-09-02  1:02                 ` Martin KaFai Lau
  0 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02  0:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 05:31:44PM -0700, Martin KaFai Lau wrote:
> On Tue, Sep 01, 2015 at 03:38:36PM -0700, Eric Dumazet wrote:
> > On Tue, 2015-09-01 at 15:25 -0700, Martin KaFai Lau wrote:
> > > On Tue, Sep 01, 2015 at 02:26:58PM -0700, Eric Dumazet wrote:
> > > > On Tue, 2015-09-01 at 13:55 -0700, Martin KaFai Lau wrote:
> > > > > On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > > > > It should not be a problem. refcnt is taken when/if necessary (skb
> > > > > > queued on a qdisc for example)
> > > > > >
> > > > > > We have other uses of skb_dst_set_noref()
> > > > > >
> > > > > > Please describe the problem ?
> > > > > The current ip6_tnl_dst_get() does not take the dst refcnt.
> > > > >
> > > > > If the dst is released after ip6_tnl_dst_get() and before
> > > > > skb_dst_set_noref(), would it cause an issue?
> > > >
> > > > We are under rcu here, and a dst in a cache is protected by RCU by
> > > > definition.
> > > >
> > > > skb_dst_set_noref() has following debugging clause, does it trigger for
> > > > you ?
> > > >
> > > > WARN_ON(!rcu_read_lock_held() && !rcu_read_lock_bh_held());
> > > No. I did not see this.
> > >
> > > I am probably missing something.  Do you mean the rcu can
> > > protect the followings:
> > >
> > >
> > > ip6_tnl_dst_get()
> > > 			dst_release()
> > > 						dst_free() /* refcnt is 0 */
> > > skb_dst_set_noref()
> > >
> >
> > Yes, this is protected by normal rcu rules.
> >
> > dst wont be freed until all cpus exit their rcu read sections.
> For DST_NOCACHE, like:
>
> rcu_read_lock()
>
> ip6_tnl_dst_get()
>  			dst_release()	/* refcnt is 0 */
> 			=>call_rcu(dst_destroy)
> skb_dst_set_noref()
> __dev_queue_xmit()
> =>skb_dst_force()
> =>__dev_xmit_skb()
> =>q->enqueue()
>
> rcu_read_unlock()
> /* Here, I am holding a dst refcnt but
>  * the dst is already in the next
>  * rcu destroy cycle?
>  */
I look a closer look at dst_rcu_free() and your commit pointers.  I can see your point
for DST_NOCACHE.

However, dst_free() for not DST_NOCACHE is still an issue, I think.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02  0:42               ` Martin KaFai Lau
@ 2015-09-02  1:02                 ` Martin KaFai Lau
  0 siblings, 0 replies; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02  1:02 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 05:42:00PM -0700, Martin KaFai Lau wrote:
> I look a closer look at dst_rcu_free() and your commit pointers.  I can see your point
> for DST_NOCACHE.
>
> However, dst_free() for not DST_NOCACHE is still an issue, I think.
oops. Ignore this email and continue the discussion in my last email instead.

For DST_NOCACHE, dst_release() is doing call_rcu(dst_destroy_rcu)
and dst_destroy_rcu() is directly calling dst_destroy().   I was confused dst_rcu_free()
was used instead.... time for a break towards the end of the day :p

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 20:14   ` Eric Dumazet
  2015-09-01 20:55     ` Martin KaFai Lau
@ 2015-09-02  1:55     ` Martin KaFai Lau
  2015-09-02 20:58     ` Martin KaFai Lau
  2 siblings, 0 replies; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02  1:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> On Tue, 2015-09-01 at 11:55 -0700, Martin KaFai Lau wrote:
> > Problems in the current dst_entry cache in the ip6_tunnel:
> >
> > 1. ip6_tnl_dst_set is racy.  There is no lock to protect it:
> >    - One major problem is that the dst refcnt gets messed up. F.e.
> >      the same dst_cache can be released multiple times and then
> >      triggering the infamous dst refcnt < 0 warning message.
> >    - Another issue is the inconsistency between dst_cache and
> >      dst_cookie.
> >
> >    It can be reproduced by adding and removing the ip6gre tunnel
> >    while running a super_netperf TCP_CRR test.
> >
> > 2. In ip6_tnl_xmit2() and ip6gre_xmit2(), the outgoing skb does
> >    not hold a dst_entry's refcnt.
>
> It should not be a problem. refcnt is taken when/if necessary (skb
> queued on a qdisc for example)
>
> We have other uses of skb_dst_set_noref()
>
> Please describe the problem ?
After some more thoughts, I think it could be that the commit description is
inaccurate/confusing.  skb_dst_set_noref() is not the _source_ of the
problem per se.  Instead, the ip6_tnl_dst_get() should always bump the
dst refcnt before returning the dst. Using skb_dst_set() instead
of skb_dst_set_noref() here is just a follow-through effect.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-01 20:14   ` Eric Dumazet
  2015-09-01 20:55     ` Martin KaFai Lau
  2015-09-02  1:55     ` Martin KaFai Lau
@ 2015-09-02 20:58     ` Martin KaFai Lau
  2015-09-02 21:30       ` Eric Dumazet
  2 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02 20:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > 2. Use a spinlock to protect the dst_cache operations
>
> Well, a seqlock would be better : No need for an atomic operation in
> fast path.
>
seqlock can ensure consistency between idst->dst and idst->cookie.
However, IPv6 dst destruction is not protected by rcu.  dst_free() is
directly called, like in ip6_fib.c and a few other places.
Hence, atomic_inc_not_zero() cannot be used here because the dst may
have already been kmem_cache_free() when refcnt is 0.  A spinlock is
needed to stop the ip6_tnl_dst_set() side from removing the refcnt.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 20:58     ` Martin KaFai Lau
@ 2015-09-02 21:30       ` Eric Dumazet
  2015-09-02 21:52         ` Martin KaFai Lau
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2015-09-02 21:30 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, David S. Miller, Kernel Team

On Wed, 2015-09-02 at 13:58 -0700, Martin KaFai Lau wrote:
> On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > 2. Use a spinlock to protect the dst_cache operations
> >
> > Well, a seqlock would be better : No need for an atomic operation in
> > fast path.
> >
> seqlock can ensure consistency between idst->dst and idst->cookie.
> However, IPv6 dst destruction is not protected by rcu.  dst_free() is
> directly called, like in ip6_fib.c and a few other places.
> Hence, atomic_inc_not_zero() cannot be used here because the dst may
> have already been kmem_cache_free() when refcnt is 0.

Really ? What about basic rcu rules ?

Object cannot be freed until all cpus have exited their RCU sections.

>   A spinlock is
> needed to stop the ip6_tnl_dst_set() side from removing the refcnt.

Are you telling me RCU should be banished from the kernel ? ;)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 21:30       ` Eric Dumazet
@ 2015-09-02 21:52         ` Martin KaFai Lau
  2015-09-02 22:48           ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02 21:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> Object cannot be freed until all cpus have exited their RCU sections.
You meant the dst_destroy() here will wait for all cpus exited their RCU sections?

static inline void dst_free(struct dst_entry *dst)
{
	if (dst->obsolete > 0)
		return;
	if (!atomic_read(&dst->__refcnt)) {
		dst = dst_destroy(dst);
		if (!dst)
			return;
	}
	__dst_free(dst);
}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 21:52         ` Martin KaFai Lau
@ 2015-09-02 22:48           ` Eric Dumazet
  2015-09-02 23:04             ` David Miller
  2015-09-02 23:10             ` Martin KaFai Lau
  0 siblings, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2015-09-02 22:48 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, David S. Miller, Kernel Team

On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > Object cannot be freed until all cpus have exited their RCU sections.
> You meant the dst_destroy() here will wait for all cpus exited their RCU sections?
> 
> static inline void dst_free(struct dst_entry *dst)
> {
> 	if (dst->obsolete > 0)
> 		return;
> 	if (!atomic_read(&dst->__refcnt)) {
> 		dst = dst_destroy(dst);
> 		if (!dst)
> 			return;
> 	}
> 	__dst_free(dst);
> }

dst_free() is called after RCU grace period, in the case you are
interested in.

Look at dst_rcu_free() and rt_free()

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 22:48           ` Eric Dumazet
@ 2015-09-02 23:04             ` David Miller
  2015-09-02 23:10             ` Martin KaFai Lau
  1 sibling, 0 replies; 21+ messages in thread
From: David Miller @ 2015-09-02 23:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kafai, netdev, kernel-team

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 02 Sep 2015 15:48:57 -0700

> On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
>> On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
>> > Object cannot be freed until all cpus have exited their RCU sections.
>> You meant the dst_destroy() here will wait for all cpus exited their RCU sections?
>> 
>> static inline void dst_free(struct dst_entry *dst)
>> {
>> 	if (dst->obsolete > 0)
>> 		return;
>> 	if (!atomic_read(&dst->__refcnt)) {
>> 		dst = dst_destroy(dst);
>> 		if (!dst)
>> 			return;
>> 	}
>> 	__dst_free(dst);
>> }
> 
> dst_free() is called after RCU grace period, in the case you are
> interested in.
> 
> Look at dst_rcu_free() and rt_free()

For ipv4, this is true, but in ipv6, it is not necessarily done in
this way.  And I think that is the point Martin is trying to make.

If you look, the dst_free() calls in ipv6 are basically synchronous,
it does not use dst_rcu_free().

And thus, the fix is to make ipv6 properly RCU free route entries.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 22:48           ` Eric Dumazet
  2015-09-02 23:04             ` David Miller
@ 2015-09-02 23:10             ` Martin KaFai Lau
  2015-09-02 23:14               ` David Miller
  2015-09-02 23:46               ` Eric Dumazet
  1 sibling, 2 replies; 21+ messages in thread
From: Martin KaFai Lau @ 2015-09-02 23:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David S. Miller, Kernel Team

On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
> On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> > On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > > Object cannot be freed until all cpus have exited their RCU sections.
> > You meant the dst_destroy() here will wait for all cpus exited their RCU sections?
> >
> > static inline void dst_free(struct dst_entry *dst)
> > {
> > 	if (dst->obsolete > 0)
> > 		return;
> > 	if (!atomic_read(&dst->__refcnt)) {
> > 		dst = dst_destroy(dst);
> > 		if (!dst)
> > 			return;
> > 	}
> > 	__dst_free(dst);
> > }
>
> dst_free() is called after RCU grace period, in the case you are
> interested in.
>
> Look at dst_rcu_free() and rt_free()
Yes for IPv4 FIB

Not for IPv6 FIB. F.e. rt6_release()
The IPv6 FIB is protected by rwlock now.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 23:10             ` Martin KaFai Lau
@ 2015-09-02 23:14               ` David Miller
  2015-09-02 23:46               ` Eric Dumazet
  1 sibling, 0 replies; 21+ messages in thread
From: David Miller @ 2015-09-02 23:14 UTC (permalink / raw)
  To: kafai; +Cc: eric.dumazet, netdev, kernel-team

From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 2 Sep 2015 16:10:31 -0700

> On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
>> dst_free() is called after RCU grace period, in the case you are
>> interested in.
>>
>> Look at dst_rcu_free() and rt_free()
> Yes for IPv4 FIB
> 
> Not for IPv6 FIB. F.e. rt6_release()
> The IPv6 FIB is protected by rwlock now.

The FIB tree can use whatever locking scheme it wants, but the
actual route objects need to be released via RCU to fix the
problems you are seeing.

Converting the entire ipv6 FIB tree handling to RCU is not a
prerequisite for this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
  2015-09-02 23:10             ` Martin KaFai Lau
  2015-09-02 23:14               ` David Miller
@ 2015-09-02 23:46               ` Eric Dumazet
  1 sibling, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2015-09-02 23:46 UTC (permalink / raw)
  To: Martin KaFai Lau; +Cc: netdev, David S. Miller, Kernel Team

On Wed, 2015-09-02 at 16:10 -0700, Martin KaFai Lau wrote:
> On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
> > On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> > > On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > > > Object cannot be freed until all cpus have exited their RCU sections.
> > > You meant the dst_destroy() here will wait for all cpus exited their RCU sections?
> > >
> > > static inline void dst_free(struct dst_entry *dst)
> > > {
> > > 	if (dst->obsolete > 0)
> > > 		return;
> > > 	if (!atomic_read(&dst->__refcnt)) {
> > > 		dst = dst_destroy(dst);
> > > 		if (!dst)
> > > 			return;
> > > 	}
> > > 	__dst_free(dst);
> > > }
> >
> > dst_free() is called after RCU grace period, in the case you are
> > interested in.
> >
> > Look at dst_rcu_free() and rt_free()
> Yes for IPv4 FIB
> 
> Not for IPv6 FIB. F.e. rt6_release()
> The IPv6 FIB is protected by rwlock now.

Oh well. I gave you a hint. I was not saying that it was currently used
in IPv6.

Are you telling me that IPv6 needs to continue to use techniques from
1990 ?

Surely we can use modern stuff, like proper RCU and/or seqlocks.

Since you are fixing a day-0 bug, I do not believe there is a particular
hurry to be conservative.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-09-02 23:46 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-01 18:55 [PATCH net 0/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
2015-09-01 18:55 ` [PATCH net 1/3] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
2015-09-01 18:55 ` [PATCH net 2/3] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
2015-09-01 18:55 ` [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
2015-09-01 20:14   ` Eric Dumazet
2015-09-01 20:55     ` Martin KaFai Lau
2015-09-01 21:26       ` Eric Dumazet
2015-09-01 22:25         ` Martin KaFai Lau
2015-09-01 22:38           ` Eric Dumazet
2015-09-02  0:31             ` Martin KaFai Lau
2015-09-02  0:42               ` Martin KaFai Lau
2015-09-02  1:02                 ` Martin KaFai Lau
2015-09-02  1:55     ` Martin KaFai Lau
2015-09-02 20:58     ` Martin KaFai Lau
2015-09-02 21:30       ` Eric Dumazet
2015-09-02 21:52         ` Martin KaFai Lau
2015-09-02 22:48           ` Eric Dumazet
2015-09-02 23:04             ` David Miller
2015-09-02 23:10             ` Martin KaFai Lau
2015-09-02 23:14               ` David Miller
2015-09-02 23:46               ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).