* [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
@ 2015-09-15 21:30 Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
` (5 more replies)
0 siblings, 6 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-15 21:30 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team
v4:
- Fix a compilation error in patch 5 when CONFIG_LOCKDEP is turned on and
re-test it
v3:
- Merge a 'if else if' test in patch 4
- Use rcu_dereference_protected in patch 5 to fix a sparse check when
CONFIG_SPARSE_RCU_POINTER is enabled
v2:
- Add patch 4 and 5 to remove the spinlock
v1:
This patch series is to fix the dst refcnt bugs in ip6_tunnel.
Patch 1 and 2 are the prep works. Patch 3 is the fix.
I can reproduce the bug by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test. I get the following
trace by adding WARN_ON_ONCE(newrefcnt < 0) to dst_release():
[ 312.760432] ------------[ cut here ]------------
[ 312.774664] WARNING: CPU: 2 PID: 10263 at net/core/dst.c:288 dst_release+0xf3/0x100()
[ 312.776041] Modules linked in: k10temp coretemp hwmon ip6_gre ip6_tunnel tunnel6 ipmi_devintf ipmi_ms\
ghandler ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_fil\
ter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd grace mptctl netconsole autofs4 rpcsec_gss_krb5 a\
uth_rpcgss oid_registry sunrpc ipv6 dm_mod loop iTCO_wdt iTCO_vendor_support serio_raw rtc_cmos pcspkr i\
2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd e1000e mlx4_en ptp pps_core vxlan udp_tunnel ip6_udp\
_tunnel mlx4_core sg button ext3 jbd mpt2sas raid_class
[ 312.785302] CPU: 2 PID: 10263 Comm: netperf Not tainted 4.2.0-rc8-00046-g4db9b63-dirty #15
[ 312.791695] Hardware name: Quanta Freedom /Windmill-EP, BIOS F03_3B04 09/12/2013
[ 312.792965] ffffffff819dca2c ffff8811dfbdf6f8 ffffffff816537de ffff88123788fdb8
[ 312.794263] 0000000000000000 ffff8811dfbdf738 ffffffff81052646 ffff8811dfbdf768
[ 312.795593] ffff881203a98180 00000000ffffffff ffff88242927a000 ffff88120a2532e0
[ 312.796946] Call Trace:
[ 312.797380] [<ffffffff816537de>] dump_stack+0x45/0x57
[ 312.798288] [<ffffffff81052646>] warn_slowpath_common+0x86/0xc0
[ 312.799699] [<ffffffff8105273a>] warn_slowpath_null+0x1a/0x20
[ 312.800852] [<ffffffff8159f9b3>] dst_release+0xf3/0x100
[ 312.801834] [<ffffffffa03f1308>] ip6_tnl_dst_store+0x48/0x70 [ip6_tunnel]
[ 312.803738] [<ffffffffa03fd0b6>] ip6gre_xmit2+0x536/0x720 [ip6_gre]
[ 312.804774] [<ffffffffa03fd40a>] ip6gre_tunnel_xmit+0x16a/0x410 [ip6_gre]
[ 312.805986] [<ffffffff8159934b>] dev_hard_start_xmit+0x23b/0x390
[ 312.808810] [<ffffffff815a2f5f>] ? neigh_destroy+0xef/0x140
[ 312.809843] [<ffffffff81599a6c>] __dev_queue_xmit+0x48c/0x4f0
[ 312.813931] [<ffffffff81599ae3>] dev_queue_xmit_sk+0x13/0x20
[ 312.814993] [<ffffffff815a0832>] neigh_direct_output+0x12/0x20
[ 312.817448] [<ffffffffa021d633>] ip6_finish_output2+0x183/0x460 [ipv6]
[ 312.818762] [<ffffffff81306fc5>] ? find_next_bit+0x15/0x20
[ 312.819671] [<ffffffffa021fd79>] ip6_finish_output+0x89/0xe0 [ipv6]
[ 312.820720] [<ffffffffa021fe14>] ip6_output+0x44/0xe0 [ipv6]
[ 312.821762] [<ffffffff815c8809>] ? nf_hook_slow+0x69/0xc0
[ 312.823123] [<ffffffffa021d232>] ip6_xmit+0x242/0x4c0 [ipv6]
[ 312.824073] [<ffffffffa021c9f0>] ? ac6_proc_exit+0x20/0x20 [ipv6]
[ 312.825116] [<ffffffffa024c751>] inet6_csk_xmit+0x61/0xa0 [ipv6]
[ 312.826127] [<ffffffff815eb590>] tcp_transmit_skb+0x4f0/0x9b0
[ 312.827441] [<ffffffff815ed267>] tcp_connect+0x637/0x7a0
[ 312.828327] [<ffffffffa0245906>] tcp_v6_connect+0x2d6/0x550 [ipv6]
[ 312.829581] [<ffffffff81606f05>] __inet_stream_connect+0x95/0x2f0
[ 312.830600] [<ffffffff810ae13a>] ? hrtimer_try_to_cancel+0x1a/0xf0
[ 312.833456] [<ffffffff812fba19>] ? timerqueue_add+0x59/0xb0
[ 312.834407] [<ffffffff81607198>] inet_stream_connect+0x38/0x50
[ 312.835886] [<ffffffff8157cb17>] SYSC_connect+0xb7/0xf0
[ 312.840035] [<ffffffff810af6d3>] ? do_setitimer+0x1b3/0x200
[ 312.840983] [<ffffffff810af75a>] ? alarm_setitimer+0x3a/0x70
[ 312.841941] [<ffffffff8157d7ae>] SyS_connect+0xe/0x10
[ 312.842818] [<ffffffff81659297>] entry_SYSCALL_64_fastpath+0x12/0x6a
[ 312.844206] ---[ end trace 43f3ecd86c3b1313 ]---
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v4 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
@ 2015-09-15 21:30 ` Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-15 21:30 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team
It is a prep work to fix the dst_entry refcnt bugs in ip6_tunnel.
This patch refactors some common init codes used by both
ip6gre_tunnel_init and ip6gre_tap_init.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
net/ipv6/ip6_gre.c | 37 ++++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 13 deletions(-)
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4038c69..af60d46 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1245,7 +1245,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
netif_keep_dst(dev);
}
-static int ip6gre_tunnel_init(struct net_device *dev)
+static int ip6gre_tunnel_init_common(struct net_device *dev)
{
struct ip6_tnl *tunnel;
@@ -1255,16 +1255,30 @@ static int ip6gre_tunnel_init(struct net_device *dev)
tunnel->net = dev_net(dev);
strcpy(tunnel->parms.name, dev->name);
+ dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+ if (!dev->tstats)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int ip6gre_tunnel_init(struct net_device *dev)
+{
+ struct ip6_tnl *tunnel;
+ int ret;
+
+ ret = ip6gre_tunnel_init_common(dev);
+ if (ret)
+ return ret;
+
+ tunnel = netdev_priv(dev);
+
memcpy(dev->dev_addr, &tunnel->parms.laddr, sizeof(struct in6_addr));
memcpy(dev->broadcast, &tunnel->parms.raddr, sizeof(struct in6_addr));
if (ipv6_addr_any(&tunnel->parms.raddr))
dev->header_ops = &ip6gre_header_ops;
- dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
- if (!dev->tstats)
- return -ENOMEM;
-
return 0;
}
@@ -1460,19 +1474,16 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
static int ip6gre_tap_init(struct net_device *dev)
{
struct ip6_tnl *tunnel;
+ int ret;
- tunnel = netdev_priv(dev);
+ ret = ip6gre_tunnel_init_common(dev);
+ if (ret)
+ return ret;
- tunnel->dev = dev;
- tunnel->net = dev_net(dev);
- strcpy(tunnel->parms.name, dev->name);
+ tunnel = netdev_priv(dev);
ip6gre_tnl_link_config(tunnel, 1);
- dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
- if (!dev->tstats)
- return -ENOMEM;
-
return 0;
}
--
2.5.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v4 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
@ 2015-09-15 21:30 ` Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 3/5] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-15 21:30 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team
It is a prep work to fix the dst_entry refcnt bugs in
ip6_tunnel.
This patch rename:
1. ip6_tnl_dst_check() to ip6_tnl_dst_get() to better
reflect that it will take a dst refcnt in the next patch.
2. ip6_tnl_dst_store() to ip6_tnl_dst_set() to have a more
conventional name matching with ip6_tnl_dst_get().
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
include/net/ip6_tunnel.h | 4 ++--
net/ipv6/ip6_gre.c | 4 ++--
net/ipv6/ip6_tunnel.c | 12 ++++++------
3 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index b8529aa..979b081 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -60,9 +60,9 @@ struct ipv6_tlv_tnl_enc_lim {
__u8 encap_limit; /* tunnel encapsulation limit */
} __packed;
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t);
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
void ip6_tnl_dst_reset(struct ip6_tnl *t);
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst);
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
const struct in6_addr *raddr);
int ip6_tnl_xmit_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index af60d46..24f5dd8 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -634,7 +634,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
}
if (!fl6->flowi6_mark)
- dst = ip6_tnl_dst_check(tunnel);
+ dst = ip6_tnl_dst_get(tunnel);
if (!dst) {
ndst = ip6_route_output(net, NULL, fl6);
@@ -763,7 +763,7 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
ip6tunnel_xmit(NULL, skb, dev);
if (ndst)
- ip6_tnl_dst_store(tunnel, ndst);
+ ip6_tnl_dst_set(tunnel, ndst);
return 0;
tx_err_link_failure:
stats->tx_carrier_errors++;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index b0ab420..599b0b4 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,7 +126,7 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
* Locking : hash tables are protected by RCU and RTNL
*/
-struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
{
struct dst_entry *dst = t->dst_cache;
@@ -139,7 +139,7 @@ struct dst_entry *ip6_tnl_dst_check(struct ip6_tnl *t)
return dst;
}
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_check);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
void ip6_tnl_dst_reset(struct ip6_tnl *t)
{
@@ -148,14 +148,14 @@ void ip6_tnl_dst_reset(struct ip6_tnl *t)
}
EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
-void ip6_tnl_dst_store(struct ip6_tnl *t, struct dst_entry *dst)
+void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
{
struct rt6_info *rt = (struct rt6_info *) dst;
t->dst_cookie = rt6_get_cookie(rt);
dst_release(t->dst_cache);
t->dst_cache = dst;
}
-EXPORT_SYMBOL_GPL(ip6_tnl_dst_store);
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
/**
* ip6_tnl_lookup - fetch tunnel matching the end-point addresses
@@ -1010,7 +1010,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr));
neigh_release(neigh);
} else if (!fl6->flowi6_mark)
- dst = ip6_tnl_dst_check(t);
+ dst = ip6_tnl_dst_get(t);
if (!ip6_tnl_xmit_ctl(t, &fl6->saddr, &fl6->daddr))
goto tx_err_link_failure;
@@ -1102,7 +1102,7 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
ipv6h->daddr = fl6->daddr;
ip6tunnel_xmit(NULL, skb, dev);
if (ndst)
- ip6_tnl_dst_store(t, ndst);
+ ip6_tnl_dst_set(t, ndst);
return 0;
tx_err_link_failure:
stats->tx_carrier_errors++;
--
2.5.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v4 net 3/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
@ 2015-09-15 21:30 ` Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-15 21:30 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team
Problems in the current dst_entry cache in the ip6_tunnel:
1. ip6_tnl_dst_set is racy. There is no lock to protect it:
- One major problem is that the dst refcnt gets messed up. F.e.
the same dst_cache can be released multiple times and then
triggering the infamous dst refcnt < 0 warning message.
- Another issue is the inconsistency between dst_cache and
dst_cookie.
It can be reproduced by adding and removing the ip6gre tunnel
while running a super_netperf TCP_CRR test.
2. ip6_tnl_dst_get does not take the dst refcnt before returning
the dst.
This patch:
1. Create a percpu dst_entry cache in ip6_tnl
2. Use a spinlock to protect the dst_cache operations
3. ip6_tnl_dst_get always takes the dst refcnt before returning
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
include/net/ip6_tunnel.h | 11 ++++-
net/ipv6/ip6_gre.c | 38 ++++++++-------
net/ipv6/ip6_tunnel.c | 122 +++++++++++++++++++++++++++++++++++------------
3 files changed, 123 insertions(+), 48 deletions(-)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 979b081..60b4f40 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -32,6 +32,12 @@ struct __ip6_tnl_parm {
__be32 o_key;
};
+struct ip6_tnl_dst {
+ spinlock_t lock;
+ struct dst_entry *dst;
+ u32 cookie;
+};
+
/* IPv6 tunnel */
struct ip6_tnl {
struct ip6_tnl __rcu *next; /* next tunnel in list */
@@ -39,8 +45,7 @@ struct ip6_tnl {
struct net *net; /* netns for packet i/o */
struct __ip6_tnl_parm parms; /* tunnel configuration parameters */
struct flowi fl; /* flowi template for xmit */
- struct dst_entry *dst_cache; /* cached dst */
- u32 dst_cookie;
+ struct ip6_tnl_dst __percpu *dst_cache; /* cached dst */
int err_count;
unsigned long err_time;
@@ -61,6 +66,8 @@ struct ipv6_tlv_tnl_enc_lim {
} __packed;
struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t);
+int ip6_tnl_dst_init(struct ip6_tnl *t);
+void ip6_tnl_dst_destroy(struct ip6_tnl *t);
void ip6_tnl_dst_reset(struct ip6_tnl *t);
void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst);
int ip6_tnl_rcv_ctl(struct ip6_tnl *t, const struct in6_addr *laddr,
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 24f5dd8..6465124 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -637,17 +637,17 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
dst = ip6_tnl_dst_get(tunnel);
if (!dst) {
- ndst = ip6_route_output(net, NULL, fl6);
+ dst = ip6_route_output(net, NULL, fl6);
- if (ndst->error)
+ if (dst->error)
goto tx_err_link_failure;
- ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
- if (IS_ERR(ndst)) {
- err = PTR_ERR(ndst);
- ndst = NULL;
+ dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+ if (IS_ERR(dst)) {
+ err = PTR_ERR(dst);
+ dst = NULL;
goto tx_err_link_failure;
}
- dst = ndst;
+ ndst = dst;
}
tdev = dst->dev;
@@ -702,12 +702,9 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
skb = new_skb;
}
- if (fl6->flowi6_mark) {
- skb_dst_set(skb, dst);
- ndst = NULL;
- } else {
- skb_dst_set_noref(skb, dst);
- }
+ if (!fl6->flowi6_mark && ndst)
+ ip6_tnl_dst_set(tunnel, ndst);
+ skb_dst_set(skb, dst);
proto = NEXTHDR_GRE;
if (encap_limit >= 0) {
@@ -762,14 +759,12 @@ static netdev_tx_t ip6gre_xmit2(struct sk_buff *skb,
skb_set_inner_protocol(skb, protocol);
ip6tunnel_xmit(NULL, skb, dev);
- if (ndst)
- ip6_tnl_dst_set(tunnel, ndst);
return 0;
tx_err_link_failure:
stats->tx_carrier_errors++;
dst_link_failure(skb);
tx_err_dst_release:
- dst_release(ndst);
+ dst_release(dst);
return err;
}
@@ -1223,6 +1218,9 @@ static const struct net_device_ops ip6gre_netdev_ops = {
static void ip6gre_dev_free(struct net_device *dev)
{
+ struct ip6_tnl *t = netdev_priv(dev);
+
+ ip6_tnl_dst_destroy(t);
free_percpu(dev->tstats);
free_netdev(dev);
}
@@ -1248,6 +1246,7 @@ static void ip6gre_tunnel_setup(struct net_device *dev)
static int ip6gre_tunnel_init_common(struct net_device *dev)
{
struct ip6_tnl *tunnel;
+ int ret;
tunnel = netdev_priv(dev);
@@ -1259,6 +1258,13 @@ static int ip6gre_tunnel_init_common(struct net_device *dev)
if (!dev->tstats)
return -ENOMEM;
+ ret = ip6_tnl_dst_init(tunnel);
+ if (ret) {
+ free_percpu(dev->tstats);
+ dev->tstats = NULL;
+ return ret;
+ }
+
return 0;
}
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 599b0b4..851cf6d 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,37 +126,90 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
* Locking : hash tables are protected by RCU and RTNL
*/
-struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
+static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+ struct dst_entry *dst)
{
- struct dst_entry *dst = t->dst_cache;
-
- if (dst && dst->obsolete &&
- !dst->ops->check(dst, t->dst_cookie)) {
- t->dst_cache = NULL;
- dst_release(dst);
- return NULL;
+ dst_release(idst->dst);
+ if (dst) {
+ dst_hold(dst);
+ idst->cookie = rt6_get_cookie((struct rt6_info *)dst);
+ } else {
+ idst->cookie = 0;
}
+ idst->dst = dst;
+}
+
+static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+ struct dst_entry *dst)
+{
+ spin_lock_bh(&idst->lock);
+ __ip6_tnl_per_cpu_dst_set(idst, dst);
+ spin_unlock_bh(&idst->lock);
+}
+
+struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
+{
+ struct ip6_tnl_dst *idst;
+ struct dst_entry *dst;
+
+ idst = raw_cpu_ptr(t->dst_cache);
+ spin_lock_bh(&idst->lock);
+ dst = idst->dst;
+ if (dst) {
+ if (!dst->obsolete || dst->ops->check(dst, idst->cookie)) {
+ dst_hold(idst->dst);
+ } else {
+ __ip6_tnl_per_cpu_dst_set(idst, NULL);
+ dst = NULL;
+ }
+ }
+ spin_unlock_bh(&idst->lock);
return dst;
}
EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
void ip6_tnl_dst_reset(struct ip6_tnl *t)
{
- dst_release(t->dst_cache);
- t->dst_cache = NULL;
+ int i;
+
+ for_each_possible_cpu(i)
+ ip6_tnl_per_cpu_dst_set(raw_cpu_ptr(t->dst_cache), NULL);
}
EXPORT_SYMBOL_GPL(ip6_tnl_dst_reset);
void ip6_tnl_dst_set(struct ip6_tnl *t, struct dst_entry *dst)
{
- struct rt6_info *rt = (struct rt6_info *) dst;
- t->dst_cookie = rt6_get_cookie(rt);
- dst_release(t->dst_cache);
- t->dst_cache = dst;
+ ip6_tnl_per_cpu_dst_set(raw_cpu_ptr(t->dst_cache), dst);
+
}
EXPORT_SYMBOL_GPL(ip6_tnl_dst_set);
+void ip6_tnl_dst_destroy(struct ip6_tnl *t)
+{
+ if (!t->dst_cache)
+ return;
+
+ ip6_tnl_dst_reset(t);
+ free_percpu(t->dst_cache);
+}
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_destroy);
+
+int ip6_tnl_dst_init(struct ip6_tnl *t)
+{
+ int i;
+
+ t->dst_cache = alloc_percpu(struct ip6_tnl_dst);
+ if (!t->dst_cache)
+ return -ENOMEM;
+
+ for_each_possible_cpu(i)
+ spin_lock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ip6_tnl_dst_init);
+
/**
* ip6_tnl_lookup - fetch tunnel matching the end-point addresses
* @remote: the address of the tunnel exit-point
@@ -271,6 +324,9 @@ ip6_tnl_unlink(struct ip6_tnl_net *ip6n, struct ip6_tnl *t)
static void ip6_dev_free(struct net_device *dev)
{
+ struct ip6_tnl *t = netdev_priv(dev);
+
+ ip6_tnl_dst_destroy(t);
free_percpu(dev->tstats);
free_netdev(dev);
}
@@ -1016,17 +1072,17 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
goto tx_err_link_failure;
if (!dst) {
- ndst = ip6_route_output(net, NULL, fl6);
+ dst = ip6_route_output(net, NULL, fl6);
- if (ndst->error)
+ if (dst->error)
goto tx_err_link_failure;
- ndst = xfrm_lookup(net, ndst, flowi6_to_flowi(fl6), NULL, 0);
- if (IS_ERR(ndst)) {
- err = PTR_ERR(ndst);
- ndst = NULL;
+ dst = xfrm_lookup(net, dst, flowi6_to_flowi(fl6), NULL, 0);
+ if (IS_ERR(dst)) {
+ err = PTR_ERR(dst);
+ dst = NULL;
goto tx_err_link_failure;
}
- dst = ndst;
+ ndst = dst;
}
tdev = dst->dev;
@@ -1072,12 +1128,11 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
consume_skb(skb);
skb = new_skb;
}
- if (fl6->flowi6_mark) {
- skb_dst_set(skb, dst);
- ndst = NULL;
- } else {
- skb_dst_set_noref(skb, dst);
- }
+
+ if (!fl6->flowi6_mark && ndst)
+ ip6_tnl_dst_set(t, ndst);
+ skb_dst_set(skb, dst);
+
skb->transport_header = skb->network_header;
proto = fl6->flowi6_proto;
@@ -1101,14 +1156,12 @@ static int ip6_tnl_xmit2(struct sk_buff *skb,
ipv6h->saddr = fl6->saddr;
ipv6h->daddr = fl6->daddr;
ip6tunnel_xmit(NULL, skb, dev);
- if (ndst)
- ip6_tnl_dst_set(t, ndst);
return 0;
tx_err_link_failure:
stats->tx_carrier_errors++;
dst_link_failure(skb);
tx_err_dst_release:
- dst_release(ndst);
+ dst_release(dst);
return err;
}
@@ -1573,12 +1626,21 @@ static inline int
ip6_tnl_dev_init_gen(struct net_device *dev)
{
struct ip6_tnl *t = netdev_priv(dev);
+ int ret;
t->dev = dev;
t->net = dev_net(dev);
dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
if (!dev->tstats)
return -ENOMEM;
+
+ ret = ip6_tnl_dst_init(t);
+ if (ret) {
+ free_percpu(dev->tstats);
+ dev->tstats = NULL;
+ return ret;
+ }
+
return 0;
}
--
2.5.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v4 net 4/5] ipv6: Avoid double dst_free
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
` (2 preceding siblings ...)
2015-09-15 21:30 ` [PATCH v4 net 3/5] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
@ 2015-09-15 21:30 ` Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel Martin KaFai Lau
2015-09-15 22:02 ` [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs " David Miller
5 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-15 21:30 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team
It is a prep work to get dst freeing from fib tree undergo
a rcu grace period.
The following is a common paradigm:
if (ip6_del_rt(rt))
dst_free(rt)
which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
1. We don't know the ip6_del_rt(rt) failure is because it
was not managed by fib tree (e.g. DST_NOCACHE) or it had already been
removed from the fib tree.
2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
dst_free(rt) has been called already. A second
dst_free(rt) is not always obviously safe. The rt may have
been destroyed already.
3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
4. It is a stopper to make dst freeing from fib tree undergo a
rcu grace period.
This patch is to use a DST_NOCACHE flag to indicate a rt is
not managed by the fib tree.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
net/ipv6/addrconf.c | 7 +++----
net/ipv6/ip6_fib.c | 11 +++++++++--
net/ipv6/route.c | 7 ++++---
3 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 030fefd..9001133 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5127,13 +5127,12 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
rt = addrconf_get_prefix_route(&ifp->peer_addr, 128,
ifp->idev->dev, 0, 0);
- if (rt && ip6_del_rt(rt))
- dst_free(&rt->dst);
+ if (rt)
+ ip6_del_rt(rt);
}
dst_hold(&ifp->rt->dst);
- if (ip6_del_rt(ifp->rt))
- dst_free(&ifp->rt->dst);
+ ip6_del_rt(ifp->rt);
rt_genid_bump_ipv6(net);
break;
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 418d982..e68350b 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -933,6 +933,10 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
int replace_required = 0;
int sernum = fib6_new_sernum(info->nl_net);
+ if (WARN_ON_ONCE((rt->dst.flags & DST_NOCACHE) &&
+ !atomic_read(&rt->dst.__refcnt)))
+ return -EINVAL;
+
if (info->nlh) {
if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
allow_create = 0;
@@ -1025,6 +1029,7 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
fib6_start_gc(info->nl_net, rt);
if (!(rt->rt6i_flags & RTF_CACHE))
fib6_prune_clones(info->nl_net, pn);
+ rt->dst.flags &= ~DST_NOCACHE;
}
out:
@@ -1049,7 +1054,8 @@ out:
atomic_inc(&pn->leaf->rt6i_ref);
}
#endif
- dst_free(&rt->dst);
+ if (!(rt->dst.flags & DST_NOCACHE))
+ dst_free(&rt->dst);
}
return err;
@@ -1060,7 +1066,8 @@ out:
st_failure:
if (fn && !(fn->fn_flags & (RTN_RTINFO|RTN_ROOT)))
fib6_repair_tree(info->nl_net, fn);
- dst_free(&rt->dst);
+ if (!(rt->dst.flags & DST_NOCACHE))
+ dst_free(&rt->dst);
return err;
#endif
}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 53617d7..3d3c1b2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1322,8 +1322,7 @@ static void ip6_link_failure(struct sk_buff *skb)
if (rt) {
if (rt->rt6i_flags & RTF_CACHE) {
dst_hold(&rt->dst);
- if (ip6_del_rt(rt))
- dst_free(&rt->dst);
+ ip6_del_rt(rt);
} else if (rt->rt6i_node && (rt->rt6i_flags & RTF_DEFAULT)) {
rt->rt6i_node->fn_sernum = -1;
}
@@ -2028,7 +2027,8 @@ static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info)
struct fib6_table *table;
struct net *net = dev_net(rt->dst.dev);
- if (rt == net->ipv6.ip6_null_entry) {
+ if (rt == net->ipv6.ip6_null_entry ||
+ rt->dst.flags & DST_NOCACHE) {
err = -ENOENT;
goto out;
}
@@ -2515,6 +2515,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
rt->rt6i_dst.addr = *addr;
rt->rt6i_dst.plen = 128;
rt->rt6i_table = fib6_get_table(net, RT6_TABLE_LOCAL);
+ rt->dst.flags |= DST_NOCACHE;
atomic_set(&rt->dst.__refcnt, 1);
--
2.5.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v4 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
` (3 preceding siblings ...)
2015-09-15 21:30 ` [PATCH v4 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
@ 2015-09-15 21:30 ` Martin KaFai Lau
2015-09-15 22:02 ` [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs " David Miller
5 siblings, 0 replies; 7+ messages in thread
From: Martin KaFai Lau @ 2015-09-15 21:30 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, Eric Dumazet, Hannes Frederic Sowa, Kernel Team
This patch uses a seqlock to ensure consistency between idst->dst and
idst->cookie. It also makes dst freeing from fib tree to undergo a
rcu grace period.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
include/net/ip6_tunnel.h | 4 ++--
net/ipv6/ip6_fib.c | 9 +++++++--
net/ipv6/ip6_tunnel.c | 51 +++++++++++++++++++++++++-----------------------
3 files changed, 36 insertions(+), 28 deletions(-)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 60b4f40..65c2a93 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -33,8 +33,8 @@ struct __ip6_tnl_parm {
};
struct ip6_tnl_dst {
- spinlock_t lock;
- struct dst_entry *dst;
+ seqlock_t lock;
+ struct dst_entry __rcu *dst;
u32 cookie;
};
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e68350b..8a9ec01 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -155,6 +155,11 @@ static void node_free(struct fib6_node *fn)
kmem_cache_free(fib6_node_kmem, fn);
}
+static void rt6_rcu_free(struct rt6_info *rt)
+{
+ call_rcu(&rt->dst.rcu_head, dst_rcu_free);
+}
+
static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
{
int cpu;
@@ -169,7 +174,7 @@ static void rt6_free_pcpu(struct rt6_info *non_pcpu_rt)
ppcpu_rt = per_cpu_ptr(non_pcpu_rt->rt6i_pcpu, cpu);
pcpu_rt = *ppcpu_rt;
if (pcpu_rt) {
- dst_free(&pcpu_rt->dst);
+ rt6_rcu_free(pcpu_rt);
*ppcpu_rt = NULL;
}
}
@@ -181,7 +186,7 @@ static void rt6_release(struct rt6_info *rt)
{
if (atomic_dec_and_test(&rt->rt6i_ref)) {
rt6_free_pcpu(rt);
- dst_free(&rt->dst);
+ rt6_rcu_free(rt);
}
}
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 851cf6d..983f0d2 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -126,45 +126,48 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
* Locking : hash tables are protected by RCU and RTNL
*/
-static void __ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
- struct dst_entry *dst)
+static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
+ struct dst_entry *dst)
{
- dst_release(idst->dst);
+ write_seqlock_bh(&idst->lock);
+ dst_release(rcu_dereference_protected(
+ idst->dst,
+ lockdep_is_held(&idst->lock.lock)));
if (dst) {
dst_hold(dst);
idst->cookie = rt6_get_cookie((struct rt6_info *)dst);
} else {
idst->cookie = 0;
}
- idst->dst = dst;
-}
-
-static void ip6_tnl_per_cpu_dst_set(struct ip6_tnl_dst *idst,
- struct dst_entry *dst)
-{
-
- spin_lock_bh(&idst->lock);
- __ip6_tnl_per_cpu_dst_set(idst, dst);
- spin_unlock_bh(&idst->lock);
+ rcu_assign_pointer(idst->dst, dst);
+ write_sequnlock_bh(&idst->lock);
}
struct dst_entry *ip6_tnl_dst_get(struct ip6_tnl *t)
{
struct ip6_tnl_dst *idst;
struct dst_entry *dst;
+ unsigned int seq;
+ u32 cookie;
idst = raw_cpu_ptr(t->dst_cache);
- spin_lock_bh(&idst->lock);
- dst = idst->dst;
- if (dst) {
- if (!dst->obsolete || dst->ops->check(dst, idst->cookie)) {
- dst_hold(idst->dst);
- } else {
- __ip6_tnl_per_cpu_dst_set(idst, NULL);
- dst = NULL;
- }
+
+ rcu_read_lock();
+ do {
+ seq = read_seqbegin(&idst->lock);
+ dst = rcu_dereference(idst->dst);
+ cookie = idst->cookie;
+ } while (read_seqretry(&idst->lock, seq));
+
+ if (dst && !atomic_inc_not_zero(&dst->__refcnt))
+ dst = NULL;
+ rcu_read_unlock();
+
+ if (dst && dst->obsolete && !dst->ops->check(dst, cookie)) {
+ ip6_tnl_per_cpu_dst_set(idst, NULL);
+ dst_release(dst);
+ dst = NULL;
}
- spin_unlock_bh(&idst->lock);
return dst;
}
EXPORT_SYMBOL_GPL(ip6_tnl_dst_get);
@@ -204,7 +207,7 @@ int ip6_tnl_dst_init(struct ip6_tnl *t)
return -ENOMEM;
for_each_possible_cpu(i)
- spin_lock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
+ seqlock_init(&per_cpu_ptr(t->dst_cache, i)->lock);
return 0;
}
--
2.5.1
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
` (4 preceding siblings ...)
2015-09-15 21:30 ` [PATCH v4 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel Martin KaFai Lau
@ 2015-09-15 22:02 ` David Miller
5 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2015-09-15 22:02 UTC (permalink / raw)
To: kafai; +Cc: netdev, eric.dumazet, hannes, kernel-team
From: Martin KaFai Lau <kafai@fb.com>
Date: Tue, 15 Sep 2015 14:30:04 -0700
> v4:
> - Fix a compilation error in patch 5 when CONFIG_LOCKDEP is turned on and
> re-test it
>
> v3:
> - Merge a 'if else if' test in patch 4
> - Use rcu_dereference_protected in patch 5 to fix a sparse check when
> CONFIG_SPARSE_RCU_POINTER is enabled
>
> v2:
> - Add patch 4 and 5 to remove the spinlock
>
> v1:
> This patch series is to fix the dst refcnt bugs in ip6_tunnel.
>
> Patch 1 and 2 are the prep works. Patch 3 is the fix.
>
> I can reproduce the bug by adding and removing the ip6gre tunnel
> while running a super_netperf TCP_CRR test. I get the following
> trace by adding WARN_ON_ONCE(newrefcnt < 0) to dst_release():
...
Series applied, thanks.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2015-09-15 22:02 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-15 21:30 [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 1/5] ipv6: Refactor common ip6gre_tunnel_init codes Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 2/5] ipv6: Rename the dst_cache helper functions in ip6_tunnel Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 3/5] ipv6: Fix dst_entry refcnt bugs " Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 4/5] ipv6: Avoid double dst_free Martin KaFai Lau
2015-09-15 21:30 ` [PATCH v4 net 5/5] ipv6: Replace spinlock with seqlock and rcu in ip6_tunnel Martin KaFai Lau
2015-09-15 22:02 ` [PATCH v4 net 0/5] ipv6: Fix dst_entry refcnt bugs " David Miller
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.