linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
@ 2012-04-17  6:22 Peter Huang (Peng)
  2012-04-17 15:52 ` Stephen Hemminger
  2012-04-18 19:04 ` Jonathan Nieder
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Huang (Peng) @ 2012-04-17  6:22 UTC (permalink / raw)
  To: shemminger, 'David S. Miller', netdev
  Cc: eric.dumazet, linux-kernel, ctrix+debianbugs, peter.huangpeng,
	peter.huangpeng, harry.majun

When bridge is deleted before tap/vif device's delete, kernel may encounter an oops because of NULL reference to fake_rtable's dst.
Set fake_rtable's dst to NULL before sending packets out can solve this problem.


Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>
---
include/linux/netfilter_bridge.h |    8 ++++++++
 net/bridge/br_forward.c          |    1 +
 net/bridge/br_netfilter.c        |    6 +-----
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 0ddd161..70744fe 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -104,9 +104,17 @@ struct bridge_skb_cb {
 	} daddr;
 };
 
+static inline void br_drop_fake_rtable(struct sk_buff *skb) {
+	struct dst_entry *dst = skb_dst(skb);
+	/* abuse fact that only fake_rtable has DST_NOPEER set */
+	if (dst && (dst->flags & DST_NOPEER))
+		skb_dst_drop(skb);
+}
+
 #else
 #define nf_bridge_maybe_copy_header(skb)	(0)
 #define nf_bridge_pad(skb)			(0)
+#define br_drop_fake_rtable(skb)		(0)
 #endif /* CONFIG_BRIDGE_NETFILTER */
 
 #endif /* __KERNEL__ */
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index 61f6534..a2098e3 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -47,6 +47,7 @@ int br_dev_queue_push_xmit(struct sk_buff *skb)
 		kfree_skb(skb);
 	} else {
 		skb_push(skb, ETH_HLEN);
+		br_drop_fake_rtable(skb);
 		dev_queue_xmit(skb);
 	}
 
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c index dec4f38..946dcb0 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -694,11 +694,7 @@ static unsigned int br_nf_local_in(unsigned int hook, struct sk_buff *skb,
 				   const struct net_device *out,
 				   int (*okfn)(struct sk_buff *))
 {
-	struct rtable *rt = skb_rtable(skb);
-
-	if (rt && rt == bridge_parent_rtable(in))
-		skb_dst_drop(skb);
-
+	br_drop_fake_rtable(skb);
 	return NF_ACCEPT;
 }
--------------------------------
Peter Huang(peng)



^ permalink raw reply related	[flat|nested] 10+ messages in thread
* [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
@ 2012-03-29  6:21 Peter Huang (Peng)
  2012-03-29  6:36 ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Huang (Peng) @ 2012-03-29  6:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: harry.majun, zhoukang7

In our environment, we encountered a kernel Oops problem, and caused a
restart.

Below are what happened:
kernel: 2.6.32.36-0.5-xen OS:xen + dom-0 + guest(rhel5.5)
1.destroy one VM.
2.ipsan path have some problem and make destroy process delayed about 10s.
3.customer defined script find that VM no longer exsit through libvirt API.
4.br0(related to the VM we are destoryed before) was deleted by the script.
5.delayed VM destroy process come to tap device releasing, this will
decrement 
skb->_skb_dst's reference count(skb->_skb_dst points to fake_rtable), but
br0 
deleting already released this struct, and unfortunately OS reused this
memory 
and marked it read-only.
6.Oops happened, and caused restart.

After analyzing the stack dump info, we find out that during our VM destroy,
lots of ipv6 multicast pkts 
exsited, and skb->_skb_dst pointed to (stuct)fake_rtable.
through kernel source greping, will only find one reference to fake_rtable's
MTU setting.

So I'm wondering that what fake_rtable stands for, and where we are using
it.
If fake_rtable's dst is not used, we can make dst as NULL to avoid our
problem,.
I also added the patch which modified the skb->_skb_dst to NULL when
"skb->_skb_dst == (unsigned long)&to->br->fake_rtable".

BTW, we also verified a similar senario on kernel-3.3, that br0 has attached
eth0 and eth1, eth1 was 
connected to our guest which will multicast ipv6 packets, and you can get an
"WARNING: at net/core/dst.c:274 dst_release+0x6d/0x70()"
by using the fake_rtable_verify.c attached, 
#gcc fake_rtable_verify.c
#./a.out &
#sleep 30         //make sure ipv6 pkts was in tap00's receiving queue.
#ifconfig br0 down
#brctl delbr br0 //delete br0, will also delete net_device's fake_rtable.
#sleep 50
#kill -9 `pidof a.out` //tap00's delete will do dst_release, and this will
write to the memory already freed.

Below is the Oops stack dump info:
////////////////////////////////////////////////////////////////////////////
///
RIP: e030:[<ffffffff802ddbd1>]
<ffffffff802ddbd1>{dst_release+0x11}
RSP: e02b:ffff88008b185b70  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff880033d184c0 RCX: 0000000000000000
RDX: ffff88008b54f080 RSI: 0000000012df12df RDI: ffff88008b54efc0
RBP: ffff8800f4a3f500 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000002 R11: ffffffff8018c1e0 R12: ffff8800f4a3f400
R13: 0000000000000001 R14: ffff8800f4a3f4e0 R15: ffff8800351030c0
FS:  00007f4cbd080700(0000) GS:ffff880002008000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88008b54f080 CR3: 000000008a27c000 CR4: 0000000000002620
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       <ffffffff80009b05>{dump_trace+0x65}
       <ffffffff8037d897>{notifier_call_chain+0x37}
       <ffffffff8005a1ed>{notify_die+0x2d}
       <ffffffff8037bd0b>{__die+0x8b}
       <ffffffff8001bed1>{no_context+0xd1}
       <ffffffff8001c1f5>{__bad_area_nosemaphore+0x175}
       <ffffffff8037b298>{page_fault+0x28}
       <ffffffff802ddbd1>{dst_release+0x11}
       <ffffffff802cd69d>{skb_release_head_state+0xbd}
       <ffffffff802cd369>{__kfree_skb+0x9}
       <ffffffff802edaab>{pfifo_fast_reset+0x5b}
       <ffffffff802edbd3>{qdisc_reset+0x13}
       <ffffffff802edcc7>{dev_deactivate_queue+0x57}
       <ffffffff802ee4bf>{dev_deactivate+0x3f}
       <ffffffff802d9575>{dev_close+0x65}
       <ffffffff802d960e>{rollback_registered+0x3e}
       <ffffffff802d9715>{unregister_netdevice+0x15}
       <ffffffffa0807655>{tun:tun_chr_close+0xe5}
       <ffffffff800d9edd>{__fput+0xcd}
       <ffffffff800d6076>{filp_close+0x56}
       <ffffffff8003fd9a>{put_files_struct+0x7a}
       <ffffffff80040fb2>{do_exit+0x752}
       <ffffffff800410ef>{do_group_exit+0x3f}
       <ffffffff8004d9d9>{get_signal_to_deliver+0x229}
       <ffffffff80006acd>{do_notify_resume+0x11d}
       <ffffffff8000763c>{int_signal+0x12}
       [<00007f4cbc7fd57d>]
////////////////////////////////////////////////////////////////////////////
///

Signed-off-by: Peter Huang(Peng) <peter.huangpeng@huawei.com>
---
diff -Nur a/net/bridge/br_forward.c b/net/bridge/br_forward.c
@@ -91,6 +91,9 @@
        skb->dev = to->dev;
        skb_forward_csum(skb);

+       if (skb->_skb_dst == (unsigned long)&to->br->fake_rtable)
+               skb_dst_set(skb, NULL);
+
        NF_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD, skb, indev, skb->dev,
                br_forward_finish);
}


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-04-19  8:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17  6:22 [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops Peter Huang (Peng)
2012-04-17 15:52 ` Stephen Hemminger
2012-04-17 16:48   ` Eric Dumazet
2012-04-18  7:57     ` Peter Huang(Peng)
2012-04-18  8:23       ` Eric Dumazet
2012-04-18  8:41       ` David Miller
2012-04-18 19:04 ` Jonathan Nieder
2012-04-19  7:58   ` Massimo Cetra
  -- strict thread matches above, loose matches on Subject: below --
2012-03-29  6:21 Peter Huang (Peng)
2012-03-29  6:36 ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).