linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
@ 2012-04-17  6:22 Peter Huang (Peng)
  2012-04-17 15:52 ` Stephen Hemminger
  2012-04-18 19:04 ` Jonathan Nieder
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Huang (Peng) @ 2012-04-17  6:22 UTC (permalink / raw)
  To: shemminger, 'David S. Miller', netdev
  Cc: eric.dumazet, linux-kernel, ctrix+debianbugs, peter.huangpeng,
	peter.huangpeng, harry.majun

When bridge is deleted before tap/vif device's delete, kernel may encounter an oops because of NULL reference to fake_rtable's dst.
Set fake_rtable's dst to NULL before sending packets out can solve this problem.


Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>
---
include/linux/netfilter_bridge.h |    8 ++++++++
 net/bridge/br_forward.c          |    1 +
 net/bridge/br_netfilter.c        |    6 +-----
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 0ddd161..70744fe 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -104,9 +104,17 @@ struct bridge_skb_cb {
 	} daddr;
 };
 
+static inline void br_drop_fake_rtable(struct sk_buff *skb) {
+	struct dst_entry *dst = skb_dst(skb);
+	/* abuse fact that only fake_rtable has DST_NOPEER set */
+	if (dst && (dst->flags & DST_NOPEER))
+		skb_dst_drop(skb);
+}
+
 #else
 #define nf_bridge_maybe_copy_header(skb)	(0)
 #define nf_bridge_pad(skb)			(0)
+#define br_drop_fake_rtable(skb)		(0)
 #endif /* CONFIG_BRIDGE_NETFILTER */
 
 #endif /* __KERNEL__ */
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index 61f6534..a2098e3 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -47,6 +47,7 @@ int br_dev_queue_push_xmit(struct sk_buff *skb)
 		kfree_skb(skb);
 	} else {
 		skb_push(skb, ETH_HLEN);
+		br_drop_fake_rtable(skb);
 		dev_queue_xmit(skb);
 	}
 
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c index dec4f38..946dcb0 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -694,11 +694,7 @@ static unsigned int br_nf_local_in(unsigned int hook, struct sk_buff *skb,
 				   const struct net_device *out,
 				   int (*okfn)(struct sk_buff *))
 {
-	struct rtable *rt = skb_rtable(skb);
-
-	if (rt && rt == bridge_parent_rtable(in))
-		skb_dst_drop(skb);
-
+	br_drop_fake_rtable(skb);
 	return NF_ACCEPT;
 }
--------------------------------
Peter Huang(peng)



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-17  6:22 [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops Peter Huang (Peng)
@ 2012-04-17 15:52 ` Stephen Hemminger
  2012-04-17 16:48   ` Eric Dumazet
  2012-04-18 19:04 ` Jonathan Nieder
  1 sibling, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2012-04-17 15:52 UTC (permalink / raw)
  To: Peter Huang (Peng)
  Cc: 'David S. Miller',
	netdev, eric.dumazet, linux-kernel, ctrix+debianbugs,
	peter.huangpeng, harry.majun

On Tue, 17 Apr 2012 14:22:26 +0800
"Peter Huang (Peng)" <peter.huangpeng@huawei.com> wrote:

> When bridge is deleted before tap/vif device's delete, kernel may encounter an oops because of NULL reference to fake_rtable's dst.
> Set fake_rtable's dst to NULL before sending packets out can solve this problem.
> 
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>
> ---
> include/linux/netfilter_bridge.h |    8 ++++++++
>  net/bridge/br_forward.c          |    1 +
>  net/bridge/br_netfilter.c        |    6 +-----
>  3 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
> index 0ddd161..70744fe 100644
> --- a/include/linux/netfilter_bridge.h
> +++ b/include/linux/netfilter_bridge.h
> @@ -104,9 +104,17 @@ struct bridge_skb_cb {
>  	} daddr;
>  };
>  
> +static inline void br_drop_fake_rtable(struct sk_buff *skb) {
> +	struct dst_entry *dst = skb_dst(skb);
> +	/* abuse fact that only fake_rtable has DST_NOPEER set */
> +	if (dst && (dst->flags & DST_NOPEER))
> +		skb_dst_drop(skb);
> +}

This check seems like a disaster waiting to happen when the next
change to DST table happens.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-17 15:52 ` Stephen Hemminger
@ 2012-04-17 16:48   ` Eric Dumazet
  2012-04-18  7:57     ` Peter Huang(Peng)
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2012-04-17 16:48 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Peter Huang (Peng), 'David S. Miller',
	netdev, linux-kernel, ctrix+debianbugs, peter.huangpeng,
	harry.majun

On Tue, 2012-04-17 at 08:52 -0700, Stephen Hemminger wrote:

> This check seems like a disaster waiting to happen when the next
> change to DST table happens.

Please Peter Document this, adding a new DST_FAKE_RTABLE flag

#define DST_FAKE_RTABLE DST_NOPEER

or just use a bit, we have plenty of them available.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-17 16:48   ` Eric Dumazet
@ 2012-04-18  7:57     ` Peter Huang(Peng)
  2012-04-18  8:23       ` Eric Dumazet
  2012-04-18  8:41       ` David Miller
  0 siblings, 2 replies; 10+ messages in thread
From: Peter Huang(Peng) @ 2012-04-18  7:57 UTC (permalink / raw)
  To: Eric Dumazet, Stephen Hemminger
  Cc: 'David S. Miller',
	netdev, linux-kernel, ctrix+debianbugs, peter.huangpeng,
	harry.majun

On 2012/4/18 0:48, Eric Dumazet wrote:
> On Tue, 2012-04-17 at 08:52 -0700, Stephen Hemminger wrote:
>
>> This check seems like a disaster waiting to happen when the next
>> change to DST table happens.
>
> Please Peter Document this, adding a new DST_FAKE_RTABLE flag
>
> #define DST_FAKE_RTABLE DST_NOPEER
>
> or just use a bit, we have plenty of them available.
>


Add DST_FAKE_RTABLE to dst_entry, this is the new patch.
Is this ok?

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>
---
  include/linux/netfilter_bridge.h |    8 ++++++++
  include/net/dst.h                |    1 +
  net/bridge/br_forward.c          |    1 +
  net/bridge/br_netfilter.c        |    8 ++------
  4 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/include/linux/netfilter_bridge.h 
b/include/linux/netfilter_bridge.h
index 0ddd161..eb09e3b 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -104,9 +104,17 @@ struct bridge_skb_cb {
         } daddr;
  };

+static inline void br_drop_fake_rtable(struct sk_buff *skb) {
+       struct dst_entry *dst = skb_dst(skb);
+       /* abuse fact that only fake_rtable has DST_FAKE_RTABLE set */
+       if (dst && (dst->flags & DST_FAKE_RTABLE))
+               skb_dst_drop(skb);
+}
+
  #else
  #define nf_bridge_maybe_copy_header(skb)       (0)
  #define nf_bridge_pad(skb)                     (0)
+#define br_drop_fake_rtable(skb)               (0)
  #endif /* CONFIG_BRIDGE_NETFILTER */

  #endif /* __KERNEL__ */
diff --git a/include/net/dst.h b/include/net/dst.h
index 59c5d18..b094030 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -55,6 +55,7 @@ struct dst_entry {
  #define DST_NOCACHE            0x0010
  #define DST_NOCOUNT            0x0020
  #define DST_NOPEER             0x0040
+#define DST_FAKE_RTABLE                0x0080

         short                   error;
         short                   obsolete;
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 61f6534..a2098e3 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -47,6 +47,7 @@ int br_dev_queue_push_xmit(struct sk_buff *skb)
                 kfree_skb(skb);
         } else {
                 skb_push(skb, ETH_HLEN);
+               br_drop_fake_rtable(skb);
                 dev_queue_xmit(skb);
         }

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index dec4f38..d7f49b6 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -156,7 +156,7 @@ void br_netfilter_rtable_init(struct net_bridge *br)
         rt->dst.dev = br->dev;
         rt->dst.path = &rt->dst;
         dst_init_metrics(&rt->dst, br_dst_default_metrics, true);
-       rt->dst.flags   = DST_NOXFRM | DST_NOPEER;
+       rt->dst.flags   = DST_NOXFRM | DST_NOPEER | DST_FAKE_RTABLE;
         rt->dst.ops = &fake_dst_ops;
  }

@@ -694,11 +694,7 @@ static unsigned int br_nf_local_in(unsigned int 
hook, struct sk_buff *skb,
                                    const struct net_device *out,
                                    int (*okfn)(struct sk_buff *))
  {
-       struct rtable *rt = skb_rtable(skb);
-
-       if (rt && rt == bridge_parent_rtable(in))
-               skb_dst_drop(skb);
-
+       br_drop_fake_rtable(skb);
         return NF_ACCEPT;
  }


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-18  7:57     ` Peter Huang(Peng)
@ 2012-04-18  8:23       ` Eric Dumazet
  2012-04-18  8:41       ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2012-04-18  8:23 UTC (permalink / raw)
  To: Peter Huang(Peng)
  Cc: Stephen Hemminger, 'David S. Miller',
	netdev, linux-kernel, ctrix+debianbugs, peter.huangpeng,
	harry.majun

On Wed, 2012-04-18 at 15:57 +0800, Peter Huang(Peng) wrote:
> On 2012/4/18 0:48, Eric Dumazet wrote:
> > On Tue, 2012-04-17 at 08:52 -0700, Stephen Hemminger wrote:
> >
> >> This check seems like a disaster waiting to happen when the next
> >> change to DST table happens.
> >
> > Please Peter Document this, adding a new DST_FAKE_RTABLE flag
> >
> > #define DST_FAKE_RTABLE DST_NOPEER
> >
> > or just use a bit, we have plenty of them available.
> >
> 
> 
> Add DST_FAKE_RTABLE to dst_entry, this is the new patch.
> Is this ok?
> 

A full new patch is needed, with nice changelog, and proper formatting
(your mail was mangled)


>   };
> 
> +static inline void br_drop_fake_rtable(struct sk_buff *skb) {
> +       struct dst_entry *dst = skb_dst(skb);
> +       /* abuse fact that only fake_rtable has DST_FAKE_RTABLE set */

Remove the comment, since we dont abuse NOPEER anymore, we use a
dedicated flag. (keep an empty line)

> +       if (dst && (dst->flags & DST_FAKE_RTABLE))
> +               skb_dst_drop(skb);
> +}
> +
>   #else
>   #define nf_bridge_maybe_copy_header(skb)       (0)
>   #define nf_bridge_pad(skb)                     (0)
> +#define br_drop_fake_rtable(skb)               (0)
>   #endif /* CONFIG_BRIDGE_NETFILTER */
> 
>   #endif /* __KERNEL__ */
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 59c5d18..b094030 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -55,6 +55,7 @@ struct dst_entry {
>   #define DST_NOCACHE            0x0010
>   #define DST_NOCOUNT            0x0020
>   #define DST_NOPEER             0x0040
> +#define DST_FAKE_RTABLE                0x0080



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-18  7:57     ` Peter Huang(Peng)
  2012-04-18  8:23       ` Eric Dumazet
@ 2012-04-18  8:41       ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: David Miller @ 2012-04-18  8:41 UTC (permalink / raw)
  To: peter.huangpeng
  Cc: eric.dumazet, shemminger, netdev, linux-kernel, ctrix+debianbugs,
	peter.huangpeng, harry.majun

From: "Peter Huang(Peng)" <peter.huangpeng@huawei.com>
Date: Wed, 18 Apr 2012 15:57:23 +0800

> On 2012/4/18 0:48, Eric Dumazet wrote:
>> On Tue, 2012-04-17 at 08:52 -0700, Stephen Hemminger wrote:
>>
>>> This check seems like a disaster waiting to happen when the next
>>> change to DST table happens.
>>
>> Please Peter Document this, adding a new DST_FAKE_RTABLE flag
>>
>> #define DST_FAKE_RTABLE DST_NOPEER
>>
>> or just use a bit, we have plenty of them available.
>>
> 
> 
> Add DST_FAKE_RTABLE to dst_entry, this is the new patch.
> Is this ok?
> 
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>

Please post new patches as completely new emails, not as replies
to other emails.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-17  6:22 [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops Peter Huang (Peng)
  2012-04-17 15:52 ` Stephen Hemminger
@ 2012-04-18 19:04 ` Jonathan Nieder
  2012-04-19  7:58   ` Massimo Cetra
  1 sibling, 1 reply; 10+ messages in thread
From: Jonathan Nieder @ 2012-04-18 19:04 UTC (permalink / raw)
  To: Peter Huang (Peng)
  Cc: shemminger, 'David S. Miller',
	netdev, eric.dumazet, linux-kernel, ctrix+debianbugs,
	peter.huangpeng, harry.majun

Hi,

Peter Huang (Peng) wrote:

> When bridge is deleted before tap/vif device's delete, kernel may
> encounter an oops because of NULL reference to fake_rtable's dst.
>
> Set fake_rtable's dst to NULL before sending packets out can solve
> this problem.
>
> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Peter Huang <peter.huangpeng@huawei.com>
> ---
> include/linux/netfilter_bridge.h |    8 ++++++++
>  net/bridge/br_forward.c          |    1 +
>  net/bridge/br_netfilter.c        |    6 +-----
>  3 files changed, 10 insertions(+), 5 deletions(-)

Massimo Cetra (cc-ed) tested the patch against a 3.2.y kernel and wrote[1]:

> The patch i applied yesterday to the debian kernel has been installed 
> and the kernel is not panic-ing anymore.
>
> I'll try to keep this bug up to date.

So it seems to work.  Dave, please consider queuing this for stable@
when the final patch is ready.

Thanks,
Jonathan

[1] http://bugs.debian.org/668511#37

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-04-18 19:04 ` Jonathan Nieder
@ 2012-04-19  7:58   ` Massimo Cetra
  0 siblings, 0 replies; 10+ messages in thread
From: Massimo Cetra @ 2012-04-19  7:58 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Peter Huang (Peng), shemminger, 'David S. Miller',
	netdev, eric.dumazet, linux-kernel, ctrix+debianbugs,
	peter.huangpeng, harry.majun

On 18/04/2012 21:04, Jonathan Nieder wrote:
>
> Massimo Cetra (cc-ed) tested the patch against a 3.2.y kernel and wrote[1]:
>
>> The patch i applied yesterday to the debian kernel has been installed
>> and the kernel is not panic-ing anymore.
>>
>> I'll try to keep this bug up to date.
>
> So it seems to work.  Dave, please consider queuing this for stable@
> when the final patch is ready.

Also please consider that the bridge problems are not over and the 
kernel keeps panicing in a different way...

- http://marc.info/?t=133474395500004&r=1&w=2

Thanks,
Massimo



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
  2012-03-29  6:21 Peter Huang (Peng)
@ 2012-03-29  6:36 ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2012-03-29  6:36 UTC (permalink / raw)
  To: Peter Huang (Peng); +Cc: linux-kernel, harry.majun, zhoukang7, netdev

On Thu, 2012-03-29 at 14:21 +0800, Peter Huang (Peng) wrote:
> In our environment, we encountered a kernel Oops problem, and caused a
> restart.
> 

CC netdev, since its more appropriate

> Below are what happened:
> kernel: 2.6.32.36-0.5-xen OS:xen + dom-0 + guest(rhel5.5)
> 1.destroy one VM.
> 2.ipsan path have some problem and make destroy process delayed about 10s.
> 3.customer defined script find that VM no longer exsit through libvirt API.
> 4.br0(related to the VM we are destoryed before) was deleted by the script.
> 5.delayed VM destroy process come to tap device releasing, this will
> decrement 
> skb->_skb_dst's reference count(skb->_skb_dst points to fake_rtable), but
> br0 
> deleting already released this struct, and unfortunately OS reused this
> memory 
> and marked it read-only.
> 6.Oops happened, and caused restart.
> 
> After analyzing the stack dump info, we find out that during our VM destroy,
> lots of ipv6 multicast pkts 
> exsited, and skb->_skb_dst pointed to (stuct)fake_rtable.
> through kernel source greping, will only find one reference to fake_rtable's
> MTU setting.
> 
> So I'm wondering that what fake_rtable stands for, and where we are using
> it.
> If fake_rtable's dst is not used, we can make dst as NULL to avoid our
> problem,.
> I also added the patch which modified the skb->_skb_dst to NULL when
> "skb->_skb_dst == (unsigned long)&to->br->fake_rtable".
> 
> BTW, we also verified a similar senario on kernel-3.3, that br0 has attached
> eth0 and eth1, eth1 was 
> connected to our guest which will multicast ipv6 packets, and you can get an
> "WARNING: at net/core/dst.c:274 dst_release+0x6d/0x70()"
> by using the fake_rtable_verify.c attached, 
> #gcc fake_rtable_verify.c
> #./a.out &
> #sleep 30         //make sure ipv6 pkts was in tap00's receiving queue.
> #ifconfig br0 down
> #brctl delbr br0 //delete br0, will also delete net_device's fake_rtable.
> #sleep 50
> #kill -9 `pidof a.out` //tap00's delete will do dst_release, and this will
> write to the memory already freed.
> 
> Below is the Oops stack dump info:
> ////////////////////////////////////////////////////////////////////////////
> ///
> RIP: e030:[<ffffffff802ddbd1>]
> <ffffffff802ddbd1>{dst_release+0x11}
> RSP: e02b:ffff88008b185b70  EFLAGS: 00010286
> RAX: 00000000ffffffff RBX: ffff880033d184c0 RCX: 0000000000000000
> RDX: ffff88008b54f080 RSI: 0000000012df12df RDI: ffff88008b54efc0
> RBP: ffff8800f4a3f500 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000002 R11: ffffffff8018c1e0 R12: ffff8800f4a3f400
> R13: 0000000000000001 R14: ffff8800f4a3f4e0 R15: ffff8800351030c0
> FS:  00007f4cbd080700(0000) GS:ffff880002008000(0000) knlGS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff88008b54f080 CR3: 000000008a27c000 CR4: 0000000000002620
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>        <ffffffff80009b05>{dump_trace+0x65}
>        <ffffffff8037d897>{notifier_call_chain+0x37}
>        <ffffffff8005a1ed>{notify_die+0x2d}
>        <ffffffff8037bd0b>{__die+0x8b}
>        <ffffffff8001bed1>{no_context+0xd1}
>        <ffffffff8001c1f5>{__bad_area_nosemaphore+0x175}
>        <ffffffff8037b298>{page_fault+0x28}
>        <ffffffff802ddbd1>{dst_release+0x11}
>        <ffffffff802cd69d>{skb_release_head_state+0xbd}
>        <ffffffff802cd369>{__kfree_skb+0x9}
>        <ffffffff802edaab>{pfifo_fast_reset+0x5b}
>        <ffffffff802edbd3>{qdisc_reset+0x13}
>        <ffffffff802edcc7>{dev_deactivate_queue+0x57}
>        <ffffffff802ee4bf>{dev_deactivate+0x3f}
>        <ffffffff802d9575>{dev_close+0x65}
>        <ffffffff802d960e>{rollback_registered+0x3e}
>        <ffffffff802d9715>{unregister_netdevice+0x15}
>        <ffffffffa0807655>{tun:tun_chr_close+0xe5}
>        <ffffffff800d9edd>{__fput+0xcd}
>        <ffffffff800d6076>{filp_close+0x56}
>        <ffffffff8003fd9a>{put_files_struct+0x7a}
>        <ffffffff80040fb2>{do_exit+0x752}
>        <ffffffff800410ef>{do_group_exit+0x3f}
>        <ffffffff8004d9d9>{get_signal_to_deliver+0x229}
>        <ffffffff80006acd>{do_notify_resume+0x11d}
>        <ffffffff8000763c>{int_signal+0x12}
>        [<00007f4cbc7fd57d>]
> ////////////////////////////////////////////////////////////////////////////
> ///
> 
> Signed-off-by: Peter Huang(Peng) <peter.huangpeng@huawei.com>
> ---
> diff -Nur a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> @@ -91,6 +91,9 @@
>         skb->dev = to->dev;
>         skb_forward_csum(skb);
> 
> +       if (skb->_skb_dst == (unsigned long)&to->br->fake_rtable)
> +               skb_dst_set(skb, NULL);
> +
>         NF_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD, skb, indev, skb->dev,
>                 br_forward_finish);
> }

Did you check current kernel has this bug ?

I remember we already fix this, maybe you need a backport.



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops.
@ 2012-03-29  6:21 Peter Huang (Peng)
  2012-03-29  6:36 ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Huang (Peng) @ 2012-03-29  6:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: harry.majun, zhoukang7

In our environment, we encountered a kernel Oops problem, and caused a
restart.

Below are what happened:
kernel: 2.6.32.36-0.5-xen OS:xen + dom-0 + guest(rhel5.5)
1.destroy one VM.
2.ipsan path have some problem and make destroy process delayed about 10s.
3.customer defined script find that VM no longer exsit through libvirt API.
4.br0(related to the VM we are destoryed before) was deleted by the script.
5.delayed VM destroy process come to tap device releasing, this will
decrement 
skb->_skb_dst's reference count(skb->_skb_dst points to fake_rtable), but
br0 
deleting already released this struct, and unfortunately OS reused this
memory 
and marked it read-only.
6.Oops happened, and caused restart.

After analyzing the stack dump info, we find out that during our VM destroy,
lots of ipv6 multicast pkts 
exsited, and skb->_skb_dst pointed to (stuct)fake_rtable.
through kernel source greping, will only find one reference to fake_rtable's
MTU setting.

So I'm wondering that what fake_rtable stands for, and where we are using
it.
If fake_rtable's dst is not used, we can make dst as NULL to avoid our
problem,.
I also added the patch which modified the skb->_skb_dst to NULL when
"skb->_skb_dst == (unsigned long)&to->br->fake_rtable".

BTW, we also verified a similar senario on kernel-3.3, that br0 has attached
eth0 and eth1, eth1 was 
connected to our guest which will multicast ipv6 packets, and you can get an
"WARNING: at net/core/dst.c:274 dst_release+0x6d/0x70()"
by using the fake_rtable_verify.c attached, 
#gcc fake_rtable_verify.c
#./a.out &
#sleep 30         //make sure ipv6 pkts was in tap00's receiving queue.
#ifconfig br0 down
#brctl delbr br0 //delete br0, will also delete net_device's fake_rtable.
#sleep 50
#kill -9 `pidof a.out` //tap00's delete will do dst_release, and this will
write to the memory already freed.

Below is the Oops stack dump info:
////////////////////////////////////////////////////////////////////////////
///
RIP: e030:[<ffffffff802ddbd1>]
<ffffffff802ddbd1>{dst_release+0x11}
RSP: e02b:ffff88008b185b70  EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff880033d184c0 RCX: 0000000000000000
RDX: ffff88008b54f080 RSI: 0000000012df12df RDI: ffff88008b54efc0
RBP: ffff8800f4a3f500 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000002 R11: ffffffff8018c1e0 R12: ffff8800f4a3f400
R13: 0000000000000001 R14: ffff8800f4a3f4e0 R15: ffff8800351030c0
FS:  00007f4cbd080700(0000) GS:ffff880002008000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88008b54f080 CR3: 000000008a27c000 CR4: 0000000000002620
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       <ffffffff80009b05>{dump_trace+0x65}
       <ffffffff8037d897>{notifier_call_chain+0x37}
       <ffffffff8005a1ed>{notify_die+0x2d}
       <ffffffff8037bd0b>{__die+0x8b}
       <ffffffff8001bed1>{no_context+0xd1}
       <ffffffff8001c1f5>{__bad_area_nosemaphore+0x175}
       <ffffffff8037b298>{page_fault+0x28}
       <ffffffff802ddbd1>{dst_release+0x11}
       <ffffffff802cd69d>{skb_release_head_state+0xbd}
       <ffffffff802cd369>{__kfree_skb+0x9}
       <ffffffff802edaab>{pfifo_fast_reset+0x5b}
       <ffffffff802edbd3>{qdisc_reset+0x13}
       <ffffffff802edcc7>{dev_deactivate_queue+0x57}
       <ffffffff802ee4bf>{dev_deactivate+0x3f}
       <ffffffff802d9575>{dev_close+0x65}
       <ffffffff802d960e>{rollback_registered+0x3e}
       <ffffffff802d9715>{unregister_netdevice+0x15}
       <ffffffffa0807655>{tun:tun_chr_close+0xe5}
       <ffffffff800d9edd>{__fput+0xcd}
       <ffffffff800d6076>{filp_close+0x56}
       <ffffffff8003fd9a>{put_files_struct+0x7a}
       <ffffffff80040fb2>{do_exit+0x752}
       <ffffffff800410ef>{do_group_exit+0x3f}
       <ffffffff8004d9d9>{get_signal_to_deliver+0x229}
       <ffffffff80006acd>{do_notify_resume+0x11d}
       <ffffffff8000763c>{int_signal+0x12}
       [<00007f4cbc7fd57d>]
////////////////////////////////////////////////////////////////////////////
///

Signed-off-by: Peter Huang(Peng) <peter.huangpeng@huawei.com>
---
diff -Nur a/net/bridge/br_forward.c b/net/bridge/br_forward.c
@@ -91,6 +91,9 @@
        skb->dev = to->dev;
        skb_forward_csum(skb);

+       if (skb->_skb_dst == (unsigned long)&to->br->fake_rtable)
+               skb_dst_set(skb, NULL);
+
        NF_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD, skb, indev, skb->dev,
                br_forward_finish);
}


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-04-19  8:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-04-17  6:22 [PATCH] set fake_rtable's dst to NULL to avoid kernel Oops Peter Huang (Peng)
2012-04-17 15:52 ` Stephen Hemminger
2012-04-17 16:48   ` Eric Dumazet
2012-04-18  7:57     ` Peter Huang(Peng)
2012-04-18  8:23       ` Eric Dumazet
2012-04-18  8:41       ` David Miller
2012-04-18 19:04 ` Jonathan Nieder
2012-04-19  7:58   ` Massimo Cetra
  -- strict thread matches above, loose matches on Subject: below --
2012-03-29  6:21 Peter Huang (Peng)
2012-03-29  6:36 ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).