linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
@ 2012-05-22  9:48 kun.jiang
  2012-05-22 19:15 ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: kun.jiang @ 2012-05-22  9:48 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: davem, yanmin_zhang

From: Yanmin Zhang <yanmin_zhang@linux.intel.com>

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50

In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
nh_dev. kfree_rcu(fi, rcu) would release the whole area.

Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
---
 net/ipv4/fib_semantics.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5063fa3..68ea013 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -159,7 +159,6 @@ void free_fib_info(struct fib_info *fi)
 	change_nexthops(fi) {
 		if (nexthop_nh->nh_dev)
 			dev_put(nexthop_nh->nh_dev);
-		nexthop_nh->nh_dev = NULL;
 	} endfor_nexthops(fi);
 	fib_info_cnt--;
 	release_net(fi->fib_net);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-22  9:48 [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow kun.jiang
@ 2012-05-22 19:15 ` David Miller
  2012-05-23  3:02   ` Yanmin Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2012-05-22 19:15 UTC (permalink / raw)
  To: kunx.jiang; +Cc: netdev, linux-kernel, yanmin_zhang

From: "kun.jiang" <kunx.jiang@intel.com>
Date: Tue, 22 May 2012 17:48:53 +0800

> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> 
> We hit a kernel OOPS.
 ...
> In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
> nh_dev. kfree_rcu(fi, rcu) would release the whole area.
> 
> Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Signed-off-by: Kun Jiang <kunx.jiang@intel.com>

This isn't a fix.  You're keeping around a pointer to a completely
released object, which is therefore illegal to dereference.

That's why we must set it to NULL, to catch such illegal accesses.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-22 19:15 ` David Miller
@ 2012-05-23  3:02   ` Yanmin Zhang
  2012-05-23  3:23     ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  3:02 UTC (permalink / raw)
  To: David Miller; +Cc: kunx.jiang, netdev, linux-kernel

On Tue, 2012-05-22 at 15:15 -0400, David Miller wrote:
> From: "kun.jiang" <kunx.jiang@intel.com>
> Date: Tue, 22 May 2012 17:48:53 +0800
> 
> > From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> > 
> > We hit a kernel OOPS.
>  ...
> > In function free_fib_info, we don't reset nexthop_nh->nh_dev to NULL before releasing
> > nh_dev. kfree_rcu(fi, rcu) would release the whole area.
> > 
> > Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> > Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
> 
> This isn't a fix.  You're keeping around a pointer to a completely
> released object, which is therefore illegal to dereference.
> 
> That's why we must set it to NULL, to catch such illegal accesses.
Thanks for the comments. The network stack in kernel is complicated.

Questions:
1) Why does free_fib_info call call_rcu instead of releasing fi directly?
I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
If other cpu are accessing it, here resetting to NULL would cause other
cpu panic.

void free_fib_info(struct fib_info *fi)
{
        if (fi->fib_dead == 0) {
                pr_warn("Freeing alive fib_info %p\n", fi);
                return;
        }
        change_nexthops(fi) {
                if (nexthop_nh->nh_dev)
                        dev_put(nexthop_nh->nh_dev);
                nexthop_nh->nh_dev = NULL;
        } endfor_nexthops(fi);
        fib_info_cnt--;
        release_net(fi->fib_net);
        call_rcu(&fi->rcu, free_fib_info_rcu);		<=====rcu
}
2) dev_put is to decrease the counter and doesn't release nh_dev.

If 2) is incorrect, how about below solution? It delays the resetting
in rcu.

================

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50

Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
fi. Other cpu might be accessing fi. Fixing it by delaying the resetting.

Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
---
 net/ipv4/fib_semantics.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5063fa3..56d8a97 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -145,6 +145,14 @@ static void free_fib_info_rcu(struct rcu_head *head)
 {
 	struct fib_info *fi = container_of(head, struct fib_info, rcu);
 
+	change_nexthops(fi) {
+		if (nexthop_nh->nh_dev)
+			dev_put(nexthop_nh->nh_dev);
+		nexthop_nh->nh_dev = NULL;
+	} endfor_nexthops(fi);
+
+	fib_info_cnt--;
+	release_net(fi->fib_net);
 	if (fi->fib_metrics != (u32 *) dst_default_metrics)
 		kfree(fi->fib_metrics);
 	kfree(fi);
@@ -156,13 +164,6 @@ void free_fib_info(struct fib_info *fi)
 		pr_warn("Freeing alive fib_info %p\n", fi);
 		return;
 	}
-	change_nexthops(fi) {
-		if (nexthop_nh->nh_dev)
-			dev_put(nexthop_nh->nh_dev);
-		nexthop_nh->nh_dev = NULL;
-	} endfor_nexthops(fi);
-	fib_info_cnt--;
-	release_net(fi->fib_net);
 	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
 
-- 
1.7.5.4




^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  3:02   ` Yanmin Zhang
@ 2012-05-23  3:23     ` David Miller
  2012-05-23  3:30       ` Yanmin Zhang
  2012-05-23  4:37       ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: David Miller @ 2012-05-23  3:23 UTC (permalink / raw)
  To: yanmin_zhang; +Cc: kunx.jiang, netdev, linux-kernel

From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date: Wed, 23 May 2012 11:02:03 +0800

> 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> If other cpu are accessing it, here resetting to NULL would cause other
> cpu panic.

Because fib trie lookups are done with RCU locking, therefore we must
use RCU freeing to release the object.

What I was trying to impart to you is that removing the NULL
assignment is wrong and that an alternative fix is warranted (hint:
consider moving something into the RCU release).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  3:23     ` David Miller
@ 2012-05-23  3:30       ` Yanmin Zhang
  2012-05-23  3:49         ` David Miller
  2012-05-23  4:37       ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  3:30 UTC (permalink / raw)
  To: David Miller; +Cc: kunx.jiang, netdev, linux-kernel

On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date: Wed, 23 May 2012 11:02:03 +0800
> 
> > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > If other cpu are accessing it, here resetting to NULL would cause other
> > cpu panic.
> 
> Because fib trie lookups are done with RCU locking, therefore we must
> use RCU freeing to release the object.
> 
> What I was trying to impart to you is that removing the NULL
> assignment is wrong and that an alternative fix is warranted (hint:
> consider moving something into the RCU release).
Thanks for the explanation.

How about the new patch posted in the end of previous reply? It does move the
the resetting to RCU release.
https://lkml.org/lkml/2012/5/22/558?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  3:30       ` Yanmin Zhang
@ 2012-05-23  3:49         ` David Miller
  2012-05-23  4:41           ` Yanmin Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2012-05-23  3:49 UTC (permalink / raw)
  To: yanmin_zhang; +Cc: kunx.jiang, netdev, linux-kernel

From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Date: Wed, 23 May 2012 11:30:41 +0800

> How about the new patch posted in the end of previous reply? It does move the
> the resetting to RCU release.
> https://lkml.org/lkml/2012/5/22/558?

If you test it and submit it formally I'll probably apply it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  3:23     ` David Miller
  2012-05-23  3:30       ` Yanmin Zhang
@ 2012-05-23  4:37       ` Eric Dumazet
  2012-05-23  4:54         ` Yanmin Zhang
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  4:37 UTC (permalink / raw)
  To: David Miller; +Cc: yanmin_zhang, kunx.jiang, netdev, linux-kernel

On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date: Wed, 23 May 2012 11:02:03 +0800
> 
> > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > If other cpu are accessing it, here resetting to NULL would cause other
> > cpu panic.
> 
> Because fib trie lookups are done with RCU locking, therefore we must
> use RCU freeing to release the object.
> 
> What I was trying to impart to you is that removing the NULL
> assignment is wrong and that an alternative fix is warranted (hint:
> consider moving something into the RCU release).
> --

Its more than that I'm afraid.

fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.

Also the "fib_info_cnt--;" must stay in free_fib_info() (so that it is
protected by RTNL)




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  3:49         ` David Miller
@ 2012-05-23  4:41           ` Yanmin Zhang
  2012-05-23  5:08             ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  4:41 UTC (permalink / raw)
  To: David Miller; +Cc: kunx.jiang, netdev, linux-kernel

On Tue, 2012-05-22 at 23:49 -0400, David Miller wrote:
> From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> Date: Wed, 23 May 2012 11:30:41 +0800
> 
> > How about the new patch posted in the end of previous reply? It does move the
> > the resetting to RCU release.
> > https://lkml.org/lkml/2012/5/22/558?
> 
> If you test it and submit it formally I'll probably apply it.
We did test it. But it's hard to reproduce it. We hit it the issue
for a couple of times when running MTBF testing on Android smart phone.
Mostly, it happens after MTBF runs for more than 12 hours. To releasing
the product, MTBF testing should run for hundreds of hours. This bug
blocks it.

We would submit it formally and also run more testing.

Sorry for taking you too much time.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  4:37       ` Eric Dumazet
@ 2012-05-23  4:54         ` Yanmin Zhang
  2012-05-23  5:02           ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  4:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 06:37 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-22 at 23:23 -0400, David Miller wrote:
> > From: Yanmin Zhang <yanmin_zhang@linux.intel.com>
> > Date: Wed, 23 May 2012 11:02:03 +0800
> > 
> > > 1) Why does free_fib_info call call_rcu instead of releasing fi directly?
> > > I assume other cpu might be accessing it. nexthop_nh->nh_dev is in fi.
> > > If other cpu are accessing it, here resetting to NULL would cause other
> > > cpu panic.
> > 
> > Because fib trie lookups are done with RCU locking, therefore we must
> > use RCU freeing to release the object.
> > 
> > What I was trying to impart to you is that removing the NULL
> > assignment is wrong and that an alternative fix is warranted (hint:
> > consider moving something into the RCU release).
> > --
> 
> Its more than that I'm afraid.
> 
> fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
resetting to RCU protection.

> 
> Also the "fib_info_cnt--;" must stay in free_fib_info() (so that it is
> protected by RTNL)
We would move "fib_info_cnt--;" back to free_fib_info.

Thanks,
Yanmin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  4:54         ` Yanmin Zhang
@ 2012-05-23  5:02           ` Eric Dumazet
  2012-05-23  6:15             ` Yanmin Zhang
  2012-05-23  6:37             ` Eric Dumazet
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  5:02 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 12:54 +0800, Yanmin Zhang wrote:

> > fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
> The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
> resetting to RCU protection.

Its not enough.

We must take care that all users are in a RCU protected region.

They might be already, but a full check is needed.

For example 

net/ipv4/fib_trie.c:2563:    fi->fib_dev ? fi->fib_dev->name : "*"

looks to be safe (because already in a rcu_read_lock())

But its not.

Right thing would be to do :

struct net_device *ndev = rcu_dereference(fi->fib_dev)

	...
	ndev ? ndev->name : "*"





^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  4:41           ` Yanmin Zhang
@ 2012-05-23  5:08             ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  5:08 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 12:41 +0800, Yanmin Zhang wrote:
> We did test it. But it's hard to reproduce it. We hit it the issue
> for a couple of times when running MTBF testing on Android smart phone.
> Mostly, it happens after MTBF runs for more than 12 hours. To releasing
> the product, MTBF testing should run for hundreds of hours. This bug
> blocks it.
> 

I have an idea how to trigger this in my test machine.

> We would submit it formally and also run more testing.
> 
> Sorry for taking you too much time.

Hey, you fix a bug, take your time and dont worry.

Thanks



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  5:02           ` Eric Dumazet
@ 2012-05-23  6:15             ` Yanmin Zhang
  2012-05-23  6:27               ` Eric Dumazet
  2012-05-23  6:37             ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  6:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 07:02 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 12:54 +0800, Yanmin Zhang wrote:
> 
> > > fi->fib_dev (aka fib_nh[0].nh_dev) need full RCU protection.
> > The new patch posted at https://lkml.org/lkml/2012/5/22/558 does move the
> > resetting to RCU protection.
> 
> Its not enough.
> 
> We must take care that all users are in a RCU protected region.
> 
> They might be already, but a full check is needed.
> 
> For example 
> 
> net/ipv4/fib_trie.c:2563:    fi->fib_dev ? fi->fib_dev->name : "*"
> 
> looks to be safe (because already in a rcu_read_lock())
> 
> But its not.
> 
> Right thing would be to do :
> 
> struct net_device *ndev = rcu_dereference(fi->fib_dev)
> 
> 	...
> 	ndev ? ndev->name : "*"
Thanks for the pointer.

Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
and nexthop_nh->nh_hash.

Yanmin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  6:15             ` Yanmin Zhang
@ 2012-05-23  6:27               ` Eric Dumazet
  2012-05-23  6:47                 ` Yanmin Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  6:27 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 14:15 +0800, Yanmin Zhang wrote:

> 
> Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
> and nexthop_nh->nh_hash.

No, its not needed.

I am sending a patch, because I feel this area is too complex for non
netdev guys.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  5:02           ` Eric Dumazet
  2012-05-23  6:15             ` Yanmin Zhang
@ 2012-05-23  6:37             ` Eric Dumazet
  2012-05-23  6:43               ` Eric Dumazet
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  6:37 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

From: Eric Dumazet <edumazet@google.com>

I am testing following patch, could you test it too ?

Thanks

[PATCH] ipv4: nh_dev should be rcu protected

Yanmin Zhang reported an OOPS and provided a patch.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial
prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial
prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
<4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
<4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
<4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
<4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50

But real fix is to provide appropriate RCU protection to nh_dev/fib_dev,
with appropriate __rcu annotation.

tested with CONFIG_PROVE_RCU and CONFIG_SPARSE_RCU_POINTER

Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Reported-by: Kun Jiang <kunx.jiang@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/inetdevice.h        |    8 ++-
 include/net/ip_fib.h              |    2 
 net/ipv4/devinet.c                |   11 ++--
 net/ipv4/fib_frontend.c           |    6 +-
 net/ipv4/fib_semantics.c          |   66 ++++++++++++++++++----------
 net/ipv4/fib_trie.c               |   11 ++--
 net/ipv4/netfilter/ipt_rpfilter.c |    4 -
 net/ipv4/route.c                  |    4 -
 8 files changed, 71 insertions(+), 41 deletions(-)

diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 597f4a9..8cfa569 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -172,7 +172,13 @@ extern int		inet_addr_onlink(struct in_device *in_dev, __be32 a, __be32 b);
 extern int		devinet_ioctl(struct net *net, unsigned int cmd, void __user *);
 extern void		devinet_init(void);
 extern struct in_device	*inetdev_by_index(struct net *, int);
-extern __be32		inet_select_addr(const struct net_device *dev, __be32 dst, int scope);
+
+extern __be32	__inet_select_addr(struct net *net,
+				   const struct net_device *dev,
+				   __be32 dst, int scope);
+#define inet_select_addr(dev, dst, scope) \
+	__inet_select_addr(dev_net(dev), (dev), (dst), (scope))
+
 extern __be32		inet_confirm_addr(struct in_device *in_dev, __be32 dst, __be32 local, int scope);
 extern struct in_ifaddr *inet_ifa_byprefix(struct in_device *in_dev, __be32 prefix, __be32 mask);
 
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 78df0866..9d49426 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -46,7 +46,7 @@ struct fib_config {
 struct fib_info;
 
 struct fib_nh {
-	struct net_device	*nh_dev;
+	struct net_device __rcu	*nh_dev;
 	struct hlist_node	nh_hash;
 	struct fib_info		*nh_parent;
 	unsigned int		nh_flags;
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 10e15a1..be1e75c 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -164,7 +164,7 @@ struct net_device *__ip_dev_find(struct net *net, __be32 addr, bool devref)
 		if (local &&
 		    !fib_table_lookup(local, &fl4, &res, FIB_LOOKUP_NOREF) &&
 		    res.type == RTN_LOCAL)
-			result = FIB_RES_DEV(res);
+			result = rcu_dereference(FIB_RES_DEV(res));
 	}
 	if (result && devref)
 		dev_hold(result);
@@ -960,14 +960,15 @@ out:
 	return done;
 }
 
-__be32 inet_select_addr(const struct net_device *dev, __be32 dst, int scope)
+__be32 __inet_select_addr(struct net *net,
+			  const struct net_device *dev,
+			  __be32 dst, int scope)
 {
 	__be32 addr = 0;
 	struct in_device *in_dev;
-	struct net *net = dev_net(dev);
 
 	rcu_read_lock();
-	in_dev = __in_dev_get_rcu(dev);
+	in_dev = dev ? __in_dev_get_rcu(dev) : NULL;
 	if (!in_dev)
 		goto no_in_dev;
 
@@ -1007,7 +1008,7 @@ out_unlock:
 	rcu_read_unlock();
 	return addr;
 }
-EXPORT_SYMBOL(inet_select_addr);
+EXPORT_SYMBOL(__inet_select_addr);
 
 static __be32 confirm_addr_indev(struct in_device *in_dev, __be32 dst,
 			      __be32 local, int scope)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 3854411..2479884 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -159,7 +159,7 @@ static inline unsigned int __inet_dev_addr_type(struct net *net,
 		ret = RTN_UNICAST;
 		rcu_read_lock();
 		if (!fib_table_lookup(local_table, &fl4, &res, FIB_LOOKUP_NOREF)) {
-			if (!dev || dev == res.fi->fib_dev)
+			if (!dev || dev == rcu_dereference(res.fi->fib_dev))
 				ret = res.type;
 		}
 		rcu_read_unlock();
@@ -237,13 +237,13 @@ int fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, u8 tos,
 	for (ret = 0; ret < res.fi->fib_nhs; ret++) {
 		struct fib_nh *nh = &res.fi->fib_nh[ret];
 
-		if (nh->nh_dev == dev) {
+		if (rcu_dereference(nh->nh_dev) == dev) {
 			dev_match = true;
 			break;
 		}
 	}
 #else
-	if (FIB_RES_DEV(res) == dev)
+	if (rcu_dereference(FIB_RES_DEV(res)) == dev)
 		dev_match = true;
 #endif
 	if (dev_match) {
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index a8bdf74..a09f055 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -157,9 +157,13 @@ void free_fib_info(struct fib_info *fi)
 		return;
 	}
 	change_nexthops(fi) {
-		if (nexthop_nh->nh_dev)
-			dev_put(nexthop_nh->nh_dev);
-		nexthop_nh->nh_dev = NULL;
+		struct net_device *ndev;
+
+		ndev = rtnl_dereference(nexthop_nh->nh_dev);
+		if (ndev) {
+			dev_put(ndev);
+			RCU_INIT_POINTER(nexthop_nh->nh_dev, NULL);
+		}
 	} endfor_nexthops(fi);
 	fib_info_cnt--;
 	release_net(fi->fib_net);
@@ -273,7 +277,7 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev)
 	hash = fib_devindex_hashfn(dev->ifindex);
 	head = &fib_info_devhash[hash];
 	hlist_for_each_entry(nh, node, head, nh_hash) {
-		if (nh->nh_dev == dev &&
+		if (rcu_access_pointer(nh->nh_dev) == dev &&
 		    nh->nh_gw == gw &&
 		    !(nh->nh_flags & RTNH_F_DEAD)) {
 			spin_unlock(&fib_info_lock);
@@ -360,13 +364,17 @@ struct fib_alias *fib_find_alias(struct list_head *fah, u8 tos, u32 prio)
 	return NULL;
 }
 
+/* called in rcu_read_lock() section */
 int fib_detect_death(struct fib_info *fi, int order,
 		     struct fib_info **last_resort, int *last_idx, int dflt)
 {
-	struct neighbour *n;
+	struct neighbour *n = NULL;
 	int state = NUD_NONE;
+	struct net_device *ndev = rcu_dereference(fi->fib_dev);
+
+	if (ndev)
+		n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, ndev);
 
-	n = neigh_lookup(&arp_tbl, &fi->fib_nh[0].nh_gw, fi->fib_dev);
 	if (n) {
 		state = n->nud_state;
 		neigh_release(n);
@@ -551,7 +559,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 				return -ENODEV;
 			if (!(dev->flags & IFF_UP))
 				return -ENETDOWN;
-			nh->nh_dev = dev;
+			rcu_assign_pointer(nh->nh_dev, dev);
 			dev_hold(dev);
 			nh->nh_scope = RT_SCOPE_LINK;
 			return 0;
@@ -578,7 +586,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 			goto out;
 		nh->nh_scope = res.scope;
 		nh->nh_oif = FIB_RES_OIF(res);
-		nh->nh_dev = dev = FIB_RES_DEV(res);
+		dev = rcu_dereference(FIB_RES_DEV(res));
+		rcu_assign_pointer(nh->nh_dev, dev);
 		if (!dev)
 			goto out;
 		dev_hold(dev);
@@ -597,8 +606,8 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 		err = -ENETDOWN;
 		if (!(in_dev->dev->flags & IFF_UP))
 			goto out;
-		nh->nh_dev = in_dev->dev;
-		dev_hold(nh->nh_dev);
+		rcu_assign_pointer(nh->nh_dev, in_dev->dev);
+		dev_hold(in_dev->dev);
 		nh->nh_scope = RT_SCOPE_HOST;
 		err = 0;
 	}
@@ -695,9 +704,14 @@ static void fib_info_hash_move(struct hlist_head *new_info_hash,
 
 __be32 fib_info_update_nh_saddr(struct net *net, struct fib_nh *nh)
 {
-	nh->nh_saddr = inet_select_addr(nh->nh_dev,
-					nh->nh_gw,
-					nh->nh_parent->fib_scope);
+	struct net_device *ndev;
+
+	rcu_read_lock();
+	ndev = rcu_dereference(nh->nh_dev);
+	nh->nh_saddr = __inet_select_addr(net, ndev,
+					  nh->nh_gw,
+					  nh->nh_parent->fib_scope);
+	rcu_read_unlock();
 	nh->nh_saddr_genid = atomic_read(&net->ipv4.dev_addr_genid);
 
 	return nh->nh_saddr;
@@ -843,7 +857,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 		if (nhs != 1 || nh->nh_gw)
 			goto err_inval;
 		nh->nh_scope = RT_SCOPE_NOWHERE;
-		nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif);
+		RCU_INIT_POINTER(nh->nh_dev,
+				 dev_get_by_index(net, fi->fib_nh->nh_oif));
 		err = -ENODEV;
 		if (nh->nh_dev == NULL)
 			goto failure;
@@ -889,10 +904,11 @@ link_it:
 	change_nexthops(fi) {
 		struct hlist_head *head;
 		unsigned int hash;
+		struct net_device *ndev = rtnl_dereference(nexthop_nh->nh_dev);
 
-		if (!nexthop_nh->nh_dev)
+		if (!ndev)
 			continue;
-		hash = fib_devindex_hashfn(nexthop_nh->nh_dev->ifindex);
+		hash = fib_devindex_hashfn(ndev->ifindex);
 		head = &fib_info_devhash[hash];
 		hlist_add_head(&nexthop_nh->nh_hash, head);
 	} endfor_nexthops(fi)
@@ -1049,14 +1065,14 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 		int dead;
 
 		BUG_ON(!fi->fib_nhs);
-		if (nh->nh_dev != dev || fi == prev_fi)
+		if (rtnl_dereference(nh->nh_dev) != dev || fi == prev_fi)
 			continue;
 		prev_fi = fi;
 		dead = 0;
 		change_nexthops(fi) {
 			if (nexthop_nh->nh_flags & RTNH_F_DEAD)
 				dead++;
-			else if (nexthop_nh->nh_dev == dev &&
+			else if (rtnl_dereference(nexthop_nh->nh_dev) == dev &&
 				 nexthop_nh->nh_scope != scope) {
 				nexthop_nh->nh_flags |= RTNH_F_DEAD;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
@@ -1068,7 +1084,8 @@ int fib_sync_down_dev(struct net_device *dev, int force)
 				dead++;
 			}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-			if (force > 1 && nexthop_nh->nh_dev == dev) {
+			if (force > 1 &&
+			    rtnl_dereference(nexthop_nh->nh_dev) == dev) {
 				dead = fi->fib_nhs;
 				break;
 			}
@@ -1167,20 +1184,23 @@ int fib_sync_up(struct net_device *dev)
 		int alive;
 
 		BUG_ON(!fi->fib_nhs);
-		if (nh->nh_dev != dev || fi == prev_fi)
+		if (rtnl_dereference(nh->nh_dev) != dev || fi == prev_fi)
 			continue;
 
 		prev_fi = fi;
 		alive = 0;
 		change_nexthops(fi) {
+			struct net_device *ndev;
+
 			if (!(nexthop_nh->nh_flags & RTNH_F_DEAD)) {
 				alive++;
 				continue;
 			}
-			if (nexthop_nh->nh_dev == NULL ||
-			    !(nexthop_nh->nh_dev->flags & IFF_UP))
+			ndev = rtnl_dereference(nexthop_nh->nh_dev);
+			if (ndev == NULL ||
+			    !(ndev->flags & IFF_UP))
 				continue;
-			if (nexthop_nh->nh_dev != dev ||
+			if (ndev != dev ||
 			    !__in_dev_get_rtnl(dev))
 				continue;
 			alive++;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 30b88d7..3861ba0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2556,11 +2556,14 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 			    || fa->fa_type == RTN_MULTICAST)
 				continue;
 
-			if (fi)
+			if (fi) {
+				struct net_device *ndev;
+
+				ndev = rcu_dereference(fi->fib_dev);
 				seq_printf(seq,
 					 "%s\t%08X\t%08X\t%04X\t%d\t%u\t"
 					 "%d\t%08X\t%d\t%u\t%u%n",
-					 fi->fib_dev ? fi->fib_dev->name : "*",
+					 ndev ? ndev->name : "*",
 					 prefix,
 					 fi->fib_nh->nh_gw, flags, 0, 0,
 					 fi->fib_priority,
@@ -2569,13 +2572,13 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 					  fi->fib_advmss + 40 : 0),
 					 fi->fib_window,
 					 fi->fib_rtt >> 3, &len);
-			else
+			} else {
 				seq_printf(seq,
 					 "*\t%08X\t%08X\t%04X\t%d\t%u\t"
 					 "%d\t%08X\t%d\t%u\t%u%n",
 					 prefix, 0, flags, 0, 0, 0,
 					 mask, 0, 0, 0, &len);
-
+			}
 			seq_printf(seq, "%*s\n", 127 - len, "");
 		}
 	}
diff --git a/net/ipv4/netfilter/ipt_rpfilter.c b/net/ipv4/netfilter/ipt_rpfilter.c
index 31371be..bdc9393 100644
--- a/net/ipv4/netfilter/ipt_rpfilter.c
+++ b/net/ipv4/netfilter/ipt_rpfilter.c
@@ -52,13 +52,13 @@ static bool rpfilter_lookup_reverse(struct flowi4 *fl4,
 	for (ret = 0; ret < res.fi->fib_nhs; ret++) {
 		struct fib_nh *nh = &res.fi->fib_nh[ret];
 
-		if (nh->nh_dev == dev) {
+		if (rcu_dereference(nh->nh_dev) == dev) {
 			dev_match = true;
 			break;
 		}
 	}
 #else
-	if (FIB_RES_DEV(res) == dev)
+	if (rcu_dereference(FIB_RES_DEV(res)) == dev)
 		dev_match = true;
 #endif
 	if (dev_match || flags & XT_RPFILTER_LOOSE)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ffcb3b0..b56b91e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2127,7 +2127,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	u32 itag;
 
 	/* get a working reference to the output device */
-	out_dev = __in_dev_get_rcu(FIB_RES_DEV(*res));
+	out_dev = __in_dev_get_rcu(rcu_dereference(FIB_RES_DEV(*res)));
 	if (out_dev == NULL) {
 		net_crit_ratelimited("Bug in ip_route_input_slow(). Please report.\n");
 		return -EINVAL;
@@ -2786,7 +2786,7 @@ static struct rtable *ip_route_output_slow(struct net *net, struct flowi4 *fl4)
 	if (!fl4->saddr)
 		fl4->saddr = FIB_RES_PREFSRC(net, res);
 
-	dev_out = FIB_RES_DEV(res);
+	dev_out = rcu_dereference(FIB_RES_DEV(res));
 	fl4->flowi4_oif = dev_out->ifindex;
 
 



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  6:37             ` Eric Dumazet
@ 2012-05-23  6:43               ` Eric Dumazet
  0 siblings, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  6:43 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 08:37 +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index ffcb3b0..b56b91e 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2127,7 +2127,7 @@ static int __mkroute_input(struct sk_buff *skb,
>  	u32 itag;
>  
>  	/* get a working reference to the output device */
> -	out_dev = __in_dev_get_rcu(FIB_RES_DEV(*res));
> +	out_dev = __in_dev_get_rcu(rcu_dereference(FIB_RES_DEV(*res)));

This part might need additional fix (if FIB_RES_DEV(*res) is NULL),
because __in_dev_get_rcu() could crash dereferencing NULL pointer.








^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  6:27               ` Eric Dumazet
@ 2012-05-23  6:47                 ` Yanmin Zhang
  2012-05-23  6:55                   ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  6:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 08:27 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 14:15 +0800, Yanmin Zhang wrote:
> 
> > 
> > Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
> > and nexthop_nh->nh_hash.
> 
> No, its not needed.
I am checking it now.
fib_create_info=>fib_find_info, but fib_find_info is not protected by
fib_info_lock. See the codes:


link_it:
        ofi = fib_find_info(fi);
        if (ofi) {
                fi->fib_dead = 1;
                free_fib_info(fi);
                ofi->fib_treeref++;
                return ofi;
        }

> 
> I am sending a patch, because I feel this area is too complex for non
> netdev guys.
Indeed, it's complicated. I will test your patches.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  6:47                 ` Yanmin Zhang
@ 2012-05-23  6:55                   ` Eric Dumazet
  2012-05-23  7:13                     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  6:55 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 14:47 +0800, Yanmin Zhang wrote:
> On Wed, 2012-05-23 at 08:27 +0200, Eric Dumazet wrote:
> > On Wed, 2012-05-23 at 14:15 +0800, Yanmin Zhang wrote:
> > 
> > > 
> > > Besides fi->fib_dev here, we need check fi->fib_hash, fi->fib_lhash,
> > > and nexthop_nh->nh_hash.
> > 
> > No, its not needed.
> I am checking it now.
> fib_create_info=>fib_find_info, but fib_find_info is not protected by
> fib_info_lock. See the codes:
> 
> 
> link_it:
>         ofi = fib_find_info(fi);
>         if (ofi) {
>                 fi->fib_dead = 1;
>                 free_fib_info(fi);
>                 ofi->fib_treeref++;
>                 return ofi;
>         }
> 
> > 
> > I am sending a patch, because I feel this area is too complex for non
> > netdev guys.
> Indeed, it's complicated. I will test your patches.

Please hold on, I'll send a v2




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  6:55                   ` Eric Dumazet
@ 2012-05-23  7:13                     ` Eric Dumazet
  2012-05-23  7:24                       ` Yanmin Zhang
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  7:13 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:

> Please hold on, I'll send a v2

I believe your patch should be fine, if you move back the
fib_info_cnt--;

So only do the dev_put() in free_fib_info_rcu().

No need to clear nh_dev to NULL since we are freeing fi at the end of
function.




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  7:13                     ` Eric Dumazet
@ 2012-05-23  7:24                       ` Yanmin Zhang
  2012-05-23  7:39                         ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  7:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 09:13 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:
> 
> > Please hold on, I'll send a v2
> 
> I believe your patch should be fine, if you move back the
> fib_info_cnt--;
> 
> So only do the dev_put() in free_fib_info_rcu().
We would do so in a new patch.

> 
> No need to clear nh_dev to NULL since we are freeing fi at the end of
> function.
David suggests to reset it to NULL to detect other potential
race conditions.

Besides above suggestions, how do you think about:

fib_create_info=>fib_find_info, but fib_find_info is not protected by
fib_info_lock. See the codes:

fib_create_info()
{
	...
link_it:
        ofi = fib_find_info(fi);
        if (ofi) {
                fi->fib_dead = 1;
                free_fib_info(fi);
                ofi->fib_treeref++;
                return ofi;
        }
        fi->fib_treeref++;
        atomic_inc(&fi->fib_clntref);
        spin_lock_bh(&fib_info_lock);

	...
}

I plan to change it to hold fib_info_lock before calling fib_find_info. Is
it ok for you?

Thanks for the direct speaking.

Yanmin



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  7:24                       ` Yanmin Zhang
@ 2012-05-23  7:39                         ` Eric Dumazet
  2012-05-23  7:41                           ` Eric Dumazet
  2012-05-23  7:47                           ` Yanmin Zhang
  0 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  7:39 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 15:24 +0800, Yanmin Zhang wrote:
> On Wed, 2012-05-23 at 09:13 +0200, Eric Dumazet wrote:
> > On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:
> > 
> > > Please hold on, I'll send a v2
> > 
> > I believe your patch should be fine, if you move back the
> > fib_info_cnt--;
> > 
> > So only do the dev_put() in free_fib_info_rcu().
> We would do so in a new patch.
> 
> > 
> > No need to clear nh_dev to NULL since we are freeing fi at the end of
> > function.
> David suggests to reset it to NULL to detect other potential
> race conditions.
> 

Yes but no.

Users are in a RCU read lock and we should not change nh_dev to NULL
while they are running.

Thats what I tried to do (David message gave me this wrong hint) but it
came to a dead issue.

Only after a grace period we can :
	dev_put() all involved net_devices
	kfree(fi)

> Besides above suggestions, how do you think about:
> 
> fib_create_info=>fib_find_info, but fib_find_info is not protected by
> fib_info_lock. See the codes:
> 
> fib_create_info()
> {
> 	...
> link_it:
>         ofi = fib_find_info(fi);
>         if (ofi) {
>                 fi->fib_dead = 1;
>                 free_fib_info(fi);
>                 ofi->fib_treeref++;
>                 return ofi;
>         }
>         fi->fib_treeref++;
>         atomic_inc(&fi->fib_clntref);
>         spin_lock_bh(&fib_info_lock);
> 
> 	...
> }
> 
> I plan to change it to hold fib_info_lock before calling fib_find_info. Is
> it ok for you?

Its not needed we hold RTNL mutex.

spinlock is needed mainly because of ip_fib_check_default(), but this
could be changed to use RCU as well. 

(readers use RCU, writers already hold RTNL -> no more fib_info_lock )




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  7:39                         ` Eric Dumazet
@ 2012-05-23  7:41                           ` Eric Dumazet
  2012-05-23  7:47                           ` Yanmin Zhang
  1 sibling, 0 replies; 22+ messages in thread
From: Eric Dumazet @ 2012-05-23  7:41 UTC (permalink / raw)
  To: Yanmin Zhang; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 09:39 +0200, Eric Dumazet wrote:

> spinlock is needed mainly because of ip_fib_check_default(), but this
> could be changed to use RCU as well. 
> 
> (readers use RCU, writers already hold RTNL -> no more fib_info_lock )

Since its not a fast path and hash table can be resized, its probably
not worth the pain.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow
  2012-05-23  7:39                         ` Eric Dumazet
  2012-05-23  7:41                           ` Eric Dumazet
@ 2012-05-23  7:47                           ` Yanmin Zhang
  1 sibling, 0 replies; 22+ messages in thread
From: Yanmin Zhang @ 2012-05-23  7:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, kunx.jiang, netdev, linux-kernel

On Wed, 2012-05-23 at 09:39 +0200, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 15:24 +0800, Yanmin Zhang wrote:
> > On Wed, 2012-05-23 at 09:13 +0200, Eric Dumazet wrote:
> > > On Wed, 2012-05-23 at 08:55 +0200, Eric Dumazet wrote:
> > > 
> > > > Please hold on, I'll send a v2
> > > 
> > > I believe your patch should be fine, if you move back the
> > > fib_info_cnt--;
> > > 
> > > So only do the dev_put() in free_fib_info_rcu().
> > We would do so in a new patch.
> > 
> > > 
> > > No need to clear nh_dev to NULL since we are freeing fi at the end of
> > > function.
> > David suggests to reset it to NULL to detect other potential
> > race conditions.
> > 
> 
> Yes but no.
> 
> Users are in a RCU read lock and we should not change nh_dev to NULL
> while they are running.
> 
> Thats what I tried to do (David message gave me this wrong hint) but it
> came to a dead issue.
> 
> Only after a grace period we can :
> 	dev_put() all involved net_devices
> 	kfree(fi)
> 
> > Besides above suggestions, how do you think about:
> > 
> > fib_create_info=>fib_find_info, but fib_find_info is not protected by
> > fib_info_lock. See the codes:
> > 
> > fib_create_info()
> > {
> > 	...
> > link_it:
> >         ofi = fib_find_info(fi);
> >         if (ofi) {
> >                 fi->fib_dead = 1;
> >                 free_fib_info(fi);
> >                 ofi->fib_treeref++;
> >                 return ofi;
> >         }
> >         fi->fib_treeref++;
> >         atomic_inc(&fi->fib_clntref);
> >         spin_lock_bh(&fib_info_lock);
> > 
> > 	...
> > }
> > 
> > I plan to change it to hold fib_info_lock before calling fib_find_info. Is
> > it ok for you?
> 
> Its not needed we hold RTNL mutex.
Thanks. We would work out a new patch and post it here after running MTBF
testing.



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2012-05-23  7:47 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-22  9:48 [PATCH] ipv4: fix the rcu race between free_fib_info and ip_route_output_slow kun.jiang
2012-05-22 19:15 ` David Miller
2012-05-23  3:02   ` Yanmin Zhang
2012-05-23  3:23     ` David Miller
2012-05-23  3:30       ` Yanmin Zhang
2012-05-23  3:49         ` David Miller
2012-05-23  4:41           ` Yanmin Zhang
2012-05-23  5:08             ` Eric Dumazet
2012-05-23  4:37       ` Eric Dumazet
2012-05-23  4:54         ` Yanmin Zhang
2012-05-23  5:02           ` Eric Dumazet
2012-05-23  6:15             ` Yanmin Zhang
2012-05-23  6:27               ` Eric Dumazet
2012-05-23  6:47                 ` Yanmin Zhang
2012-05-23  6:55                   ` Eric Dumazet
2012-05-23  7:13                     ` Eric Dumazet
2012-05-23  7:24                       ` Yanmin Zhang
2012-05-23  7:39                         ` Eric Dumazet
2012-05-23  7:41                           ` Eric Dumazet
2012-05-23  7:47                           ` Yanmin Zhang
2012-05-23  6:37             ` Eric Dumazet
2012-05-23  6:43               ` Eric Dumazet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).