netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
@ 2010-04-20 17:44 Jiri Bohac
  2010-04-20 17:57 ` Eric Dumazet
  2010-04-21 21:34 ` Jiri Bohac
  0 siblings, 2 replies; 28+ messages in thread
From: Jiri Bohac @ 2010-04-20 17:44 UTC (permalink / raw)
  To: netdev; +Cc: Hideaki YOSHIFUJI, David Miller

Hi,

I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
to dst_free().

__ipv6_ifa_notify() contains:

        case RTM_DELADDR:
                if (ifp->idev->cnf.forwarding)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
                if (ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;

AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
deletes the route: 
	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
	-> rt6_release() -> dst_free()

If it fails (like when it races with another invocation of ip6_del_rt()), it
will return nonzero and this will cause the above code to call dst_free() on its own.

dst_free() has no protection against concurrent invocation and if
two invocations make it through the "if (dst->obsolete > 1)"
check before one of them calls __dst_free(), the same dst_entry
may end up either:
	1) dst_destroy()ed and put on the dst_garbage.list, or
	2) put on the dst_garbage.list twice
both resulting in trouble once the GC is run.

One possible code path leading to two invocations of __ipv6_ifa_notify() seems
to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
the bonding master is in the DAD phase with a tentative address:

netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
 -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
 -> __ipv6_ifa_notify


What is the reason __ipv6_ifa_notify() calls dst_free() when
ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
with the dst still needing to be freed.

I am just testing whether the following will help:

--- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
+++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
@@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
 			addrconf_leave_anycast(ifp);
 		addrconf_leave_solict(ifp->idev, &ifp->addr);
 		dst_hold(&ifp->rt->u.dst);
-		if (ip6_del_rt(ifp->rt))
-			dst_free(&ifp->rt->u.dst);
+		ip6_del_rt(ifp->rt);
 		break;
 	}
 }

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-20 17:44 IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
@ 2010-04-20 17:57 ` Eric Dumazet
  2010-04-20 20:49   ` Jiri Bohac
  2010-04-21 21:34 ` Jiri Bohac
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-20 17:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller

Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> Hi,
> 
> I found what I believe is a race condition in __ipv6_ifa_notify(), in the call
> to dst_free().
> 
> __ipv6_ifa_notify() contains:
> 
>         case RTM_DELADDR:
>                 if (ifp->idev->cnf.forwarding)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
>                 if (ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;
> 
> AFAICT, ip6_del_rt() will call dst_free() itself if it finds and actually
> deletes the route: 
> 	ip6_del_rt() -> __ip6_del_rt() -> fib6_del() -> fib6_del_route() ->
> 	-> rt6_release() -> dst_free()
> 
> If it fails (like when it races with another invocation of ip6_del_rt()), it
> will return nonzero and this will cause the above code to call dst_free() on its own.
> 
> dst_free() has no protection against concurrent invocation and if

Sorry ? of course dst_free() has a protection...

By definition, the dst_destroy() is called only by the last thread with
the final refcount on object.

> two invocations make it through the "if (dst->obsolete > 1)"
> check before one of them calls __dst_free(), the same dst_entry
> may end up either:
> 	1) dst_destroy()ed and put on the dst_garbage.list, or
> 	2) put on the dst_garbage.list twice
> both resulting in trouble once the GC is run.
> 
> One possible code path leading to two invocations of __ipv6_ifa_notify() seems
> to be when two bonding slaves receive a NS/NA with the bonds IPv6 address when
> the bonding master is in the DAD phase with a tentative address:
> 
> netif_receive_skb() gets invoked on two CPUs and sets skb->dev to the bonding master ...
> ... ip6_mc_input() -> ip6_input_finish() -> icmpv6_rcv() -> ndisc_rcv() ->
>  -> ndisc_recv_ns() -> addrconf_dad_failure() -> ipv6_del_addr() -> ipv6_ifa_notify() ->
>  -> __ipv6_ifa_notify
> 
> 
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.
> 
> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }
> 


I dont understand the problem Jiri.

We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
dst_free(), or we leak a refcount.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-20 17:57 ` Eric Dumazet
@ 2010-04-20 20:49   ` Jiri Bohac
  2010-04-20 20:57     ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Jiri Bohac @ 2010-04-20 20:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller

On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> >  			addrconf_leave_anycast(ifp);
> >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> >  		dst_hold(&ifp->rt->u.dst);
> > -		if (ip6_del_rt(ifp->rt))
> > -			dst_free(&ifp->rt->u.dst);
> > +		ip6_del_rt(ifp->rt);
> >  		break;
> >  	}
> >  }
> > 
> 
> 
> I dont understand the problem Jiri.
> 
> We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> dst_free(), or we leak a refcount.

Well, no ... dst_free() does not decrement the refcount.
The "opposite" of dst_hold() is dst_release(). And it does not
automatically call dst_free() when the refcount drops to 0.
dst_free() needs to be called explicitly and it apparently
expects the caller to ensure that two dst_free()s won't be called
simultaneously. But my bonding example shows this is not the
case.


-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-20 20:49   ` Jiri Bohac
@ 2010-04-20 20:57     ` Eric Dumazet
  2010-04-20 21:16       ` Stephen Hemminger
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2010-04-20 20:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller, Stephen Hemminger

Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > >  			addrconf_leave_anycast(ifp);
> > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > >  		dst_hold(&ifp->rt->u.dst);
> > > -		if (ip6_del_rt(ifp->rt))
> > > -			dst_free(&ifp->rt->u.dst);
> > > +		ip6_del_rt(ifp->rt);
> > >  		break;
> > >  	}
> > >  }
> > > 
> > 
> > 
> > I dont understand the problem Jiri.
> > 
> > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > dst_free(), or we leak a refcount.
> 
> Well, no ... dst_free() does not decrement the refcount.
> The "opposite" of dst_hold() is dst_release(). And it does not
> automatically call dst_free() when the refcount drops to 0.
> dst_free() needs to be called explicitly and it apparently
> expects the caller to ensure that two dst_free()s won't be called
> simultaneously. But my bonding example shows this is not the
> case.
> 
> 

Ah yes you're right

Maybe ask Stephen ?

commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
Author: stephen hemminger <shemminger@vyatta.com>
Date:   Mon Apr 12 05:41:31 2010 +0000

    IPv6: keep route for tentative address
    
    Recent changes preserve IPv6 address when link goes down (good).
    But would cause address to point to dead dst entry (bad).
    The simplest fix is to just not delete route if address is
    being held for later use.
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1b00bfe..a9913d2 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
inet6_ifaddr *ifp)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
-               if (ip6_del_rt(ifp->rt))
+
+               if (ifp->dead && ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;
        }



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-20 20:57     ` Eric Dumazet
@ 2010-04-20 21:16       ` Stephen Hemminger
  2010-04-20 21:35         ` Jiri Bohac
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2010-04-20 21:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller

On Tue, 20 Apr 2010 22:57:23 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> > On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > > >  			addrconf_leave_anycast(ifp);
> > > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > > >  		dst_hold(&ifp->rt->u.dst);
> > > > -		if (ip6_del_rt(ifp->rt))
> > > > -			dst_free(&ifp->rt->u.dst);
> > > > +		ip6_del_rt(ifp->rt);
> > > >  		break;
> > > >  	}
> > > >  }
> > > > 
> > > 
> > > 
> > > I dont understand the problem Jiri.
> > > 
> > > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > > dst_free(), or we leak a refcount.
> > 
> > Well, no ... dst_free() does not decrement the refcount.
> > The "opposite" of dst_hold() is dst_release(). And it does not
> > automatically call dst_free() when the refcount drops to 0.
> > dst_free() needs to be called explicitly and it apparently
> > expects the caller to ensure that two dst_free()s won't be called
> > simultaneously. But my bonding example shows this is not the
> > case.
> > 
> > 
> 
> Ah yes you're right
> 
> Maybe ask Stephen ?
> 
> commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
> Author: stephen hemminger <shemminger@vyatta.com>
> Date:   Mon Apr 12 05:41:31 2010 +0000
> 
>     IPv6: keep route for tentative address
>     
>     Recent changes preserve IPv6 address when link goes down (good).
>     But would cause address to point to dead dst entry (bad).
>     The simplest fix is to just not delete route if address is
>     being held for later use.
>     
>     Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 1b00bfe..a9913d2 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
> inet6_ifaddr *ifp)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
> -               if (ip6_del_rt(ifp->rt))
> +
> +               if (ifp->dead && ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;


Is this problem new to net-next where we hold onto addresses, or was
the issue there before?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-20 21:16       ` Stephen Hemminger
@ 2010-04-20 21:35         ` Jiri Bohac
  0 siblings, 0 replies; 28+ messages in thread
From: Jiri Bohac @ 2010-04-20 21:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller

On Tue, Apr 20, 2010 at 02:16:55PM -0700, Stephen Hemminger wrote:
> Is this problem new to net-next where we hold onto addresses, or was
> the issue there before?

Before. I am seeing it on 2.6.32.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-20 17:44 IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
  2010-04-20 17:57 ` Eric Dumazet
@ 2010-04-21 21:34 ` Jiri Bohac
  2010-04-22  2:32   ` Herbert Xu
  1 sibling, 1 reply; 28+ messages in thread
From: Jiri Bohac @ 2010-04-21 21:34 UTC (permalink / raw)
  To: Hideaki YOSHIFUJI, Herbert Xu; +Cc: netdev, David Miller, Stephen Hemminger

Hi,

On Tue, Apr 20, 2010 at 07:44:01PM +0200, Jiri Bohac wrote:
> What is the reason __ipv6_ifa_notify() calls dst_free() when
> ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> with the dst still needing to be freed.

checked again and I still think that if ip6_del_rt() fails,
ifp->rt must have been freed already. Anybody with a
counterexample?

> I am just testing whether the following will help:
> 
> --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
>  			addrconf_leave_anycast(ifp);
>  		addrconf_leave_solict(ifp->idev, &ifp->addr);
>  		dst_hold(&ifp->rt->u.dst);
> -		if (ip6_del_rt(ifp->rt))
> -			dst_free(&ifp->rt->u.dst);
> +		ip6_del_rt(ifp->rt);
>  		break;
>  	}
>  }

not surprisingly, it helps my case -- the race condition does not
happen and the resulting oopses disappears. Of course, this does
not prove the patch is correct.

Could anybody with more insight into the dst refcounting please
take a look? The code originally came from:

	Author: Hideaki Yoshifuji <yoshfuji@linux-ipv6.org>
	Date:   Wed Aug 18 05:25:16 2004 +0900

	    [IPV6] refer inet6 device via corresponding local route from address structure.

And has been modified later by:
	commit 4641e7a334adf6856300a98e7296dfc886c446af
	Author: Herbert Xu <herbert@gondor.apana.org.au>
	Date:   Thu Feb 2 16:55:45 2006 -0800

	    [IPV6]: Don't hold extra ref count in ipv6_ifa_notify

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-21 21:34 ` Jiri Bohac
@ 2010-04-22  2:32   ` Herbert Xu
  2010-04-22  7:43     ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2010-04-22  2:32 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Hideaki YOSHIFUJI, netdev, David Miller, Stephen Hemminger

On Wed, Apr 21, 2010 at 11:34:29PM +0200, Jiri Bohac wrote:
> Hi,
> 
> On Tue, Apr 20, 2010 at 07:44:01PM +0200, Jiri Bohac wrote:
> > What is the reason __ipv6_ifa_notify() calls dst_free() when
> > ip6_del_rt() fails? I don't see a way ip6_del_rt() could fail
> > with the dst still needing to be freed.
> 
> checked again and I still think that if ip6_del_rt() fails,
> ifp->rt must have been freed already. Anybody with a
> counterexample?

I agree with your diagnosis and the two duplicate NDISC messages
scenario sounds plausible.

Anyway, I think the root of the issue is the fact that NDISC is
calling addrconf_dad_failure with no locking whatsoever.  The
latter is not idempotent so some form of locking is needed.

This bug appears to have been around since the very start.

I'll dig deeper to see where we might be able to add some locks.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-22  2:32   ` Herbert Xu
@ 2010-04-22  7:43     ` David Miller
  2010-04-22 14:25       ` Herbert Xu
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-04-22  7:43 UTC (permalink / raw)
  To: herbert; +Cc: jbohac, yoshfuji, netdev, shemminger

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 22 Apr 2010 10:32:11 +0800

> Anyway, I think the root of the issue is the fact that NDISC is
> calling addrconf_dad_failure with no locking whatsoever.  The
> latter is not idempotent so some form of locking is needed.
> 
> This bug appears to have been around since the very start.
> 
> I'll dig deeper to see where we might be able to add some locks.

Thanks Herbert.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-22  7:43     ` David Miller
@ 2010-04-22 14:25       ` Herbert Xu
  2010-04-22 15:49         ` Jiri Bohac
  0 siblings, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2010-04-22 14:25 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger

On Thu, Apr 22, 2010 at 12:43:24AM -0700, David Miller wrote:
>
> Thanks Herbert.

No worries :)

BTW similar races exist in other NDISC receive functions, but
it's too late today so I'll look at this tomorrow unless someone
else wants to have a go at this.

ipv6: Prevent DAD races

The NDISC receive path has not been written in a way that handles
unexpected packets properly.

For example, if we get two identical simultaneous NA/NS packets
that result in a DAD failure, we may try to delete the same address
twice.

A similar problem occurs when we get a DAD failure just as we're
about to mark an address as having passed DAD.

This patch fixes this by using the DADFAILED bit to synchronise
the two paths while holding the ifp lock.  It relies on the fact
that the TENTATIVE bit is always set during DAD, and that the
DADFAILED bit is only set on failure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index de7a194..1d15d5e 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1401,6 +1401,16 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
 void addrconf_dad_failure(struct inet6_ifaddr *ifp)
 {
 	struct inet6_dev *idev = ifp->idev;
+	int ignore;
+
+	spin_lock(&ifp->lock);
+	ignore = (ifp->flags & (IFA_F_DADFAILED|IFA_F_TENTATIVE)) ^
+		 IFA_F_TENTATIVE;
+	ifp->flags |= IFA_F_DADFAILED;
+	spin_unlock(&ifp->lock);
+
+	if (ignore)
+		return;
 
 	if (net_ratelimit())
 		printk(KERN_INFO "%s: IPv6 duplicate address %pI6c detected!\n",
@@ -2789,7 +2799,10 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 	read_lock_bh(&idev->lock);
 	if (ifp->dead)
 		goto out;
+
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED)
+		goto unlock_ifp;
 
 	if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
 	    idev->cnf.accept_dad < 1 ||
@@ -2824,6 +2837,7 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 		ip6_ins_rt(ifp->rt);
 
 	addrconf_dad_kick(ifp);
+unlock_ifp:
 	spin_unlock_bh(&ifp->lock);
 out:
 	read_unlock_bh(&idev->lock);
@@ -2841,6 +2855,11 @@ static void addrconf_dad_timer(unsigned long data)
 		goto out;
 	}
 	spin_lock_bh(&ifp->lock);
+	if (ifp->flags & IFA_F_DADFAILED) {
+		spin_unlock_bh(&ifp->lock);
+		read_unlock_bh(&idev->lock);
+		goto out;
+	}
 	if (ifp->probes == 0) {
 		/*
 		 * DAD was successful

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-22 14:25       ` Herbert Xu
@ 2010-04-22 15:49         ` Jiri Bohac
  2010-04-22 16:17           ` Stephen Hemminger
  2010-04-23  1:54           ` David Miller
  0 siblings, 2 replies; 28+ messages in thread
From: Jiri Bohac @ 2010-04-22 15:49 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev, shemminger

On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> This patch fixes this by using the DADFAILED bit to synchronise
> the two paths while holding the ifp lock.  It relies on the fact
> that the TENTATIVE bit is always set during DAD, and that the
> DADFAILED bit is only set on failure.

But the addr_dad_failure()->...->ipv6_del_addr() path will
still race with any other path calling ipv6_del_addr() (e.g. a
manual address removal). Won't it?

I still don't see why __ipv6_ifa_notify() needs to call
dst_free(). Shouldn't that be dst_release() instead, to drop the
reference obtained by dst_hold(&ifp->rt->u.dst)?

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-22 15:49         ` Jiri Bohac
@ 2010-04-22 16:17           ` Stephen Hemminger
  2010-04-23  1:54           ` David Miller
  1 sibling, 0 replies; 28+ messages in thread
From: Stephen Hemminger @ 2010-04-22 16:17 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: Herbert Xu, David Miller, yoshfuji, netdev

On Thu, 22 Apr 2010 17:49:08 +0200
Jiri Bohac <jbohac@suse.cz> wrote:

> On Thu, Apr 22, 2010 at 10:25:06PM +0800, Herbert Xu wrote:
> > This patch fixes this by using the DADFAILED bit to synchronise
> > the two paths while holding the ifp lock.  It relies on the fact
> > that the TENTATIVE bit is always set during DAD, and that the
> > DADFAILED bit is only set on failure.
> 
> But the addr_dad_failure()->...->ipv6_del_addr() path will
> still race with any other path calling ipv6_del_addr() (e.g. a
> manual address removal). Won't it?
> 
> I still don't see why __ipv6_ifa_notify() needs to call
> dst_free(). Shouldn't that be dst_release() instead, to drop the
> reference obtained by dst_hold(&ifp->rt->u.dst)?

Yes, some more locking and race condition management is needed.

Something like the following (untested):

--- a/net/ipv6/addrconf.c	2010-04-22 09:11:54.594827858 -0700
+++ b/net/ipv6/addrconf.c	2010-04-22 09:15:59.224631752 -0700
@@ -720,13 +720,18 @@ static void ipv6_del_addr(struct inet6_i
 
 	hash = ipv6_addr_hash(&ifp->addr);
 
+	write_lock_bh(&idev->lock);
+	if (ifp->dead) {
+		write_unlock(&idev->lock); /* lost race with DAD */
+		return;
+	}
+
 	ifp->dead = 1;
 
-	spin_lock_bh(&addrconf_hash_lock);
+	spin_lock(&addrconf_hash_lock);
 	hlist_del_init_rcu(&ifp->addr_lst);
-	spin_unlock_bh(&addrconf_hash_lock);
+	spin_unlock(&addrconf_hash_lock);
 
-	write_lock_bh(&idev->lock);
 #ifdef CONFIG_IPV6_PRIVACY
 	if (ifp->flags&IFA_F_TEMPORARY) {
 		list_del(&ifp->tmp_list);




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-22 15:49         ` Jiri Bohac
  2010-04-22 16:17           ` Stephen Hemminger
@ 2010-04-23  1:54           ` David Miller
  2010-04-23  2:10             ` Herbert Xu
  2010-04-27 15:50             ` IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
  1 sibling, 2 replies; 28+ messages in thread
From: David Miller @ 2010-04-23  1:54 UTC (permalink / raw)
  To: jbohac; +Cc: herbert, yoshfuji, netdev, shemminger

From: Jiri Bohac <jbohac@suse.cz>
Date: Thu, 22 Apr 2010 17:49:08 +0200

> I still don't see why __ipv6_ifa_notify() needs to call
> dst_free(). Shouldn't that be dst_release() instead, to drop the
> reference obtained by dst_hold(&ifp->rt->u.dst)?

It likely wants to do both.

Just doing dst_release() doesn't mark the 'dst' object as obsolete,
and therefore it won't get force garbage collected.

That's why the dst_free() is necessary, to really get rid of it when
the refcount does hit zero.

Actually, what's really interesting is that at the top of the
linux-2.6-history tree this code reads:

		dst_hold(&ifp->rt->u.dst);
		if (ip6_del_rt(ifp->rt, NULL, NULL))
			dst_free(&ifp->rt->u.dst);
		else
			dst_release(&ifp->rt->u.dst);

and in Linus's initial GIT import, it reads this way too.

Where did it change to the current form that lacks the
else block?

Aha!  Here it is:

commit 4641e7a334adf6856300a98e7296dfc886c446af
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date:   Thu Feb 2 16:55:45 2006 -0800

    [IPV6]: Don't hold extra ref count in ipv6_ifa_notify
    
    Currently the logic in ipv6_ifa_notify is to hold an extra reference
    count for addrconf dst's that get added to the routing table.  Thus,
    when addrconf dst entries are taken out of the routing table, we need
    to drop that dst.  However, addrconf dst entries may be removed from
    the routing table by means other than __ipv6_ifa_notify.
    
    So we're faced with the choice of either fixing up all places where
    addrconf dst entries are removed, or dropping the extra reference count
    altogether.
    
    I chose the latter because the ifp itself always holds a dst reference
    count of 1 while it's alive.  This is dropped just before we kfree the
    ifp object.  Therefore we know that in __ipv6_ifa_notify we will always
    hold that count.
    
    This bug was found by Eric W. Biederman.
    
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d328d59..1db5048 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3321,9 +3321,7 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 
 	switch (event) {
 	case RTM_NEWADDR:
-		dst_hold(&ifp->rt->u.dst);
-		if (ip6_ins_rt(ifp->rt, NULL, NULL, NULL))
-			dst_release(&ifp->rt->u.dst);
+		ip6_ins_rt(ifp->rt, NULL, NULL, NULL);
 		if (ifp->idev->cnf.forwarding)
 			addrconf_join_anycast(ifp);
 		break;
@@ -3334,8 +3332,6 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 		dst_hold(&ifp->rt->u.dst);
 		if (ip6_del_rt(ifp->rt, NULL, NULL, NULL))
 			dst_free(&ifp->rt->u.dst);
-		else
-			dst_release(&ifp->rt->u.dst);
 		break;
 	}
 }

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-23  1:54           ` David Miller
@ 2010-04-23  2:10             ` Herbert Xu
  2010-04-23 15:05               ` Herbert Xu
  2010-04-27 15:50             ` IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
  1 sibling, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2010-04-23  2:10 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger

On Thu, Apr 22, 2010 at 06:54:00PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Thu, 22 Apr 2010 17:49:08 +0200
> 
> > I still don't see why __ipv6_ifa_notify() needs to call
> > dst_free(). Shouldn't that be dst_release() instead, to drop the
> > reference obtained by dst_hold(&ifp->rt->u.dst)?
> 
> It likely wants to do both.

Actually I don't think the problem is in __ipv6_ifa_notify.  The
fact is none of this stuff is meant to be idempotent.  So it's
up to the entity that is requesting the deletion to make sure that
a single object is not deleted more than once.

Yes the original symptom was in __ipv6_ifa_notify, but it is
merely pointing out that we have a problem further up.

My patch is indeed not sufficient as Jiri pointed out, because
I didn't deal with the case of an administrative deletion of
autmatically generated IPv6 addresses.

I will post an updated patch later today to deal with that.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-23  2:10             ` Herbert Xu
@ 2010-04-23 15:05               ` Herbert Xu
  2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2010-04-23 15:05 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger

On Fri, Apr 23, 2010 at 10:10:00AM +0800, Herbert Xu wrote:
> 
> I will post an updated patch later today to deal with that.

Just got back from a business trip.

This stuff is more broken than I thought.  For example, we perform
a number of actions when DAD succeeds, e.g., joining an anycast
group.  However, this is not synchronised with respect to address
deletion at all, so if DAD succeeds just as someone deletes the
address, you can easily get stuck on that anycast group.

I will try to untangle this mess tomorrow.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-23  1:54           ` David Miller
  2010-04-23  2:10             ` Herbert Xu
@ 2010-04-27 15:50             ` Jiri Bohac
  2010-04-27 15:55               ` Herbert Xu
  1 sibling, 1 reply; 28+ messages in thread
From: Jiri Bohac @ 2010-04-27 15:50 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, herbert, yoshfuji, netdev, shemminger

On Thu, Apr 22, 2010 at 06:54:00PM -0700, David Miller wrote:
> From: Jiri Bohac <jbohac@suse.cz>
> Date: Thu, 22 Apr 2010 17:49:08 +0200
> 
> > I still don't see why __ipv6_ifa_notify() needs to call
> > dst_free(). Shouldn't that be dst_release() instead, to drop the
> > reference obtained by dst_hold(&ifp->rt->u.dst)?
> 
> It likely wants to do both.
> 
> Just doing dst_release() doesn't mark the 'dst' object as obsolete,
> and therefore it won't get force garbage collected.

Sure. So If I understand it correctly, there are two problems:

- the reference taken by dst_hold() just above the ip6_del_rt()
  is never dropped if ip6_del_rt() fails; so shouldn't the code
  be like this?:

	dst_hold(&ifp->rt->u.dst);
	if (ip6_del_rt(ifp->rt)) {
		dst_release(&ifp->rt->u.dst);
		dst_free(&ifp->rt->u.dst);
	}

- if ip6_del_rt() fails because it races with something else
  deleting the address, dst_free() will be called twice. This is
  what Herbert is fixing with additional locking. However -- even
  when he fixes that -- how can ip6_del_rt() fail with the
  ifp->rt still needing a dst_free()?
  AFAICS, it can fail by:
  	- __ip6_del_rt() finding that (rt ==
	  net->ipv6.ip6_null_entry); we don't want to call
	  dst_free() on net->ipv6.ip6_null_entry, do we?

	- fib6_del() returning -ENOENT for multiple reasons;
	  but doesn't that mean that something else has removed
	  the route already and called dst_free on it?

  In either case, the dst_free() looks like not being needed. And it only
  does no harm in most cases, because these events are rare and
  it usually finds (obsolete > 2) and does nothing.
  
  I think that what Herbert is doing is only going to enforce that
  the ip6_del_rt() is never going to fail, so the dst_free()
  won't ever be called anyway, right?

Thanks,

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-27 15:50             ` IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
@ 2010-04-27 15:55               ` Herbert Xu
  2010-05-09  6:48                 ` Herbert Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2010-04-27 15:55 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: David Miller, yoshfuji, netdev, shemminger

On Tue, Apr 27, 2010 at 05:50:34PM +0200, Jiri Bohac wrote:
>
>   I think that what Herbert is doing is only going to enforce that
>   the ip6_del_rt() is never going to fail, so the dst_free()
>   won't ever be called anyway, right?

Yes I think you're right.  But let's fix the bigger problem first,
and then we can audit this and possibly turn it into a WARN_ON.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
  2010-04-27 15:55               ` Herbert Xu
@ 2010-05-09  6:48                 ` Herbert Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2010-05-09  6:48 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: David Miller, yoshfuji, netdev, shemminger

On Tue, Apr 27, 2010 at 11:55:33PM +0800, Herbert Xu wrote:
> On Tue, Apr 27, 2010 at 05:50:34PM +0200, Jiri Bohac wrote:
> >
> >   I think that what Herbert is doing is only going to enforce that
> >   the ip6_del_rt() is never going to fail, so the dst_free()
> >   won't ever be called anyway, right?
> 
> Yes I think you're right.  But let's fix the bigger problem first,
> and then we can audit this and possibly turn it into a WARN_ON.

Sorry for not fixing this sooner, but I'm back on the case.

Also, the dst_free really is needed.  It's there for the case
where we delete the address before DAD completion.  In that
case the route object is allocated, but not inserted in the
routing table.  So we must manually dst_free it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [0/4] Fix addrconf race conditions
  2010-04-23 15:05               ` Herbert Xu
@ 2010-05-18 11:02                 ` Herbert Xu
  2010-05-18 11:04                   ` [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state Herbert Xu
                                     ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Herbert Xu @ 2010-05-18 11:02 UTC (permalink / raw)
  To: David Miller; +Cc: jbohac, yoshfuji, netdev, shemminger

On Fri, Apr 23, 2010 at 11:05:40PM +0800, Herbert Xu wrote:
>
> This stuff is more broken than I thought.  For example, we perform
> a number of actions when DAD succeeds, e.g., joining an anycast
> group.  However, this is not synchronised with respect to address
> deletion at all, so if DAD succeeds just as someone deletes the
> address, you can easily get stuck on that anycast group.
> 
> I will try to untangle this mess tomorrow.

Tomorrow took a while to arrive :)

Here is a first batch of patches.  Note that this is by no means
a comprehensive fix for all the ndisc/addrconf race conditions.
It is just a first step in trying to address the problems.

The patchset revolves around a new lock, ifp->state_lock.  I
added it instead of trying to reuse the existing ifp->lock because
the latter has serious nesting issues that prevent it from easily
being used.  My long term plan is to restructure the locking and
eventually phase out ifp->lock in favour of ifp->state_lock.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state
  2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
@ 2010-05-18 11:04                   ` Herbert Xu
  2010-05-18 17:23                     ` Stephen Hemminger
  2010-05-18 11:04                   ` [PATCH 2/4] ipv6: Use state_lock to protect ifa state Herbert Xu
                                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Herbert Xu @ 2010-05-18 11:04 UTC (permalink / raw)
  To: David Miller, jbohac, yoshfuji, netdev, shemminger

ipv6: Replace inet6_ifaddr->dead with state

This patch replaces the boolean dead flag on inet6_ifaddr with
a state enum.  This allows us to roll back changes when deleting
an address according to whether DAD has completed or not.

This patch only adds the state field and does not change the logic.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/net/if_inet6.h |   12 ++++++++++--
 net/ipv6/addrconf.c    |   11 ++++++-----
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 545d8b0..421a1e8 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -30,6 +30,13 @@
 #define IF_PREFIX_ONLINK	0x01
 #define IF_PREFIX_AUTOCONF	0x02
 
+enum {
+	INET6_IFADDR_STATE_DAD,
+	INET6_IFADDR_STATE_POSTDAD,
+	INET6_IFADDR_STATE_UP,
+	INET6_IFADDR_STATE_DEAD,
+};
+
 #ifdef __KERNEL__
 
 struct inet6_ifaddr {
@@ -40,6 +47,9 @@ struct inet6_ifaddr {
 	__u32			prefered_lft;
 	atomic_t		refcnt;
 	spinlock_t		lock;
+	spinlock_t		state_lock;
+
+	int			state;
 
 	__u8			probes;
 	__u8			flags;
@@ -62,8 +72,6 @@ struct inet6_ifaddr {
 	struct inet6_ifaddr	*ifpub;
 	int			regen_count;
 #endif
-
-	int			dead;
 };
 
 struct ip6_sf_socklist {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 413054f..14b2ccf 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -539,7 +539,7 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp)
 	if (del_timer(&ifp->timer))
 		printk("Timer is still running, when freeing ifa=%p\n", ifp);
 
-	if (!ifp->dead) {
+	if (ifp->state != INET6_IFADDR_STATE_DEAD) {
 		printk("Freeing alive inet6 address %p\n", ifp);
 		return;
 	}
@@ -642,6 +642,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr, int pfxlen,
 	ipv6_addr_copy(&ifa->addr, addr);
 
 	spin_lock_init(&ifa->lock);
+	spin_lock_init(&ifa->state_lock);
 	init_timer(&ifa->timer);
 	ifa->timer.data = (unsigned long) ifa;
 	ifa->scope = scope;
@@ -716,7 +717,7 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 
 	hash = ipv6_addr_hash(&ifp->addr);
 
-	ifp->dead = 1;
+	ifp->state = INET6_IFADDR_STATE_DEAD;
 
 	write_lock_bh(&addrconf_hash_lock);
 	for (ifap = &inet6_addr_lst[hash]; (ifa=*ifap) != NULL;
@@ -2679,7 +2680,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 	while ((ifa = idev->tempaddr_list) != NULL) {
 		idev->tempaddr_list = ifa->tmp_next;
 		ifa->tmp_next = NULL;
-		ifa->dead = 1;
+		ifa->state = INET6_IFADDR_STATE_DEAD;
 		write_unlock_bh(&idev->lock);
 		spin_lock_bh(&ifa->lock);
 
@@ -2724,7 +2725,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 			ifa->flags |= IFA_F_TENTATIVE;
 			in6_ifa_hold(ifa);
 		} else {
-			ifa->dead = 1;
+			ifa->state = INET6_IFADDR_STATE_DEAD;
 		}
 		write_unlock_bh(&idev->lock);
 
@@ -2827,7 +2828,7 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 	net_srandom(ifp->addr.s6_addr32[3]);
 
 	read_lock_bh(&idev->lock);
-	if (ifp->dead)
+	if (ifp->state == INET6_IFADDR_STATE_DEAD)
 		goto out;
 
 	spin_lock(&ifp->lock);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/4] ipv6: Use state_lock to protect ifa state
  2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
  2010-05-18 11:04                   ` [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state Herbert Xu
@ 2010-05-18 11:04                   ` Herbert Xu
  2010-05-18 11:04                   ` [PATCH 3/4] ipv6: Use POSTDAD state Herbert Xu
                                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2010-05-18 11:04 UTC (permalink / raw)
  To: David Miller, jbohac, yoshfuji, netdev, shemminger

ipv6: Use state_lock to protect ifa state

This patch makes use of the new state_lock to synchronise between
updates to the ifa state.  This fixes the issue where a remotely
triggered address deletion (through DAD failure) coincides with a
local administrative address deletion, causing certain actions to
be performed twice incorrectly.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/ipv6/addrconf.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 14b2ccf..d8d7c63 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -711,13 +711,20 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 {
 	struct inet6_ifaddr *ifa, **ifap;
 	struct inet6_dev *idev = ifp->idev;
+	int state;
 	int hash;
 	int deleted = 0, onlink = 0;
 	unsigned long expires = jiffies;
 
 	hash = ipv6_addr_hash(&ifp->addr);
 
+	spin_lock_bh(&ifp->state_lock);
+	state = ifp->state;
 	ifp->state = INET6_IFADDR_STATE_DEAD;
+	spin_unlock_bh(&ifp->state_lock);
+
+	if (state == INET6_IFADDR_STATE_DEAD)
+		goto out;
 
 	write_lock_bh(&addrconf_hash_lock);
 	for (ifap = &inet6_addr_lst[hash]; (ifa=*ifap) != NULL;
@@ -831,6 +838,7 @@ static void ipv6_del_addr(struct inet6_ifaddr *ifp)
 		dst_release(&rt->u.dst);
 	}
 
+out:
 	in6_ifa_put(ifp);
 }
 
@@ -2621,6 +2629,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 	struct inet6_dev *idev;
 	struct inet6_ifaddr *ifa, *keep_list, **bifa;
 	struct net *net = dev_net(dev);
+	int state;
 	int i;
 
 	ASSERT_RTNL();
@@ -2680,7 +2689,6 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 	while ((ifa = idev->tempaddr_list) != NULL) {
 		idev->tempaddr_list = ifa->tmp_next;
 		ifa->tmp_next = NULL;
-		ifa->state = INET6_IFADDR_STATE_DEAD;
 		write_unlock_bh(&idev->lock);
 		spin_lock_bh(&ifa->lock);
 
@@ -2723,14 +2731,25 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 
 			/* Flag it for later restoration when link comes up */
 			ifa->flags |= IFA_F_TENTATIVE;
+
+			write_unlock_bh(&idev->lock);
+
 			in6_ifa_hold(ifa);
 		} else {
+			write_unlock_bh(&idev->lock);
+			spin_lock_bh(&ifa->state_lock);
+			state = ifa->state;
 			ifa->state = INET6_IFADDR_STATE_DEAD;
+			spin_unlock_bh(&ifa->state_lock);
+
+			if (state == INET6_IFADDR_STATE_DEAD)
+				goto put_ifa;
 		}
-		write_unlock_bh(&idev->lock);
 
 		__ipv6_ifa_notify(RTM_DELADDR, ifa);
 		atomic_notifier_call_chain(&inet6addr_chain, NETDEV_DOWN, ifa);
+
+put_ifa:
 		in6_ifa_put(ifa);
 
 		write_lock_bh(&idev->lock);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/4] ipv6: Use POSTDAD state
  2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
  2010-05-18 11:04                   ` [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state Herbert Xu
  2010-05-18 11:04                   ` [PATCH 2/4] ipv6: Use state_lock to protect ifa state Herbert Xu
@ 2010-05-18 11:04                   ` Herbert Xu
  2010-05-18 11:04                   ` [PATCH 4/4] ipv6: Never schedule DAD timer on dead address Herbert Xu
  2010-05-18 17:25                   ` [0/4] Fix addrconf race conditions Stephen Hemminger
  4 siblings, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2010-05-18 11:04 UTC (permalink / raw)
  To: David Miller, jbohac, yoshfuji, netdev, shemminger

ipv6: Use POSTDAD state

This patch makes use of the new POSTDAD state.  This prevents
a race between DAD completion and failure.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/ipv6/addrconf.c |   29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d8d7c63..e7e9643 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1412,10 +1412,27 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, int dad_failed)
 		ipv6_del_addr(ifp);
 }
 
+static int addrconf_dad_end(struct inet6_ifaddr *ifp)
+{
+	int err = -ENOENT;
+
+	spin_lock(&ifp->state_lock);
+	if (ifp->state == INET6_IFADDR_STATE_DAD) {
+		ifp->state = INET6_IFADDR_STATE_POSTDAD;
+		err = 0;
+	}
+	spin_unlock(&ifp->state_lock);
+
+	return err;
+}
+
 void addrconf_dad_failure(struct inet6_ifaddr *ifp)
 {
 	struct inet6_dev *idev = ifp->idev;
 
+	if (addrconf_dad_end(ifp))
+		return;
+
 	if (net_ratelimit())
 		printk(KERN_INFO "%s: IPv6 duplicate address %pI6c detected!\n",
 			ifp->idev->dev->name, &ifp->addr);
@@ -2731,6 +2748,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 
 			/* Flag it for later restoration when link comes up */
 			ifa->flags |= IFA_F_TENTATIVE;
+			ifa->state = INET6_IFADDR_STATE_DAD;
 
 			write_unlock_bh(&idev->lock);
 
@@ -2895,6 +2913,9 @@ static void addrconf_dad_timer(unsigned long data)
 	struct inet6_dev *idev = ifp->idev;
 	struct in6_addr mcaddr;
 
+	if (!ifp->probes && addrconf_dad_end(ifp))
+		goto out;
+
 	read_lock(&idev->lock);
 	if (idev->dead || !(idev->if_flags & IF_READY)) {
 		read_unlock(&idev->lock);
@@ -2967,12 +2988,10 @@ static void addrconf_dad_run(struct inet6_dev *idev) {
 	read_lock_bh(&idev->lock);
 	for (ifp = idev->addr_list; ifp; ifp = ifp->if_next) {
 		spin_lock(&ifp->lock);
-		if (!(ifp->flags & IFA_F_TENTATIVE)) {
-			spin_unlock(&ifp->lock);
-			continue;
-		}
+		if (ifp->flags & IFA_F_TENTATIVE &&
+		    ifp->state == INET6_IFADDR_STATE_DAD)
+			addrconf_dad_kick(ifp);
 		spin_unlock(&ifp->lock);
-		addrconf_dad_kick(ifp);
 	}
 	read_unlock_bh(&idev->lock);
 }

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/4] ipv6: Never schedule DAD timer on dead address
  2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
                                     ` (2 preceding siblings ...)
  2010-05-18 11:04                   ` [PATCH 3/4] ipv6: Use POSTDAD state Herbert Xu
@ 2010-05-18 11:04                   ` Herbert Xu
  2010-05-18 17:25                   ` [0/4] Fix addrconf race conditions Stephen Hemminger
  4 siblings, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2010-05-18 11:04 UTC (permalink / raw)
  To: David Miller, jbohac, yoshfuji, netdev, shemminger

ipv6: Never schedule DAD timer on dead address

This patch ensures that all places that schedule the DAD timer
look at the address state in a safe manner before scheduling the
timer.  This ensures that we don't end up with pending timers
after deleting an address.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 net/ipv6/addrconf.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e7e9643..dd72f9c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2865,10 +2865,10 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 	net_srandom(ifp->addr.s6_addr32[3]);
 
 	read_lock_bh(&idev->lock);
+	spin_lock(&ifp->lock);
 	if (ifp->state == INET6_IFADDR_STATE_DEAD)
 		goto out;
 
-	spin_lock(&ifp->lock);
 	if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
 	    idev->cnf.accept_dad < 1 ||
 	    !(ifp->flags&IFA_F_TENTATIVE) ||
@@ -2902,8 +2902,8 @@ static void addrconf_dad_start(struct inet6_ifaddr *ifp, u32 flags)
 		ip6_ins_rt(ifp->rt);
 
 	addrconf_dad_kick(ifp);
-	spin_unlock(&ifp->lock);
 out:
+	spin_unlock(&ifp->lock);
 	read_unlock_bh(&idev->lock);
 }
 
@@ -2923,6 +2923,12 @@ static void addrconf_dad_timer(unsigned long data)
 	}
 
 	spin_lock(&ifp->lock);
+	if (ifp->state == INET6_IFADDR_STATE_DEAD) {
+		spin_unlock(&ifp->lock);
+		read_unlock(&idev->lock);
+		goto out;
+	}
+
 	if (ifp->probes == 0) {
 		/*
 		 * DAD was successful

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state
  2010-05-18 11:04                   ` [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state Herbert Xu
@ 2010-05-18 17:23                     ` Stephen Hemminger
  2010-05-18 22:27                       ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2010-05-18 17:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev

On Tue, 18 May 2010 21:04:19 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

>  
> +enum {
> +	INET6_IFADDR_STATE_DAD,
> +	INET6_IFADDR_STATE_POSTDAD,
> +	INET6_IFADDR_STATE_UP,
> +	INET6_IFADDR_STATE_DEAD,
> +};
> +
>  #ifdef __KERNEL__

Does this really need to be visible to user applications?

-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [0/4] Fix addrconf race conditions
  2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
                                     ` (3 preceding siblings ...)
  2010-05-18 11:04                   ` [PATCH 4/4] ipv6: Never schedule DAD timer on dead address Herbert Xu
@ 2010-05-18 17:25                   ` Stephen Hemminger
  2010-05-18 22:27                     ` David Miller
  4 siblings, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2010-05-18 17:25 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, jbohac, yoshfuji, netdev

On Tue, 18 May 2010 21:02:43 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Fri, Apr 23, 2010 at 11:05:40PM +0800, Herbert Xu wrote:
> >
> > This stuff is more broken than I thought.  For example, we perform
> > a number of actions when DAD succeeds, e.g., joining an anycast
> > group.  However, this is not synchronised with respect to address
> > deletion at all, so if DAD succeeds just as someone deletes the
> > address, you can easily get stuck on that anycast group.
> > 
> > I will try to untangle this mess tomorrow.
> 
> Tomorrow took a while to arrive :)
> 
> Here is a first batch of patches.  Note that this is by no means
> a comprehensive fix for all the ndisc/addrconf race conditions.
> It is just a first step in trying to address the problems.
> 
> The patchset revolves around a new lock, ifp->state_lock.  I
> added it instead of trying to reuse the existing ifp->lock because
> the latter has serious nesting issues that prevent it from easily
> being used.  My long term plan is to restructure the locking and
> eventually phase out ifp->lock in favour of ifp->state_lock.

I wonder if so many fine grained locks are really necessary at
all.  Everything but timers looks like it is under RTNL mutex
already.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state
  2010-05-18 17:23                     ` Stephen Hemminger
@ 2010-05-18 22:27                       ` David Miller
  0 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2010-05-18 22:27 UTC (permalink / raw)
  To: shemminger; +Cc: herbert, jbohac, yoshfuji, netdev

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 18 May 2010 10:23:34 -0700

> On Tue, 18 May 2010 21:04:19 +1000
> Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
>>  
>> +enum {
>> +	INET6_IFADDR_STATE_DAD,
>> +	INET6_IFADDR_STATE_POSTDAD,
>> +	INET6_IFADDR_STATE_UP,
>> +	INET6_IFADDR_STATE_DEAD,
>> +};
>> +
>>  #ifdef __KERNEL__
> 
> Does this really need to be visible to user applications?

I'll fix this up when I apply Herbert's patches.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [0/4] Fix addrconf race conditions
  2010-05-18 17:25                   ` [0/4] Fix addrconf race conditions Stephen Hemminger
@ 2010-05-18 22:27                     ` David Miller
  2010-05-18 22:35                       ` Herbert Xu
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2010-05-18 22:27 UTC (permalink / raw)
  To: shemminger; +Cc: herbert, jbohac, yoshfuji, netdev

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Tue, 18 May 2010 10:25:50 -0700

> I wonder if so many fine grained locks are really necessary at
> all.  Everything but timers looks like it is under RTNL mutex
> already.

I can't see any reasonable alternative, and it took weeks to get
a fix at all.

So I'm going to apply Herbert's fixes with the __KERNEL__ header
bit fixed up.

Thanks Herbert!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [0/4] Fix addrconf race conditions
  2010-05-18 22:27                     ` David Miller
@ 2010-05-18 22:35                       ` Herbert Xu
  0 siblings, 0 replies; 28+ messages in thread
From: Herbert Xu @ 2010-05-18 22:35 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, jbohac, yoshfuji, netdev

On Tue, May 18, 2010 at 03:27:59PM -0700, David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Tue, 18 May 2010 10:25:50 -0700
> 
> > I wonder if so many fine grained locks are really necessary at
> > all.  Everything but timers looks like it is under RTNL mutex
> > already.
> 
> I can't see any reasonable alternative, and it took weeks to get
> a fix at all.

Right, the issue is with forcing an immediate effect after an
NDISC packet triggers an address addition/deletion.

>From what I can gather, the hard case was with NDISC address
addition with DAD disabled.  That is, you have an NDISC packet
that causes an addition followed immediately by a packet destined
(or otherwise relying on) that new address.  In that case, we
simply have no chance of using the RTNL.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-05-18 22:35 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-20 17:44 IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
2010-04-20 17:57 ` Eric Dumazet
2010-04-20 20:49   ` Jiri Bohac
2010-04-20 20:57     ` Eric Dumazet
2010-04-20 21:16       ` Stephen Hemminger
2010-04-20 21:35         ` Jiri Bohac
2010-04-21 21:34 ` Jiri Bohac
2010-04-22  2:32   ` Herbert Xu
2010-04-22  7:43     ` David Miller
2010-04-22 14:25       ` Herbert Xu
2010-04-22 15:49         ` Jiri Bohac
2010-04-22 16:17           ` Stephen Hemminger
2010-04-23  1:54           ` David Miller
2010-04-23  2:10             ` Herbert Xu
2010-04-23 15:05               ` Herbert Xu
2010-05-18 11:02                 ` [0/4] Fix addrconf race conditions Herbert Xu
2010-05-18 11:04                   ` [PATCH 1/4] ipv6: Replace inet6_ifaddr->dead with state Herbert Xu
2010-05-18 17:23                     ` Stephen Hemminger
2010-05-18 22:27                       ` David Miller
2010-05-18 11:04                   ` [PATCH 2/4] ipv6: Use state_lock to protect ifa state Herbert Xu
2010-05-18 11:04                   ` [PATCH 3/4] ipv6: Use POSTDAD state Herbert Xu
2010-05-18 11:04                   ` [PATCH 4/4] ipv6: Never schedule DAD timer on dead address Herbert Xu
2010-05-18 17:25                   ` [0/4] Fix addrconf race conditions Stephen Hemminger
2010-05-18 22:27                     ` David Miller
2010-05-18 22:35                       ` Herbert Xu
2010-04-27 15:50             ` IPv6: race condition in __ipv6_ifa_notify() and dst_free() ? Jiri Bohac
2010-04-27 15:55               ` Herbert Xu
2010-05-09  6:48                 ` Herbert Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).