All of lore.kernel.org
 help / color / mirror / Atom feed
* Race condition in ipv6 code
@ 2012-01-12  2:13 Francesco Ruggeri
  2012-01-12  6:31 ` Eric Dumazet
  2012-01-13  1:17 ` Eric W. Biederman
  0 siblings, 2 replies; 14+ messages in thread
From: Francesco Ruggeri @ 2012-01-12  2:13 UTC (permalink / raw)
  To: netdev

We have hit a race condition in ipv6 code when setting
/proc/sys/net/ipv6/conf/*/forwarding. This happens when the syscall
has to be restarted.

I wonder if anyone else has run into the same issue.

The current sequence in addrconf_sysctl_forward() and
addrconf_fixup_forwarding()  is as follows:
- change the parameter in idev->cnf.forwarding (using proc_dointvec())
- try to get the rtnl lock
- if cannot get the lock then restore the original value in
idev->cnf.forwarding and restart the syscall.

While this is going on, the ipv6 code may access idev->cnf.forwarding
and get an incorrect value.
In our case we were in addrconf_ifdown (holding the rtnl lock)  and
calling __ipv6_ifa_notify(RTM_DELADDR, ifa) on the idev->addr_list
entries.
__ipv6_ifa_notify() only invokes addrconf_leave_anycast() if
idev->cnf.forwarding is set. Because a process trying to set
forwarding to 0 was stuck in the restart_syscall sequence above
flipping the flag on and off, we erroneously read the flag as 0, with
the result that addrconf_leave_anycast() was not invoked, some
idev->ac_list entries were never released, idev was never freed and
kept a reference to its net_device, and the net_device was never freed
and caused the "unregister_netdevice: waiting for xxx to become free"
message forever. In our case this was a vlan interfaces that was being
deleted, so we ended up getting stuck in vlan_ioctl_handler() holding
vlan_ioctl_mutex with further bad consequences.
The following diffs (for 2.6.38, but the same logic seems to be used
in 3.2) address the issue by modifying idev->cnf.forwarding only after
the rtnl lock is acquired. There is a similar situation for
disable_ipv6.
Any comments are appreciated.

Francesco Ruggeri


--- a/net/ipv6/addrconf.c    2011-03-14 18:20:32.000000000 -0700
+++ b/net/ipv6/addrconf.c    2012-01-10 12:56:01.458880292 -0800
@@ -507,29 +507,31 @@ static void addrconf_forward_change(stru
     rcu_read_unlock();
 }

-static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
+static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int newf)
 {
     struct net *net;
+    int old;

     net = (struct net *)table->extra2;
-    if (p == &net->ipv6.devconf_dflt->forwarding)
+    if (p == &net->ipv6.devconf_dflt->forwarding) {
+        *p = newf;
         return 0;
+    }

-    if (!rtnl_trylock()) {
-        /* Restore the original values before restarting */
-        *p = old;
+    if (!rtnl_trylock())
         return restart_syscall();
-    }
+
+    old = *p;
+    *p = newf;

     if (p == &net->ipv6.devconf_all->forwarding) {
-        __s32 newf = net->ipv6.devconf_all->forwarding;
         net->ipv6.devconf_dflt->forwarding = newf;
         addrconf_forward_change(net, newf);
-    } else if ((!*p) ^ (!old))
+    } else if ((!newf) ^ (!old))
         dev_forward_change((struct inet6_dev *)table->extra1);
     rtnl_unlock();

-    if (*p)
+    if (newf)
         rt6_purge_dflt_routers(net);
     return 1;
 }
@@ -4165,9 +4167,17 @@ int addrconf_sysctl_forward(ctl_table *c
     int *valp = ctl->data;
     int val = *valp;
     loff_t pos = *ppos;
+    ctl_table lctl;
     int ret;

-    ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+    /*
+     * ctl->data points to idev->cnf.forwarding, we should
+     * not modify it until we get the rtnl lock.
+     */
+    lctl = *ctl;
+    lctl.data = &val;
+
+    ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);

     if (write)
         ret = addrconf_fixup_forwarding(ctl, valp, val);
@@ -4205,26 +4215,28 @@ static void addrconf_disable_change(stru
     rcu_read_unlock();
 }

-static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int old)
+static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int newf)
 {
     struct net *net;
+    int old;

     net = (struct net *)table->extra2;

-    if (p == &net->ipv6.devconf_dflt->disable_ipv6)
+    if (p == &net->ipv6.devconf_dflt->disable_ipv6) {
+        *p = newf;
         return 0;
+    }

-    if (!rtnl_trylock()) {
-        /* Restore the original values before restarting */
-        *p = old;
+    if (!rtnl_trylock())
         return restart_syscall();
-    }
+
+    old = *p;
+    *p = newf;

     if (p == &net->ipv6.devconf_all->disable_ipv6) {
-        __s32 newf = net->ipv6.devconf_all->disable_ipv6;
         net->ipv6.devconf_dflt->disable_ipv6 = newf;
         addrconf_disable_change(net, newf);
-    } else if ((!*p) ^ (!old))
+    } else if ((!newf) ^ (!old))
         dev_disable_change((struct inet6_dev *)table->extra1);

     rtnl_unlock();
@@ -4238,9 +4250,17 @@ int addrconf_sysctl_disable(ctl_table *c
     int *valp = ctl->data;
     int val = *valp;
     loff_t pos = *ppos;
+    ctl_table lctl;
     int ret;

-    ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+    /*
+     * ctl->data points to idev->cnf.disable_ipv6, we should
+     * not modify it until we get the rtnl lock.
+     */
+    lctl = *ctl;
+    lctl.data = &val;
+
+    ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);

     if (write)
         ret = addrconf_disable_ipv6(ctl, valp, val);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-12  2:13 Race condition in ipv6 code Francesco Ruggeri
@ 2012-01-12  6:31 ` Eric Dumazet
  2012-01-12  6:44   ` David Miller
  2012-01-13  0:11   ` Eric W. Biederman
  2012-01-13  1:17 ` Eric W. Biederman
  1 sibling, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2012-01-12  6:31 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: netdev

Le mercredi 11 janvier 2012 à 18:13 -0800, Francesco Ruggeri a écrit :
> We have hit a race condition in ipv6 code when setting
> /proc/sys/net/ipv6/conf/*/forwarding. This happens when the syscall
> has to be restarted.
> 
> I wonder if anyone else has run into the same issue.
> 
> The current sequence in addrconf_sysctl_forward() and
> addrconf_fixup_forwarding()  is as follows:
> - change the parameter in idev->cnf.forwarding (using proc_dointvec())
> - try to get the rtnl lock
> - if cannot get the lock then restore the original value in
> idev->cnf.forwarding and restart the syscall.
> 
> While this is going on, the ipv6 code may access idev->cnf.forwarding
> and get an incorrect value.
> In our case we were in addrconf_ifdown (holding the rtnl lock)  and
> calling __ipv6_ifa_notify(RTM_DELADDR, ifa) on the idev->addr_list
> entries.
> __ipv6_ifa_notify() only invokes addrconf_leave_anycast() if
> idev->cnf.forwarding is set. Because a process trying to set
> forwarding to 0 was stuck in the restart_syscall sequence above
> flipping the flag on and off, we erroneously read the flag as 0, with
> the result that addrconf_leave_anycast() was not invoked, some
> idev->ac_list entries were never released, idev was never freed and
> kept a reference to its net_device, and the net_device was never freed
> and caused the "unregister_netdevice: waiting for xxx to become free"
> message forever. In our case this was a vlan interfaces that was being
> deleted, so we ended up getting stuck in vlan_ioctl_handler() holding
> vlan_ioctl_mutex with further bad consequences.
> The following diffs (for 2.6.38, but the same logic seems to be used
> in 3.2) address the issue by modifying idev->cnf.forwarding only after
> the rtnl lock is acquired. There is a similar situation for
> disable_ipv6.
> Any comments are appreciated.
> 
> Francesco Ruggeri

Real question is : why are we using this horrible thing at all

if (!rtnl_trylock())
	return restart_syscall();

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-12  6:31 ` Eric Dumazet
@ 2012-01-12  6:44   ` David Miller
  2012-01-12 20:48     ` Francesco Ruggeri
  2012-01-13  0:11   ` Eric W. Biederman
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2012-01-12  6:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: fruggeri, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 12 Jan 2012 07:31:51 +0100

> Real question is : why are we using this horrible thing at all
> 
> if (!rtnl_trylock())
> 	return restart_syscall();

Because:

--------------------
commit 88af182e389097997c5e2a0b42285b3522796759
Author: Eric W. Biederman <ebiederm@xmission.com>
Date:   Fri Feb 19 13:22:59 2010 +0000

    net: Fix sysctl restarts...
    
    Yuck.  It turns out that when we restart sysctls we were restarting
    with the values already changed.  Which unfortunately meant that
    the second time through we thought there was no change and skipped
    all kinds of work, despite the fact that there was indeed a change.
    
    I have fixed this the simplest way possible by restoring the changed
    values when we restart the sysctl write.
    
    One of my coworkers spotted this bug when after disabling forwarding
    on an interface pings were still forwarded.
    
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 040c4f0..26dec2b 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1317,14 +1317,19 @@ static int devinet_sysctl_forward(ctl_table *ctl, int write,
 {
 	int *valp = ctl->data;
 	int val = *valp;
+	loff_t pos = *ppos;
 	int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 
 	if (write && *valp != val) {
 		struct net *net = ctl->extra2;
 
 		if (valp != &IPV4_DEVCONF_DFLT(net, FORWARDING)) {
-			if (!rtnl_trylock())
+			if (!rtnl_trylock()) {
+				/* Restore the original values before restarting */
+				*valp = val;
+				*ppos = pos;
 				return restart_syscall();
+			}
 			if (valp == &IPV4_DEVCONF_ALL(net, FORWARDING)) {
 				inet_forward_change(net);
 			} else if (*valp) {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index de7a194..143791d 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -502,8 +502,11 @@ static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
 	if (p == &net->ipv6.devconf_dflt->forwarding)
 		return 0;
 
-	if (!rtnl_trylock())
+	if (!rtnl_trylock()) {
+		/* Restore the original values before restarting */
+		*p = old;
 		return restart_syscall();
+	}
 
 	if (p == &net->ipv6.devconf_all->forwarding) {
 		__s32 newf = net->ipv6.devconf_all->forwarding;
@@ -4028,12 +4031,15 @@ int addrconf_sysctl_forward(ctl_table *ctl, int write,
 {
 	int *valp = ctl->data;
 	int val = *valp;
+	loff_t pos = *ppos;
 	int ret;
 
 	ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 
 	if (write)
 		ret = addrconf_fixup_forwarding(ctl, valp, val);
+	if (ret)
+		*ppos = pos;
 	return ret;
 }
 
@@ -4075,8 +4081,11 @@ static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int old)
 	if (p == &net->ipv6.devconf_dflt->disable_ipv6)
 		return 0;
 
-	if (!rtnl_trylock())
+	if (!rtnl_trylock()) {
+		/* Restore the original values before restarting */
+		*p = old;
 		return restart_syscall();
+	}
 
 	if (p == &net->ipv6.devconf_all->disable_ipv6) {
 		__s32 newf = net->ipv6.devconf_all->disable_ipv6;
@@ -4095,12 +4104,15 @@ int addrconf_sysctl_disable(ctl_table *ctl, int write,
 {
 	int *valp = ctl->data;
 	int val = *valp;
+	loff_t pos = *ppos;
 	int ret;
 
 	ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
 
 	if (write)
 		ret = addrconf_disable_ipv6(ctl, valp, val);
+	if (ret)
+		*ppos = pos;
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-12  6:44   ` David Miller
@ 2012-01-12 20:48     ` Francesco Ruggeri
  0 siblings, 0 replies; 14+ messages in thread
From: Francesco Ruggeri @ 2012-01-12 20:48 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

I am not familiar with the process, how can one submit a patch for
inclusion in future releases?

On a related note, and independently of the race condition, the return
value of proc_dointvec() in addrconf_sysctl_forward() should be
checked or incorrect situations like the following could occur:

1) User successfully sets all/forwarding to 1.
2) User successfully sets eth1/forwarding to 0.
3) User tries to set all/forwarding to 0, but proc_dointvec() fails,
so addrconf_fixup_forwarding() is invoked with both old and new value
as 1. addrconf_forward_change() then executes with a new value of 1,
and it sets the value to 1 for all idevs where the value is 0.
So the result of proc_dointvec() failing when writing 0 to
all/forwarding is that the value of eth1/forwarding is changed back to
1.

Thanks,
Francesco Ruggeri


On Wed, Jan 11, 2012 at 10:44 PM, David Miller <davem@davemloft.net> wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 12 Jan 2012 07:31:51 +0100
>
>> Real question is : why are we using this horrible thing at all
>>
>> if (!rtnl_trylock())
>>       return restart_syscall();
>
> Because:
>
> --------------------
> commit 88af182e389097997c5e2a0b42285b3522796759
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date:   Fri Feb 19 13:22:59 2010 +0000
>
>    net: Fix sysctl restarts...
>
>    Yuck.  It turns out that when we restart sysctls we were restarting
>    with the values already changed.  Which unfortunately meant that
>    the second time through we thought there was no change and skipped
>    all kinds of work, despite the fact that there was indeed a change.
>
>    I have fixed this the simplest way possible by restoring the changed
>    values when we restart the sysctl write.
>
>    One of my coworkers spotted this bug when after disabling forwarding
>    on an interface pings were still forwarded.
>
>    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
>    Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index 040c4f0..26dec2b 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -1317,14 +1317,19 @@ static int devinet_sysctl_forward(ctl_table *ctl, int write,
>  {
>        int *valp = ctl->data;
>        int val = *valp;
> +       loff_t pos = *ppos;
>        int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>
>        if (write && *valp != val) {
>                struct net *net = ctl->extra2;
>
>                if (valp != &IPV4_DEVCONF_DFLT(net, FORWARDING)) {
> -                       if (!rtnl_trylock())
> +                       if (!rtnl_trylock()) {
> +                               /* Restore the original values before restarting */
> +                               *valp = val;
> +                               *ppos = pos;
>                                return restart_syscall();
> +                       }
>                        if (valp == &IPV4_DEVCONF_ALL(net, FORWARDING)) {
>                                inet_forward_change(net);
>                        } else if (*valp) {
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index de7a194..143791d 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -502,8 +502,11 @@ static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
>        if (p == &net->ipv6.devconf_dflt->forwarding)
>                return 0;
>
> -       if (!rtnl_trylock())
> +       if (!rtnl_trylock()) {
> +               /* Restore the original values before restarting */
> +               *p = old;
>                return restart_syscall();
> +       }
>
>        if (p == &net->ipv6.devconf_all->forwarding) {
>                __s32 newf = net->ipv6.devconf_all->forwarding;
> @@ -4028,12 +4031,15 @@ int addrconf_sysctl_forward(ctl_table *ctl, int write,
>  {
>        int *valp = ctl->data;
>        int val = *valp;
> +       loff_t pos = *ppos;
>        int ret;
>
>        ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>
>        if (write)
>                ret = addrconf_fixup_forwarding(ctl, valp, val);
> +       if (ret)
> +               *ppos = pos;
>        return ret;
>  }
>
> @@ -4075,8 +4081,11 @@ static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int old)
>        if (p == &net->ipv6.devconf_dflt->disable_ipv6)
>                return 0;
>
> -       if (!rtnl_trylock())
> +       if (!rtnl_trylock()) {
> +               /* Restore the original values before restarting */
> +               *p = old;
>                return restart_syscall();
> +       }
>
>        if (p == &net->ipv6.devconf_all->disable_ipv6) {
>                __s32 newf = net->ipv6.devconf_all->disable_ipv6;
> @@ -4095,12 +4104,15 @@ int addrconf_sysctl_disable(ctl_table *ctl, int write,
>  {
>        int *valp = ctl->data;
>        int val = *valp;
> +       loff_t pos = *ppos;
>        int ret;
>
>        ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>
>        if (write)
>                ret = addrconf_disable_ipv6(ctl, valp, val);
> +       if (ret)
> +               *ppos = pos;
>        return ret;
>  }
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-12  6:31 ` Eric Dumazet
  2012-01-12  6:44   ` David Miller
@ 2012-01-13  0:11   ` Eric W. Biederman
  2012-01-13  6:02     ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2012-01-13  0:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Francesco Ruggeri, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le mercredi 11 janvier 2012 à 18:13 -0800, Francesco Ruggeri a écrit :
>> We have hit a race condition in ipv6 code when setting
>> /proc/sys/net/ipv6/conf/*/forwarding. This happens when the syscall
>> has to be restarted.
>> 
>> I wonder if anyone else has run into the same issue.
>> 
>> The current sequence in addrconf_sysctl_forward() and
>> addrconf_fixup_forwarding()  is as follows:
>> - change the parameter in idev->cnf.forwarding (using proc_dointvec())
>> - try to get the rtnl lock
>> - if cannot get the lock then restore the original value in
>> idev->cnf.forwarding and restart the syscall.
>> 
>> While this is going on, the ipv6 code may access idev->cnf.forwarding
>> and get an incorrect value.
>> In our case we were in addrconf_ifdown (holding the rtnl lock)  and
>> calling __ipv6_ifa_notify(RTM_DELADDR, ifa) on the idev->addr_list
>> entries.
>> __ipv6_ifa_notify() only invokes addrconf_leave_anycast() if
>> idev->cnf.forwarding is set. Because a process trying to set
>> forwarding to 0 was stuck in the restart_syscall sequence above
>> flipping the flag on and off, we erroneously read the flag as 0, with
>> the result that addrconf_leave_anycast() was not invoked, some
>> idev->ac_list entries were never released, idev was never freed and
>> kept a reference to its net_device, and the net_device was never freed
>> and caused the "unregister_netdevice: waiting for xxx to become free"
>> message forever. In our case this was a vlan interfaces that was being
>> deleted, so we ended up getting stuck in vlan_ioctl_handler() holding
>> vlan_ioctl_mutex with further bad consequences.
>> The following diffs (for 2.6.38, but the same logic seems to be used
>> in 3.2) address the issue by modifying idev->cnf.forwarding only after
>> the rtnl lock is acquired. There is a similar situation for
>> disable_ipv6.
>> Any comments are appreciated.
>> 
>> Francesco Ruggeri
>
> Real question is : why are we using this horrible thing at all
>
> if (!rtnl_trylock())
> 	return restart_syscall();

Because the rtnl_lock is broad and we have ABBA deadlocks if we don't in
particular we hold the rtnl_lock over sysctl registration and removal.
sysctl removal blocks until all of the callers into the sysctl methods
namely addrconf_sysctl_forward in this case finish executing.

CPU 0                                         CPU 1

rtnl_lock()                                   use_count++
unregister_netdevice()                           addrconf_ctl_foward
  unregister_sysctl_table()                        rtnl_lock()
     wait for use_count of addrconf_ctl_forward
       to == 0
     

I smacked lockdep around so it would warn about the sysfs ones.
The proc and sysctl ones I never did manage to get lockdep warnings
but a ABBA deadlock is most definitely possible.

Any solutions better than simply restarting the system call are welcome.

Perhaps for these heavy weigh methods we should create a work struct
and go schedule work to perform the change instead of trying to do the
work synchronously in the sysctl handler.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-12  2:13 Race condition in ipv6 code Francesco Ruggeri
  2012-01-12  6:31 ` Eric Dumazet
@ 2012-01-13  1:17 ` Eric W. Biederman
  2012-01-13  1:57   ` Stephen Hemminger
  2012-01-13 22:02   ` Francesco Ruggeri
  1 sibling, 2 replies; 14+ messages in thread
From: Eric W. Biederman @ 2012-01-13  1:17 UTC (permalink / raw)
  To: Francesco Ruggeri; +Cc: netdev, Eric Dumazet, David Miller

Francesco Ruggeri <fruggeri@aristanetworks.com> writes:

> We have hit a race condition in ipv6 code when setting
> /proc/sys/net/ipv6/conf/*/forwarding. This happens when the syscall
> has to be restarted.
>
> I wonder if anyone else has run into the same issue.
>
> The current sequence in addrconf_sysctl_forward() and
> addrconf_fixup_forwarding()  is as follows:
> - change the parameter in idev->cnf.forwarding (using proc_dointvec())
> - try to get the rtnl lock
> - if cannot get the lock then restore the original value in
> idev->cnf.forwarding and restart the syscall.
>
> While this is going on, the ipv6 code may access idev->cnf.forwarding
> and get an incorrect value.
> In our case we were in addrconf_ifdown (holding the rtnl lock)  and
> calling __ipv6_ifa_notify(RTM_DELADDR, ifa) on the idev->addr_list
> entries.
> __ipv6_ifa_notify() only invokes addrconf_leave_anycast() if
> idev->cnf.forwarding is set. Because a process trying to set
> forwarding to 0 was stuck in the restart_syscall sequence above
> flipping the flag on and off, we erroneously read the flag as 0, with
> the result that addrconf_leave_anycast() was not invoked, some
> idev->ac_list entries were never released, idev was never freed and
> kept a reference to its net_device, and the net_device was never freed
> and caused the "unregister_netdevice: waiting for xxx to become free"
> message forever. In our case this was a vlan interfaces that was being
> deleted, so we ended up getting stuck in vlan_ioctl_handler() holding
> vlan_ioctl_mutex with further bad consequences.
> The following diffs (for 2.6.38, but the same logic seems to be used
> in 3.2) address the issue by modifying idev->cnf.forwarding only after
> the rtnl lock is acquired. There is a similar situation for
> disable_ipv6.
> Any comments are appreciated.

Interesting.  So ultimately the problem is not the syscall restart
although that exacerbates it, the problem is that we expect 
idev->cnf.forwarding to be protected by the rtnl_lock and it is not.

At first read through your patch looks good.  I am a bit worried that
we have some versions of the value: aka
net->ipv6.devconf_dflt->forwarding not protected by the rtnl_lock
and other version of the value protected by the rtnl_lock.

That just seems confusing.

We can't hold the rtnl_lock around proc_dointvec because that can sleep
indefinitely in copy_from_user.  So it looks like your change to create
a temporary ctl_table and call proc_dointvec seems very reasonable,
and necessary however we do this.

I don't know if there are other places that need the rtnl_lock that
but your patch below looks like it makes things better for all of
the right reasons.  So on that score.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Unless someone wants to volunteer to sort out the impedance mismatch
between these tunables and the sysctl infrastructure.  I suggest
you resend this patch to David with [PATCH] in the subject line.

I would also suggest a little clearer description why
idev->cnf.forwarding and idev->cnf.disable_ipv6 need rntl_lock
protection.

But overall this looks like a pretty obvious bug fix, to the
problem that we need the rtnl_lock to protect idev->cnf.forwarding,
and we currently allow updates to idev->cnf.forwarding without
holding the rtnl_lock. 

Eric


> Francesco Ruggeri
>
>
> --- a/net/ipv6/addrconf.c    2011-03-14 18:20:32.000000000 -0700
> +++ b/net/ipv6/addrconf.c    2012-01-10 12:56:01.458880292 -0800
> @@ -507,29 +507,31 @@ static void addrconf_forward_change(stru
>      rcu_read_unlock();
>  }
>
> -static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
> +static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int newf)
>  {
>      struct net *net;
> +    int old;
>
>      net = (struct net *)table->extra2;
> -    if (p == &net->ipv6.devconf_dflt->forwarding)
> +    if (p == &net->ipv6.devconf_dflt->forwarding) {
> +        *p = newf;
>          return 0;
> +    }
>
> -    if (!rtnl_trylock()) {
> -        /* Restore the original values before restarting */
> -        *p = old;
> +    if (!rtnl_trylock())
>          return restart_syscall();
> -    }
> +
> +    old = *p;
> +    *p = newf;
>
>      if (p == &net->ipv6.devconf_all->forwarding) {
> -        __s32 newf = net->ipv6.devconf_all->forwarding;
>          net->ipv6.devconf_dflt->forwarding = newf;
>          addrconf_forward_change(net, newf);
> -    } else if ((!*p) ^ (!old))
> +    } else if ((!newf) ^ (!old))
>          dev_forward_change((struct inet6_dev *)table->extra1);
>      rtnl_unlock();
>
> -    if (*p)
> +    if (newf)
>          rt6_purge_dflt_routers(net);
>      return 1;
>  }
> @@ -4165,9 +4167,17 @@ int addrconf_sysctl_forward(ctl_table *c
>      int *valp = ctl->data;
>      int val = *valp;
>      loff_t pos = *ppos;
> +    ctl_table lctl;
>      int ret;
>
> -    ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> +    /*
> +     * ctl->data points to idev->cnf.forwarding, we should
> +     * not modify it until we get the rtnl lock.
> +     */
> +    lctl = *ctl;
> +    lctl.data = &val;
> +
> +    ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
>
>      if (write)
>          ret = addrconf_fixup_forwarding(ctl, valp, val);
> @@ -4205,26 +4215,28 @@ static void addrconf_disable_change(stru
>      rcu_read_unlock();
>  }
>
> -static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int old)
> +static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int newf)
>  {
>      struct net *net;
> +    int old;
>
>      net = (struct net *)table->extra2;
>
> -    if (p == &net->ipv6.devconf_dflt->disable_ipv6)
> +    if (p == &net->ipv6.devconf_dflt->disable_ipv6) {
> +        *p = newf;
>          return 0;
> +    }
>
> -    if (!rtnl_trylock()) {
> -        /* Restore the original values before restarting */
> -        *p = old;
> +    if (!rtnl_trylock())
>          return restart_syscall();
> -    }
> +
> +    old = *p;
> +    *p = newf;
>
>      if (p == &net->ipv6.devconf_all->disable_ipv6) {
> -        __s32 newf = net->ipv6.devconf_all->disable_ipv6;
>          net->ipv6.devconf_dflt->disable_ipv6 = newf;
>          addrconf_disable_change(net, newf);
> -    } else if ((!*p) ^ (!old))
> +    } else if ((!newf) ^ (!old))
>          dev_disable_change((struct inet6_dev *)table->extra1);
>
>      rtnl_unlock();
> @@ -4238,9 +4250,17 @@ int addrconf_sysctl_disable(ctl_table *c
>      int *valp = ctl->data;
>      int val = *valp;
>      loff_t pos = *ppos;
> +    ctl_table lctl;
>      int ret;
>
> -    ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> +    /*
> +     * ctl->data points to idev->cnf.disable_ipv6, we should
> +     * not modify it until we get the rtnl lock.
> +     */
> +    lctl = *ctl;
> +    lctl.data = &val;
> +
> +    ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
>
>      if (write)
>          ret = addrconf_disable_ipv6(ctl, valp, val);
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-13  1:17 ` Eric W. Biederman
@ 2012-01-13  1:57   ` Stephen Hemminger
  2012-01-13 22:02   ` Francesco Ruggeri
  1 sibling, 0 replies; 14+ messages in thread
From: Stephen Hemminger @ 2012-01-13  1:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Francesco Ruggeri, netdev, Eric Dumazet, David Miller

On Thu, 12 Jan 2012 17:17:32 -0800
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Francesco Ruggeri <fruggeri@aristanetworks.com> writes:
> 
> > We have hit a race condition in ipv6 code when setting
> > /proc/sys/net/ipv6/conf/*/forwarding. This happens when the syscall
> > has to be restarted.
> >
> > I wonder if anyone else has run into the same issue.
> >
> > The current sequence in addrconf_sysctl_forward() and
> > addrconf_fixup_forwarding()  is as follows:
> > - change the parameter in idev->cnf.forwarding (using proc_dointvec())
> > - try to get the rtnl lock
> > - if cannot get the lock then restore the original value in
> > idev->cnf.forwarding and restart the syscall.
> >
> > While this is going on, the ipv6 code may access idev->cnf.forwarding
> > and get an incorrect value.
> > In our case we were in addrconf_ifdown (holding the rtnl lock)  and
> > calling __ipv6_ifa_notify(RTM_DELADDR, ifa) on the idev->addr_list
> > entries.
> > __ipv6_ifa_notify() only invokes addrconf_leave_anycast() if
> > idev->cnf.forwarding is set. Because a process trying to set
> > forwarding to 0 was stuck in the restart_syscall sequence above
> > flipping the flag on and off, we erroneously read the flag as 0, with
> > the result that addrconf_leave_anycast() was not invoked, some
> > idev->ac_list entries were never released, idev was never freed and
> > kept a reference to its net_device, and the net_device was never freed
> > and caused the "unregister_netdevice: waiting for xxx to become free"
> > message forever. In our case this was a vlan interfaces that was being
> > deleted, so we ended up getting stuck in vlan_ioctl_handler() holding
> > vlan_ioctl_mutex with further bad consequences.
> > The following diffs (for 2.6.38, but the same logic seems to be used
> > in 3.2) address the issue by modifying idev->cnf.forwarding only after
> > the rtnl lock is acquired. There is a similar situation for
> > disable_ipv6.
> > Any comments are appreciated.
> 
> Interesting.  So ultimately the problem is not the syscall restart
> although that exacerbates it, the problem is that we expect 
> idev->cnf.forwarding to be protected by the rtnl_lock and it is not.
> 
> At first read through your patch looks good.  I am a bit worried that
> we have some versions of the value: aka
> net->ipv6.devconf_dflt->forwarding not protected by the rtnl_lock
> and other version of the value protected by the rtnl_lock.
> 
> That just seems confusing.
> 
> We can't hold the rtnl_lock around proc_dointvec because that can sleep
> indefinitely in copy_from_user.  So it looks like your change to create
> a temporary ctl_table and call proc_dointvec seems very reasonable,
> and necessary however we do this.
> 
> I don't know if there are other places that need the rtnl_lock that
> but your patch below looks like it makes things better for all of
> the right reasons.  So on that score.
> 
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> Unless someone wants to volunteer to sort out the impedance mismatch
> between these tunables and the sysctl infrastructure.  I suggest
> you resend this patch to David with [PATCH] in the subject line.
> 
> I would also suggest a little clearer description why
> idev->cnf.forwarding and idev->cnf.disable_ipv6 need rntl_lock
> protection.
> 
> But overall this looks like a pretty obvious bug fix, to the
> problem that we need the rtnl_lock to protect idev->cnf.forwarding,
> and we currently allow updates to idev->cnf.forwarding without
> holding the rtnl_lock. 
> 
> Eric
> 

Looks like a better function (proc_doint_rtnl?) needs to be built
that has the locking in the right place. I.e:
  get value from user
  get lock (with restart)
  do changes
  unlock

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-13  0:11   ` Eric W. Biederman
@ 2012-01-13  6:02     ` Eric Dumazet
  2012-01-13  7:40       ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2012-01-13  6:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Francesco Ruggeri, netdev

Le jeudi 12 janvier 2012 à 16:11 -0800, Eric W. Biederman a écrit :

> Because the rtnl_lock is broad and we have ABBA deadlocks if we don't in
> particular we hold the rtnl_lock over sysctl registration and removal.
> sysctl removal blocks until all of the callers into the sysctl methods
> namely addrconf_sysctl_forward in this case finish executing.
> 
> CPU 0                                         CPU 1
> 
> rtnl_lock()                                   use_count++
> unregister_netdevice()                           addrconf_ctl_foward
>   unregister_sysctl_table()                        rtnl_lock()
>      wait for use_count of addrconf_ctl_forward
>        to == 0
>      
> 
> I smacked lockdep around so it would warn about the sysfs ones.
> The proc and sysctl ones I never did manage to get lockdep warnings
> but a ABBA deadlock is most definitely possible.
> 
> Any solutions better than simply restarting the system call are welcome.
> 
> Perhaps for these heavy weigh methods we should create a work struct
> and go schedule work to perform the change instead of trying to do the
> work synchronously in the sysctl handler.
> 

This idea was discussed in netconf 2011 (I focused on the ability to
dismantle netdevices at high rates), and I admit I had not implemented
yet.

The problem with rtnl_trylock() is that we make very litle progress if
many tasks are competing for rtnl, and we have no fairness.

The task holding RTNL might be descheduled and all cpus running other
threads spinning in rtnl_trylock()/restart_syscall(). Almost a deadlock.

Maybe waiting RTNL for at least a couple of ms would help, instead of
instantly failing rtnl_trylock() and looping.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-13  6:02     ` Eric Dumazet
@ 2012-01-13  7:40       ` Eric W. Biederman
  2012-01-13 17:04         ` Ben Greear
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2012-01-13  7:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Francesco Ruggeri, netdev, Stephen Hemminger

Eric Dumazet <eric.dumazet@gmail.com> writes:

> Le jeudi 12 janvier 2012 à 16:11 -0800, Eric W. Biederman a écrit :
>
>> Because the rtnl_lock is broad and we have ABBA deadlocks if we don't in
>> particular we hold the rtnl_lock over sysctl registration and removal.
>> sysctl removal blocks until all of the callers into the sysctl methods
>> namely addrconf_sysctl_forward in this case finish executing.
>> 
>> CPU 0                                         CPU 1
>> 
>> rtnl_lock()                                   use_count++
>> unregister_netdevice()                           addrconf_ctl_foward
>>   unregister_sysctl_table()                        rtnl_lock()
>>      wait for use_count of addrconf_ctl_forward
>>        to == 0
>>      
>> 
>> I smacked lockdep around so it would warn about the sysfs ones.
>> The proc and sysctl ones I never did manage to get lockdep warnings
>> but a ABBA deadlock is most definitely possible.
>> 
>> Any solutions better than simply restarting the system call are welcome.
>> 
>> Perhaps for these heavy weigh methods we should create a work struct
>> and go schedule work to perform the change instead of trying to do the
>> work synchronously in the sysctl handler.
>> 
>
> This idea was discussed in netconf 2011 (I focused on the ability to
> dismantle netdevices at high rates), and I admit I had not implemented
> yet.

Interesting.  That is a discussion I would have been interested to be
a part of.

In my queue I have sysctl cleanup and speed improvements I should be
posting them for review in the next week or two.

That leaves /proc/net/dev_snmp6/.... as the scalability bottleneck for
the number of network devices.

I have gotten the rcu_barrier in device unregister out from under the
trylock.

At a practical level the rcu_barrier seems to be the primary bottleneck
to remove netdevices at a high rate, although I am certain there are
other issues that crop up.  If I go with a networking path that can
batch network devices, and am not bottlenecked by proc, sysfs or sysctl
I can remove 10,000 or so network devices a second.

I have been scratching my head trying to figure out if it is possible
to batch network device deletes that come in via a netlink DELLINK message.

> The problem with rtnl_trylock() is that we make very litle progress if
> many tasks are competing for rtnl, and we have no fairness.

Yes.  That code path uses rtnl_trylock() simply because I needed a
simple and correct fix.  It was never intended to be the long term
solution, but rather a stop gap to at least remove the possibility of
deadlock in the kernel.  It is deployed fairly widely because I found
a lot of examples of that bug.

> The task holding RTNL might be descheduled and all cpus running other
> threads spinning in rtnl_trylock()/restart_syscall(). Almost a deadlock.
>
> Maybe waiting RTNL for at least a couple of ms would help, instead of
> instantly failing rtnl_trylock() and looping.

Yes.  If the rtnl_trylock and syscall_restart looping is a problem
we should use a lock other than the rtnl_lock for those variables
or we should use a work queue and take the rtnl_lock in a context
where we don't need the try lock.

I don't think there is much point in getting sophisticated about
rtnl_trylock().  It was a good stopgap but we really should remove
the ABBA deadlock.

As for Stephen Hemminger's suggestion to make a: proc_do_intvec_and_rtnl_lock
function.  We could make a helper that does what Francesco did but in a
slightly nicer fashion.  What we can't do is make a version that does
not have the ABBA deadlock, which makes the nasty rtnl_trylock feasible.
Given that we still would have the nasty effects of the rtnl_trylock and
syscall_restart idiom I don't expect creating a helper and encouraging
people to build code that is going to need that logic is a particularly
good idea.

The other option is to look at dropping the rtnl_lock in addrconf_ifdown
around the addrconf_sysctl_unregister call.  It doesn't seem obvious to
me that we can.

I just thought through the sysctl case just to be certain that we can't
take the locks in the other order, and I just don't see a way to do it.
The problem is that the use tracking guarantees that none of the user
supplied data is being used and it is safe to remove a module.  If the
lock comes from the module kaboom.

Hmm.  I partially take back my conclusion that I can't fix the problem
at the sysctl level.  If we want to take rtnl_lock for every
networking sysctl I think I could enhance the ctl_table_set or the
ctl_table_root concept to have a mutex we could take over every
networking sysctl.  With copy_from_user being in the middle of that
I'm not particularly fond of that idea.  Nor am I fond of taking
the rtnl_mutex in even more situations than we take it now.  Plus it
seems like more deep magic that will confuse people.

So I really think the best solution to avoid the locking craziness is to
have a wrapper that gets the value from userspace and calls
schedule_work to get another thread to actually process the change.  I
don't see any problems with writing a helper function for that.  The
only downside with using schedule_work is that we return to userspace
before the change has been fully installed in the kernel.  I don't
expect that would be a problem but stranger things have happened.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-13  7:40       ` Eric W. Biederman
@ 2012-01-13 17:04         ` Ben Greear
  2012-01-14  5:46           ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Greear @ 2012-01-13 17:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Eric Dumazet, Francesco Ruggeri, netdev, Stephen Hemminger

On 01/12/2012 11:40 PM, Eric W. Biederman wrote:

> So I really think the best solution to avoid the locking craziness is to
> have a wrapper that gets the value from userspace and calls
> schedule_work to get another thread to actually process the change.  I
> don't see any problems with writing a helper function for that.  The
> only downside with using schedule_work is that we return to userspace
> before the change has been fully installed in the kernel.  I don't
> expect that would be a problem but stranger things have happened.

That sounds a bit risky to me.  If something sets a value, and then
queries it, it should always show the proper result for the previous
calls.

If the queries also went through the the same sched-work queue
then maybe it would be OK.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-13  1:17 ` Eric W. Biederman
  2012-01-13  1:57   ` Stephen Hemminger
@ 2012-01-13 22:02   ` Francesco Ruggeri
  1 sibling, 0 replies; 14+ messages in thread
From: Francesco Ruggeri @ 2012-01-13 22:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: netdev, Eric Dumazet, David Miller

Thanks.
I will submit a patch in a few days, then you guys can decide if you
want to go for a more comprehensive approach.

Francesco

On Thu, Jan 12, 2012 at 5:17 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Francesco Ruggeri <fruggeri@aristanetworks.com> writes:
>
>> We have hit a race condition in ipv6 code when setting
>> /proc/sys/net/ipv6/conf/*/forwarding. This happens when the syscall
>> has to be restarted.
>>
>> I wonder if anyone else has run into the same issue.
>>
>> The current sequence in addrconf_sysctl_forward() and
>> addrconf_fixup_forwarding()  is as follows:
>> - change the parameter in idev->cnf.forwarding (using proc_dointvec())
>> - try to get the rtnl lock
>> - if cannot get the lock then restore the original value in
>> idev->cnf.forwarding and restart the syscall.
>>
>> While this is going on, the ipv6 code may access idev->cnf.forwarding
>> and get an incorrect value.
>> In our case we were in addrconf_ifdown (holding the rtnl lock)  and
>> calling __ipv6_ifa_notify(RTM_DELADDR, ifa) on the idev->addr_list
>> entries.
>> __ipv6_ifa_notify() only invokes addrconf_leave_anycast() if
>> idev->cnf.forwarding is set. Because a process trying to set
>> forwarding to 0 was stuck in the restart_syscall sequence above
>> flipping the flag on and off, we erroneously read the flag as 0, with
>> the result that addrconf_leave_anycast() was not invoked, some
>> idev->ac_list entries were never released, idev was never freed and
>> kept a reference to its net_device, and the net_device was never freed
>> and caused the "unregister_netdevice: waiting for xxx to become free"
>> message forever. In our case this was a vlan interfaces that was being
>> deleted, so we ended up getting stuck in vlan_ioctl_handler() holding
>> vlan_ioctl_mutex with further bad consequences.
>> The following diffs (for 2.6.38, but the same logic seems to be used
>> in 3.2) address the issue by modifying idev->cnf.forwarding only after
>> the rtnl lock is acquired. There is a similar situation for
>> disable_ipv6.
>> Any comments are appreciated.
>
> Interesting.  So ultimately the problem is not the syscall restart
> although that exacerbates it, the problem is that we expect
> idev->cnf.forwarding to be protected by the rtnl_lock and it is not.
>
> At first read through your patch looks good.  I am a bit worried that
> we have some versions of the value: aka
> net->ipv6.devconf_dflt->forwarding not protected by the rtnl_lock
> and other version of the value protected by the rtnl_lock.
>
> That just seems confusing.
>
> We can't hold the rtnl_lock around proc_dointvec because that can sleep
> indefinitely in copy_from_user.  So it looks like your change to create
> a temporary ctl_table and call proc_dointvec seems very reasonable,
> and necessary however we do this.
>
> I don't know if there are other places that need the rtnl_lock that
> but your patch below looks like it makes things better for all of
> the right reasons.  So on that score.
>
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Unless someone wants to volunteer to sort out the impedance mismatch
> between these tunables and the sysctl infrastructure.  I suggest
> you resend this patch to David with [PATCH] in the subject line.
>
> I would also suggest a little clearer description why
> idev->cnf.forwarding and idev->cnf.disable_ipv6 need rntl_lock
> protection.
>
> But overall this looks like a pretty obvious bug fix, to the
> problem that we need the rtnl_lock to protect idev->cnf.forwarding,
> and we currently allow updates to idev->cnf.forwarding without
> holding the rtnl_lock.
>
> Eric
>
>
>> Francesco Ruggeri
>>
>>
>> --- a/net/ipv6/addrconf.c    2011-03-14 18:20:32.000000000 -0700
>> +++ b/net/ipv6/addrconf.c    2012-01-10 12:56:01.458880292 -0800
>> @@ -507,29 +507,31 @@ static void addrconf_forward_change(stru
>>      rcu_read_unlock();
>>  }
>>
>> -static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int old)
>> +static int addrconf_fixup_forwarding(struct ctl_table *table, int *p, int newf)
>>  {
>>      struct net *net;
>> +    int old;
>>
>>      net = (struct net *)table->extra2;
>> -    if (p == &net->ipv6.devconf_dflt->forwarding)
>> +    if (p == &net->ipv6.devconf_dflt->forwarding) {
>> +        *p = newf;
>>          return 0;
>> +    }
>>
>> -    if (!rtnl_trylock()) {
>> -        /* Restore the original values before restarting */
>> -        *p = old;
>> +    if (!rtnl_trylock())
>>          return restart_syscall();
>> -    }
>> +
>> +    old = *p;
>> +    *p = newf;
>>
>>      if (p == &net->ipv6.devconf_all->forwarding) {
>> -        __s32 newf = net->ipv6.devconf_all->forwarding;
>>          net->ipv6.devconf_dflt->forwarding = newf;
>>          addrconf_forward_change(net, newf);
>> -    } else if ((!*p) ^ (!old))
>> +    } else if ((!newf) ^ (!old))
>>          dev_forward_change((struct inet6_dev *)table->extra1);
>>      rtnl_unlock();
>>
>> -    if (*p)
>> +    if (newf)
>>          rt6_purge_dflt_routers(net);
>>      return 1;
>>  }
>> @@ -4165,9 +4167,17 @@ int addrconf_sysctl_forward(ctl_table *c
>>      int *valp = ctl->data;
>>      int val = *valp;
>>      loff_t pos = *ppos;
>> +    ctl_table lctl;
>>      int ret;
>>
>> -    ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>> +    /*
>> +     * ctl->data points to idev->cnf.forwarding, we should
>> +     * not modify it until we get the rtnl lock.
>> +     */
>> +    lctl = *ctl;
>> +    lctl.data = &val;
>> +
>> +    ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
>>
>>      if (write)
>>          ret = addrconf_fixup_forwarding(ctl, valp, val);
>> @@ -4205,26 +4215,28 @@ static void addrconf_disable_change(stru
>>      rcu_read_unlock();
>>  }
>>
>> -static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int old)
>> +static int addrconf_disable_ipv6(struct ctl_table *table, int *p, int newf)
>>  {
>>      struct net *net;
>> +    int old;
>>
>>      net = (struct net *)table->extra2;
>>
>> -    if (p == &net->ipv6.devconf_dflt->disable_ipv6)
>> +    if (p == &net->ipv6.devconf_dflt->disable_ipv6) {
>> +        *p = newf;
>>          return 0;
>> +    }
>>
>> -    if (!rtnl_trylock()) {
>> -        /* Restore the original values before restarting */
>> -        *p = old;
>> +    if (!rtnl_trylock())
>>          return restart_syscall();
>> -    }
>> +
>> +    old = *p;
>> +    *p = newf;
>>
>>      if (p == &net->ipv6.devconf_all->disable_ipv6) {
>> -        __s32 newf = net->ipv6.devconf_all->disable_ipv6;
>>          net->ipv6.devconf_dflt->disable_ipv6 = newf;
>>          addrconf_disable_change(net, newf);
>> -    } else if ((!*p) ^ (!old))
>> +    } else if ((!newf) ^ (!old))
>>          dev_disable_change((struct inet6_dev *)table->extra1);
>>
>>      rtnl_unlock();
>> @@ -4238,9 +4250,17 @@ int addrconf_sysctl_disable(ctl_table *c
>>      int *valp = ctl->data;
>>      int val = *valp;
>>      loff_t pos = *ppos;
>> +    ctl_table lctl;
>>      int ret;
>>
>> -    ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
>> +    /*
>> +     * ctl->data points to idev->cnf.disable_ipv6, we should
>> +     * not modify it until we get the rtnl lock.
>> +     */
>> +    lctl = *ctl;
>> +    lctl.data = &val;
>> +
>> +    ret = proc_dointvec(&lctl, write, buffer, lenp, ppos);
>>
>>      if (write)
>>          ret = addrconf_disable_ipv6(ctl, valp, val);
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-13 17:04         ` Ben Greear
@ 2012-01-14  5:46           ` Eric W. Biederman
  2012-01-14 18:31             ` Ben Greear
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2012-01-14  5:46 UTC (permalink / raw)
  To: Ben Greear; +Cc: Eric Dumazet, Francesco Ruggeri, netdev, Stephen Hemminger

Ben Greear <greearb@candelatech.com> writes:

> On 01/12/2012 11:40 PM, Eric W. Biederman wrote:
>
>> So I really think the best solution to avoid the locking craziness is to
>> have a wrapper that gets the value from userspace and calls
>> schedule_work to get another thread to actually process the change.  I
>> don't see any problems with writing a helper function for that.  The
>> only downside with using schedule_work is that we return to userspace
>> before the change has been fully installed in the kernel.  I don't
>> expect that would be a problem but stranger things have happened.
>
> That sounds a bit risky to me.  If something sets a value, and then
> queries it, it should always show the proper result for the previous
> calls.

Which is easy to do if you keep two values.  One integer
for the userspace control and another integer for the internal
kernel state.

The problem is that we have exactly one integer currently.

> If the queries also went through the the same sched-work queue
> then maybe it would be OK.

We can't want for anything that has to take the rtnl_lock.  That would
be the same as taking the rtnl_lock from a locking perspective.

I expect I would use something like:
struct rtnl_protected_knob {
	struct work_struct work;
        int userspace_value;
        int *kernel_var;
        void (*func)(int new_value, *kernel_var);
};

userspace_value would be what userspace sees, and kernel_var would be a
pointer to the value that we manipulate in the kernel.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-14  5:46           ` Eric W. Biederman
@ 2012-01-14 18:31             ` Ben Greear
  2012-01-20  2:54               ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Ben Greear @ 2012-01-14 18:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Eric Dumazet, Francesco Ruggeri, netdev, Stephen Hemminger

On 01/13/2012 09:46 PM, Eric W. Biederman wrote:
> Ben Greear<greearb@candelatech.com>  writes:
>
>> On 01/12/2012 11:40 PM, Eric W. Biederman wrote:
>>
>>> So I really think the best solution to avoid the locking craziness is to
>>> have a wrapper that gets the value from userspace and calls
>>> schedule_work to get another thread to actually process the change.  I
>>> don't see any problems with writing a helper function for that.  The
>>> only downside with using schedule_work is that we return to userspace
>>> before the change has been fully installed in the kernel.  I don't
>>> expect that would be a problem but stranger things have happened.
>>
>> That sounds a bit risky to me.  If something sets a value, and then
>> queries it, it should always show the proper result for the previous
>> calls.
>
> Which is easy to do if you keep two values.  One integer
> for the userspace control and another integer for the internal
> kernel state.
>
> The problem is that we have exactly one integer currently.
>
>> If the queries also went through the the same sched-work queue
>> then maybe it would be OK.
>
> We can't want for anything that has to take the rtnl_lock.  That would
> be the same as taking the rtnl_lock from a locking perspective.
>
> I expect I would use something like:
> struct rtnl_protected_knob {
> 	struct work_struct work;
>          int userspace_value;
>          int *kernel_var;
>          void (*func)(int new_value, *kernel_var);
> };
>
> userspace_value would be what userspace sees, and kernel_var would be a
> pointer to the value that we manipulate in the kernel.

What if valid values are 0-5 and user sets value to 6 and then immediately
queries the value?  Would your method possibly return 6, when in fact when
the kernel does the work it will internally either reject the setting and
stay with the old value or round the 6 down to 5?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Race condition in ipv6 code
  2012-01-14 18:31             ` Ben Greear
@ 2012-01-20  2:54               ` Eric W. Biederman
  0 siblings, 0 replies; 14+ messages in thread
From: Eric W. Biederman @ 2012-01-20  2:54 UTC (permalink / raw)
  To: Ben Greear; +Cc: Eric Dumazet, Francesco Ruggeri, netdev, Stephen Hemminger

Ben Greear <greearb@candelatech.com> writes:

> On 01/13/2012 09:46 PM, Eric W. Biederman wrote:
>> Ben Greear<greearb@candelatech.com>  writes:
>>
>>> On 01/12/2012 11:40 PM, Eric W. Biederman wrote:
>>>
>>>> So I really think the best solution to avoid the locking craziness is to
>>>> have a wrapper that gets the value from userspace and calls
>>>> schedule_work to get another thread to actually process the change.  I
>>>> don't see any problems with writing a helper function for that.  The
>>>> only downside with using schedule_work is that we return to userspace
>>>> before the change has been fully installed in the kernel.  I don't
>>>> expect that would be a problem but stranger things have happened.
>>>
>>> That sounds a bit risky to me.  If something sets a value, and then
>>> queries it, it should always show the proper result for the previous
>>> calls.
>>
>> Which is easy to do if you keep two values.  One integer
>> for the userspace control and another integer for the internal
>> kernel state.
>>
>> The problem is that we have exactly one integer currently.
>>
>>> If the queries also went through the the same sched-work queue
>>> then maybe it would be OK.
>>
>> We can't want for anything that has to take the rtnl_lock.  That would
>> be the same as taking the rtnl_lock from a locking perspective.
>>
>> I expect I would use something like:
>> struct rtnl_protected_knob {
>> 	struct work_struct work;
>>          int userspace_value;
>>          int *kernel_var;
>>          void (*func)(int new_value, *kernel_var);
>> };
>>
>> userspace_value would be what userspace sees, and kernel_var would be a
>> pointer to the value that we manipulate in the kernel.
>
> What if valid values are 0-5 and user sets value to 6 and then immediately
> queries the value?  Would your method possibly return 6, when in fact when
> the kernel does the work it will internally either reject the setting and
> stay with the old value or round the 6 down to 5?

Cancel this schedule_work suggestion.  I have played with the problem a bit
and it looks feasible to add some unlocked notifications (called
asynchronously from workqueues) to add and delete the sysfs, proc and sysctl
bits.  At which point we can then just do a straight forward rtnl_lock
in the handlers.

I still think Francesco's patch looks like the best suggestion I have
seen so far.  Francesco's patch fixes a location where we don't have the
rtnl_lock when we want it and it gets it handles the rest in a simple
straight forward way.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2012-01-20  2:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-12  2:13 Race condition in ipv6 code Francesco Ruggeri
2012-01-12  6:31 ` Eric Dumazet
2012-01-12  6:44   ` David Miller
2012-01-12 20:48     ` Francesco Ruggeri
2012-01-13  0:11   ` Eric W. Biederman
2012-01-13  6:02     ` Eric Dumazet
2012-01-13  7:40       ` Eric W. Biederman
2012-01-13 17:04         ` Ben Greear
2012-01-14  5:46           ` Eric W. Biederman
2012-01-14 18:31             ` Ben Greear
2012-01-20  2:54               ` Eric W. Biederman
2012-01-13  1:17 ` Eric W. Biederman
2012-01-13  1:57   ` Stephen Hemminger
2012-01-13 22:02   ` Francesco Ruggeri

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.