linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Possible netlink autobind regression
@ 2015-09-17  2:29 Tejun Heo
  2015-09-17  3:08 ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-17  2:29 UTC (permalink / raw)
  To: herbert; +Cc: davem, tom, kafai, kernel-team, linux-kernel, netdev

Hello,

We're seeing processes piling up on audit_cmd_mutex and after some
digging it turns out sometimes audit_receive() ends up recursing into
itself causing an A-A deadlock on audit_cmd_mutex.  Here's the
backtrace.

 PID: 1939995  TASK: ffff88085bdde360  CPU: 27  COMMAND: "crond"
  #0 [ffff880369513ae0] __schedule at ffffffff818c49e2
  #1 [ffff880369513b30] schedule at ffffffff818c50e7
  #2 [ffff880369513b50] schedule_preempt_disabled at ffffffff818c52ce
  #3 [ffff880369513b60] __mutex_lock_slowpath at ffffffff818c6ad2
  #4 [ffff880369513bc0] mutex_lock at ffffffff818c6b5b
  #5 [ffff880369513be0] audit_receive at ffffffff810de5f1
  #6 [ffff880369513c10] netlink_unicast at ffffffff817c419b
  #7 [ffff880369513c60] netlink_ack at ffffffff817c4764
  #8 [ffff880369513ca0] audit_receive at ffffffff810de65a
  #9 [ffff880369513cd0] netlink_unicast at ffffffff817c419b
 #10 [ffff880369513d20] netlink_sendmsg at ffffffff817c450a
 #11 [ffff880369513da0] sock_sendmsg at ffffffff817792ba
 #12 [ffff880369513e30] SYSC_sendto at ffffffff81779ce0
 #13 [ffff880369513f70] sys_sendto at ffffffff8177a27e
 #14 [ffff880369513f80] system_call_fastpath at ffffffff818c8772
     RIP: 00007fbf652e6913  RSP: 00007ffca11d08d8  RFLAGS: 00010202
     RAX: ffffffffffffffda  RBX: ffffffff818c8772  RCX: 0000000000000000
     RDX: 0000000000000070  RSI: 00007ffca11d2c50  RDI: 0000000000000003
     RBP: 00007ffca11d2c50   R8: 00007ffca11d08f0   R9: 000000000000000c
     R10: 0000000000000000  R11: 0000000000000246  R12: ffffffff8177a27e
     R13: ffff880369513f78  R14: 0000000000000060  R15: 0000000000000070
     ORIG_RAX: 000000000000002c  CS: 0033  SS: 002b

The weird thing is that netlink_ack() ends up trying to send ack to
the kernel socket instead of the userland sender.  This seems only
possible if the issuing socket's portid is 0 for some reason.

c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
made netlink_insert() reset nlk_sk(sk)->portid to zero if insertion
fails.  The description says that it prevents sockets which lost
autobinding from being stuck in limbo.  I could be mistaken but the
description doesn't seem to be enough - for unconnected sendmsg users,
it looks like the socket would just use the wrong port number likely
occupied by someone else from the second try which would hose ack
routing.

Anyways, after the patch, it seems something like the following could
happen.

  thread 1			thread 2
 -----------------------------------------------------------------------
  netlink_sendmsg()
  netlink_autobind()
  netlink_insert()
    nlk->portid = portid;
    __netlink_insert() fails

				 netlink_sendmsg()
				   nlk->portid is not zero,
				   skip autobinding
    nlk->portid = 0;
				 NETLINK_CB(skb).portid is set to zero
				 ...
				 netlink_ack(skb)
				   tries to send ack to kernel rx socket


If the above could happen, it would explain the deadlock.  The root
problem is that whether to autobind or not is determined by testing
whether nlk->portid is zero without synchronizing against autobind
path and the autobind path may temporarily set and then clear portid
thus tricking the unsynchronized test.  The issue can be fixed by only
setting portid after it's known that the port can be used.

That particular issue can be fixed by the following incorrect patch
which needs more work as it ends up temporarily hashing sockets with a
mismatching portid.

Does this make any sense or am I chasing wild geese?

Thanks.

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..6835426 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1037,11 +1037,12 @@ static struct sock *__netlink_lookup(struct netlink_table *table, u32 portid,
 				      netlink_rhashtable_params);
 }
 
-static int __netlink_insert(struct netlink_table *table, struct sock *sk)
+static int __netlink_insert(struct netlink_table *table, struct sock *sk,
+			    int portid)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1103,11 +1104,12 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
-	err = __netlink_insert(table, sk);
-	if (err) {
+	err = __netlink_insert(table, sk, portid);
+	if (!err) {
+		nlk_sk(sk)->portid = portid;
+	} else {
 		/* In case the hashtable backend returns with -EBUSY
 		 * from here, it must not escape to the caller.
 		 */
@@ -1115,7 +1117,6 @@ static int netlink_insert(struct sock *sk, u32 portid)
 			err = -EOVERFLOW;
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
 	}
 

Thanks.

-- 
tejun

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: Possible netlink autobind regression
  2015-09-17  2:29 Possible netlink autobind regression Tejun Heo
@ 2015-09-17  3:08 ` Herbert Xu
  2015-09-17  3:41   ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-17  3:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: davem, tom, kafai, kernel-team, linux-kernel, netdev, Linus Torvalds

On Wed, Sep 16, 2015 at 10:29:09PM -0400, Tejun Heo wrote:
>
> Anyways, after the patch, it seems something like the following could
> happen.

Good catch! I think your explanation makes perfect sense.  Linus
ran into this previously too after suspend-and-resume.

I'll look into it.

Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Possible netlink autobind regression
  2015-09-17  3:08 ` Herbert Xu
@ 2015-09-17  3:41   ` Herbert Xu
  2015-09-17  5:02     ` Cong Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-17  3:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: davem, tom, kafai, kernel-team, linux-kernel, netdev,
	Linus Torvalds, Jiri Pirko, Cong Wang, Nicolas Dichtel,
	Thomas Graf, Scott Feldman

On Thu, Sep 17, 2015 at 11:08:45AM +0800, Herbert Xu wrote:
> 
> Good catch! I think your explanation makes perfect sense.  Linus
> ran into this previously too after suspend-and-resume.

Unfortunately you can't just postpone the setting of portid because
once you pass it onto rhashtable the portid must never change while
it's in custody.

So what I've done is essentially revert my previous fix and instead
add a new boolean "bound" to indicate whether the socket has been
bound.

---8<---
netlink: Fix autobind race condition that leads to zero port ID

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads tried to autobind the same socket
one of them may end up with a zero port ID.

This patch reverts that commit and instead fixes it by introducing
a separte "bound" variable to indicate whether a socket has been
bound.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..abcbca5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1095,7 +1095,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	lock_sock(sk);
 
 	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1115,10 +1115,12 @@ static int netlink_insert(struct sock *sk, u32 portid)
 			err = -EOVERFLOW;
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
+		goto err;
 	}
 
+	nlk_sk(sk)->bound = true;
+
 err:
 	release_sock(sk);
 	return err;
@@ -1285,7 +1287,7 @@ static int netlink_release(struct socket *sock)
 
 	skb_queue_purge(&sk->sk_write_queue);
 
-	if (nlk->portid) {
+	if (nlk->bound) {
 		struct netlink_notify n = {
 						.net = sock_net(sk),
 						.protocol = sk->sk_protocol,
@@ -1519,7 +1521,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	if (nlk->bound)
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
 
@@ -1537,7 +1539,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1585,7 +1587,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2426,7 +2428,7 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..e6aae40 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -35,6 +35,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: Possible netlink autobind regression
  2015-09-17  3:41   ` Herbert Xu
@ 2015-09-17  5:02     ` Cong Wang
  2015-09-17  5:15       ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Cong Wang @ 2015-09-17  5:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tejun Heo, David Miller, Tom Herbert, kafai, kernel-team,
	linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Thomas Graf, Scott Feldman

On Wed, Sep 16, 2015 at 8:41 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Sep 17, 2015 at 11:08:45AM +0800, Herbert Xu wrote:
>>
>> Good catch! I think your explanation makes perfect sense.  Linus
>> ran into this previously too after suspend-and-resume.
>
> Unfortunately you can't just postpone the setting of portid because
> once you pass it onto rhashtable the portid must never change while
> it's in custody.
>
> So what I've done is essentially revert my previous fix and instead
> add a new boolean "bound" to indicate whether the socket has been
> bound.
>
> ---8<---
> netlink: Fix autobind race condition that leads to zero port ID
>
> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads tried to autobind the same socket
> one of them may end up with a zero port ID.
>
> This patch reverts that commit and instead fixes it by introducing
> a separte "bound" variable to indicate whether a socket has been
> bound.
>
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>

We saw similar soft lockup with the one Tejun reported, in our data
center.


> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Just one comment below.

[...]

> @@ -1285,7 +1287,7 @@ static int netlink_release(struct socket *sock)
>
>         skb_queue_purge(&sk->sk_write_queue);
>
> -       if (nlk->portid) {
> +       if (nlk->bound) {
>                 struct netlink_notify n = {
>                                                 .net = sock_net(sk),
>                                                 .protocol = sk->sk_protocol,

This part doesn't look correct, seems it is checking if this is a kernel
netlink socket rather than if it is bound. But I am not sure...

Other than this, looks good to me:

Reviewed-by: Cong Wang <cwang@twopensource.com>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Possible netlink autobind regression
  2015-09-17  5:02     ` Cong Wang
@ 2015-09-17  5:15       ` Herbert Xu
  2015-09-17 11:25         ` Thomas Graf
  2015-09-17 11:30         ` Tejun Heo
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2015-09-17  5:15 UTC (permalink / raw)
  To: Cong Wang
  Cc: Tejun Heo, David Miller, Tom Herbert, kafai, kernel-team,
	linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Thomas Graf, Scott Feldman

On Wed, Sep 16, 2015 at 10:02:00PM -0700, Cong Wang wrote:
>
> This part doesn't look correct, seems it is checking if this is a kernel
> netlink socket rather than if it is bound. But I am not sure...

Good point.  I've changed it so that bound is only set for non-kernel
sockets.

---8<---
netlink: Fix autobind race condition that leads to zero port ID

The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads tried to autobind the same socket
one of them may end up with a zero port ID.

This patch reverts that commit and instead fixes it by introducing
a separte "bound" variable to indicate whether a user-space socket
has been bound.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Cong Wang <cwang@twopensource.com>


diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dea9253..42013c5 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1068,7 +1068,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	lock_sock(sk);
 
 	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1083,10 +1083,12 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	if (err) {
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
+		goto err;
 	}
 
+	nlk_sk(sk)->bound = !!portid;
+
 err:
 	release_sock(sk);
 	return err;
@@ -1253,7 +1255,7 @@ static int netlink_release(struct socket *sock)
 
 	skb_queue_purge(&sk->sk_write_queue);
 
-	if (nlk->portid) {
+	if (nlk->bound) {
 		struct netlink_notify n = {
 						.net = sock_net(sk),
 						.protocol = sk->sk_protocol,
@@ -1487,7 +1489,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	if (nlk->bound)
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
 
@@ -1505,7 +1507,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1553,7 +1555,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2371,7 +2373,7 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..e6aae40 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -35,6 +35,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;

-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: Possible netlink autobind regression
  2015-09-17  5:15       ` Herbert Xu
@ 2015-09-17 11:25         ` Thomas Graf
  2015-09-17 11:30         ` Tejun Heo
  1 sibling, 0 replies; 55+ messages in thread
From: Thomas Graf @ 2015-09-17 11:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Cong Wang, Tejun Heo, David Miller, Tom Herbert, kafai,
	kernel-team, linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Scott Feldman

On 09/17/15 at 01:15pm, Herbert Xu wrote:
> On Wed, Sep 16, 2015 at 10:02:00PM -0700, Cong Wang wrote:
> >
> > This part doesn't look correct, seems it is checking if this is a kernel
> > netlink socket rather than if it is bound. But I am not sure...
> 
> Good point.  I've changed it so that bound is only set for non-kernel
> sockets.
> 
> ---8<---
> netlink: Fix autobind race condition that leads to zero port ID
> 
> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads tried to autobind the same socket
> one of them may end up with a zero port ID.
> 
> This patch reverts that commit and instead fixes it by introducing
> a separte "bound" variable to indicate whether a user-space socket
> has been bound.
> 
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> Reviewed-by: Cong Wang <cwang@twopensource.com>

Acked-by: Thomas Graf <tgraf@suug.ch>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Possible netlink autobind regression
  2015-09-17  5:15       ` Herbert Xu
  2015-09-17 11:25         ` Thomas Graf
@ 2015-09-17 11:30         ` Tejun Heo
  2015-09-18  6:36           ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Herbert Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-17 11:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Cong Wang, David Miller, Tom Herbert, kafai, kernel-team,
	linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Thomas Graf, Scott Feldman

Hello, Herbert.

On Thu, Sep 17, 2015 at 01:15:03PM +0800, Herbert Xu wrote:
> netlink: Fix autobind race condition that leads to zero port ID
> 
> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads tried to autobind the same socket
> one of them may end up with a zero port ID.
>
> This patch reverts that commit and instead fixes it by introducing
> a separte "bound" variable to indicate whether a user-space socket
> has been bound.
> 
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> Reviewed-by: Cong Wang <cwang@twopensource.com>

Maybe add that this led to a deadlock and add a Link tag to this
thread?

> @@ -1083,10 +1083,12 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  	if (err) {
>  		if (err == -EEXIST)
>  			err = -EADDRINUSE;
> -		nlk_sk(sk)->portid = 0;
>  		sock_put(sk);
> +		goto err;
>  	}
>  
> +	nlk_sk(sk)->bound = !!portid;

!! isn't necessasry and this creates ordering between two stores.
->bound must be visible only after ->portid is visible, so this should
be smp_store_release().

> @@ -2371,7 +2373,7 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
>  		dst_group = nlk->dst_group;
>  	}
>  
> -	if (!nlk->portid) {
> +	if (!nlk->bound) {

And all unlocked reads should be smp_load_acquire().

>  		err = netlink_autobind(sock);
>  		if (err)
>  			goto out;

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID
  2015-09-17 11:30         ` Tejun Heo
@ 2015-09-18  6:36           ` Herbert Xu
  2015-09-18 11:16             ` [PATCH v4] " Herbert Xu
  2015-09-18 13:37             ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Tejun Heo
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2015-09-18  6:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cong Wang, David Miller, Tom Herbert, kafai, kernel-team,
	linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Thomas Graf, Scott Feldman

On Thu, Sep 17, 2015 at 07:30:34AM -0400, Tejun Heo wrote:
>
> Maybe add that this led to a deadlock and add a Link tag to this
> thread?

I'll add a note about the deadlock but I don't like Link tags
because websites die and you can always just google the patch
subject.

> > +	nlk_sk(sk)->bound = !!portid;
> 
> !! isn't necessasry and this creates ordering between two stores.

!! was necessary because we're going from a u32 to a bool.

> ->bound must be visible only after ->portid is visible, so this should
> be smp_store_release().

But yes there are ordering issues here so I've decided to make
rhashtable use a new field for its hash instead.

Note that I've dropped the acks as this patch is now substantially
different.

---8<---
The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dea9253..7157bad 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -988,7 +988,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->portid != x->portid ||
+	return nlk->rhash_portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1014,7 +1014,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1076,17 +1076,19 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->portid = portid;
+	nlk_sk(sk)->rhash_portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
 	if (err) {
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
+		goto err;
 	}
 
+	nlk_sk(sk)->portid = portid;
+
 err:
 	release_sock(sk);
 	return err;
@@ -3197,7 +3199,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..c96dfa3 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,6 +25,7 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
+	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH v4] netlink: Fix autobind race condition that leads to zero port ID
  2015-09-18  6:36           ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Herbert Xu
@ 2015-09-18 11:16             ` Herbert Xu
  2015-09-21  5:55               ` David Miller
  2015-09-18 13:37             ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-18 11:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cong Wang, David Miller, Tom Herbert, kafai, kernel-team,
	linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Thomas Graf, Scott Feldman

On Fri, Sep 18, 2015 at 02:36:09PM +0800, Herbert Xu wrote:
> 
> But yes there are ordering issues here so I've decided to make
> rhashtable use a new field for its hash instead.
> 
> Note that I've dropped the acks as this patch is now substantially
> different.

v3 doesn't apply to the current tree anymore so here is a respin
rebased on rc1.

---8<---
The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b..303efb7 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1015,7 +1015,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->portid != x->portid ||
+	return nlk->rhash_portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1041,7 +1041,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1103,7 +1103,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->portid = portid;
+	nlk_sk(sk)->rhash_portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1115,10 +1115,12 @@ static int netlink_insert(struct sock *sk, u32 portid)
 			err = -EOVERFLOW;
 		if (err == -EEXIST)
 			err = -EADDRINUSE;
-		nlk_sk(sk)->portid = 0;
 		sock_put(sk);
+		goto err;
 	}
 
+	nlk_sk(sk)->portid = portid;
+
 err:
 	release_sock(sk);
 	return err;
@@ -3255,7 +3257,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 8900840..c96dfa3 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,6 +25,7 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
+	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID
  2015-09-18  6:36           ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Herbert Xu
  2015-09-18 11:16             ` [PATCH v4] " Herbert Xu
@ 2015-09-18 13:37             ` Tejun Heo
  1 sibling, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2015-09-18 13:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Cong Wang, David Miller, Tom Herbert, kafai, kernel-team,
	linux-kernel, netdev, Linus Torvalds, Jiri Pirko,
	Nicolas Dichtel, Thomas Graf, Scott Feldman

Hello, Herbert.

On Fri, Sep 18, 2015 at 02:36:10PM +0800, Herbert Xu wrote:
> On Thu, Sep 17, 2015 at 07:30:34AM -0400, Tejun Heo wrote:
> >
> > Maybe add that this led to a deadlock and add a Link tag to this
> > thread?
> 
> I'll add a note about the deadlock but I don't like Link tags
> because websites die and you can always just google the patch
> subject.

That's why we use http://lkml.kernel.org/r/MSG_ID links.

> > > +	nlk_sk(sk)->bound = !!portid;
> > 
> > !! isn't necessasry and this creates ordering between two stores.
> 
> !! was necessary because we're going from a u32 to a bool.

bool casting actually collapses the source value to a boolean value.
No need for casting regardless of data type.

> @@ -1076,17 +1076,19 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
>  		goto err;
>  
> -	nlk_sk(sk)->portid = portid;
> +	nlk_sk(sk)->rhash_portid = portid;
>  	sock_hold(sk);
>  
>  	err = __netlink_insert(table, sk);
>  	if (err) {
>  		if (err == -EEXIST)
>  			err = -EADDRINUSE;
> -		nlk_sk(sk)->portid = 0;
>  		sock_put(sk);
> +		goto err;
>  	}
>  
> +	nlk_sk(sk)->portid = portid;

So, this doesn't necessarily make the ordering problem go away.  The
hash lookup would be fine but imagine a code path like the following.

	rcu_read_lock();
	sock = rhash lookup(some port number);
	do some operation which may use sock->portid;
	rcu_read_unlock();

Now, that some operation may see 0 as the port number.  I don't think
you can avoid doing some type of memory barrier operations if you
wanna gate autobind w/o grabbing locks.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4] netlink: Fix autobind race condition that leads to zero port ID
  2015-09-18 11:16             ` [PATCH v4] " Herbert Xu
@ 2015-09-21  5:55               ` David Miller
  2015-09-21  6:06                 ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: David Miller @ 2015-09-21  5:55 UTC (permalink / raw)
  To: herbert
  Cc: tj, cwang, tom, kafai, kernel-team, linux-kernel, netdev,
	torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 18 Sep 2015 19:16:50 +0800

> The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> Reset portid after netlink_insert failure") introduced a race
> condition where if two threads try to autobind the same socket
> one of them may end up with a zero port ID.  This led to kernel
> deadlocks that were observed by multiple people.
> 
> This patch reverts that commit and instead fixes it by introducing
> a separte rhash_portid variable so that the real portid is only set
> after the socket has been successfully hashed.
> 
> Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Applied and queued up for -stable, thanks Herbert.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4] netlink: Fix autobind race condition that leads to zero port ID
  2015-09-21  5:55               ` David Miller
@ 2015-09-21  6:06                 ` Herbert Xu
  2015-09-21  6:11                   ` David Miller
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-21  6:06 UTC (permalink / raw)
  To: David Miller
  Cc: tj, cwang, tom, kafai, kernel-team, linux-kernel, netdev,
	torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Sun, Sep 20, 2015 at 10:55:21PM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 18 Sep 2015 19:16:50 +0800
> 
> > The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
> > Reset portid after netlink_insert failure") introduced a race
> > condition where if two threads try to autobind the same socket
> > one of them may end up with a zero port ID.  This led to kernel
> > deadlocks that were observed by multiple people.
> > 
> > This patch reverts that commit and instead fixes it by introducing
> > a separte rhash_portid variable so that the real portid is only set
> > after the socket has been successfully hashed.
> > 
> > Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> Applied and queued up for -stable, thanks Herbert.

Sorry but Dave but there are still races with v4 as Tejun pointed
out.  I'm still working on it and I could post them as incremental
patches if that's the easiest.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v4] netlink: Fix autobind race condition that leads to zero port ID
  2015-09-21  6:06                 ` Herbert Xu
@ 2015-09-21  6:11                   ` David Miller
  2015-09-21 13:34                     ` netlink: Replace rhash_portid with bound Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: David Miller @ 2015-09-21  6:11 UTC (permalink / raw)
  To: herbert
  Cc: tj, cwang, tom, kafai, kernel-team, linux-kernel, netdev,
	torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 21 Sep 2015 14:06:36 +0800

> On Sun, Sep 20, 2015 at 10:55:21PM -0700, David Miller wrote:
>> From: Herbert Xu <herbert@gondor.apana.org.au>
>> Date: Fri, 18 Sep 2015 19:16:50 +0800
>> 
>> > The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
>> > Reset portid after netlink_insert failure") introduced a race
>> > condition where if two threads try to autobind the same socket
>> > one of them may end up with a zero port ID.  This led to kernel
>> > deadlocks that were observed by multiple people.
>> > 
>> > This patch reverts that commit and instead fixes it by introducing
>> > a separte rhash_portid variable so that the real portid is only set
>> > after the socket has been successfully hashed.
>> > 
>> > Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
>> > Reported-by: Tejun Heo <tj@kernel.org>
>> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
>> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
>> 
>> Applied and queued up for -stable, thanks Herbert.
> 
> Sorry but Dave but there are still races with v4 as Tejun pointed
> out.  I'm still working on it and I could post them as incremental
> patches if that's the easiest.

Oops, sorry about that.

Yeah at this point incremental patches work the best.

Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* netlink: Replace rhash_portid with bound
  2015-09-21  6:11                   ` David Miller
@ 2015-09-21 13:34                     ` Herbert Xu
  2015-09-21 18:20                       ` Tejun Heo
  2015-09-21 20:52                       ` [PATCH] netlink: Replace rhash_portid with load_acquire protected boolean Tejun Heo
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2015-09-21 13:34 UTC (permalink / raw)
  To: David Miller
  Cc: tj, cwang, tom, kafai, kernel-team, linux-kernel, netdev,
	torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Sun, Sep 20, 2015 at 11:11:04PM -0700, David Miller wrote:
>
> Yeah at this point incremental patches work the best.

OK here is the patch:

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid is used.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 303efb7..f5362aae 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -24,6 +24,7 @@
 
 #include <linux/module.h>
 
+#include <asm/barrier.h>
 #include <linux/capability.h>
 #include <linux/kernel.h>
 #include <linux/init.h>
@@ -1015,7 +1016,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->rhash_portid != x->portid ||
+	return nlk->portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1041,7 +1042,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1095,7 +1096,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	lock_sock(sk);
 
 	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1103,7 +1104,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->rhash_portid = portid;
+	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1119,7 +1120,11 @@ static int netlink_insert(struct sock *sk, u32 portid)
 		goto err;
 	}
 
-	nlk_sk(sk)->portid = portid;
+	/* rhashtable_insert carries an implicit write memory barrier
+	 * so we don't need an smp_wmb here in order to ensure that
+	 * portid is set before bound.
+	 */
+	nlk_sk(sk)->bound = portid;
 
 err:
 	release_sock(sk);
@@ -1521,9 +1526,11 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	/* Ensure nlk->portid is up-to-date. */
+	if (smp_load_acquire(&nlk->bound)) {
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
+	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2428,7 +2435,8 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	/* Ensure nlk->portid is up-to-date. */
+	if (!smp_load_acquire(&nlk->bound)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
@@ -3257,7 +3265,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index c96dfa3..e6aae40 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,7 +25,6 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
-	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
@@ -36,6 +35,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: netlink: Replace rhash_portid with bound
  2015-09-21 13:34                     ` netlink: Replace rhash_portid with bound Herbert Xu
@ 2015-09-21 18:20                       ` Tejun Heo
  2015-09-22  3:38                         ` [PATCH v2] " Herbert Xu
  2015-09-21 20:52                       ` [PATCH] netlink: Replace rhash_portid with load_acquire protected boolean Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-21 18:20 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Mon, Sep 21, 2015 at 09:34:16PM +0800, Herbert Xu wrote:
> @@ -1119,7 +1120,11 @@ static int netlink_insert(struct sock *sk, u32 portid)
>  		goto err;
>  	}
>  
> -	nlk_sk(sk)->portid = portid;
> +	/* rhashtable_insert carries an implicit write memory barrier
> +	 * so we don't need an smp_wmb here in order to ensure that
> +	 * portid is set before bound.
> +	 */
> +	nlk_sk(sk)->bound = portid;

store_release and load_acquire are different from the usual memory
barriers and can't be paired this way.  You have to pair store_release
and load_acquire.  Besides, it isn't a particularly good idea to
depend on memory barriers embedded in other data structures like the
above.  Here, especially, rhashtable_insert() would have write barrier
*before* the entry is hashed not necessarily *after*, which means that
in the above case, a socket which appears to have set bound to a
reader might not visible when the reader tries to look up the socket
on the hashtable.

There's no reason to be overly smart here.  This isn't a crazy hot
path, write barriers tend to be very cheap, store_release more so.
Please just do smp_store_release() and note what it's paired with.

> @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
>  		}
>  	}
>  
> -	if (!nlk->portid) {
> +	if (!nlk->bound) {

I don't think you can skip load_acquire here just because this is the
second deref of the variable.  That doesn't change anything.  Race
condition could still happen between the first and second tests and
skipping the second would lead to the same kind of bug.

> @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
>  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
>  		return -EPERM;
>  
> -	if (!nlk->portid)
> +	if (!nlk->bound)

Don't we need load_acquire here too?  Is this path holding a lock
which makes that unnecessary?

I'd suggest making it clear that ->bound is internal (name it
->__bound or sth) and provide a test macro which always uses
load_acquire.  It could be that there are a couple places which can
avoid load_acquire but it just isn't worth it.  load_acquire is very
cheap but bugs around it can be extremely subtle.  Let's please keep
it straight-forward.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH] netlink: Replace rhash_portid with load_acquire protected boolean
  2015-09-21 13:34                     ` netlink: Replace rhash_portid with bound Herbert Xu
  2015-09-21 18:20                       ` Tejun Heo
@ 2015-09-21 20:52                       ` Tejun Heo
  1 sibling, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2015-09-21 20:52 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello,

Here's an updated version of Herbert's patch which always uses
load_acquire through a helper.

Thanks.
----- 8< -----
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink: Fix
autobind race condition that leads to zero port ID") created some new
races that can occur due to inconsistencies between the two port IDs -
a reader may see zero nlk->portid after seeing non-zero
nlk->rhash_portid.

This patch reverts the original patch and instead uses a load_acquire
protected boolean to indicate that a user netlink socket has been
bound.  The boolean is set with store_release only after the portid is
assigned and the socket is hashed.  The readers test with load_acquire
so that the socket is guaranteed to be visible with a valid port
number and hashed on a true return.

As this sort of lockless tests can be broken in ways which are very
difficult to track down, the boolean field is prefixed with double
underscores and a dedicated test helper with load_acquire is always
used.  While a couple test sites might not strictly require
load_acquire, micro-optimization at this level doens't make sense
given the danger of subtle breakages.

tj: Took Herbert's patch and updated so that all readers test via a
    helper which does load_acquire.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Original-patch-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 net/netlink/af_netlink.c |   24 ++++++++++++++----------
 net/netlink/af_netlink.h |   20 +++++++++++++++++++-
 2 files changed, 33 insertions(+), 11 deletions(-)

--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1015,7 +1015,7 @@ static inline int netlink_compare(struct
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->rhash_portid != x->portid ||
+	return nlk->portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1041,7 +1041,7 @@ static int __netlink_insert(struct netli
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1095,7 +1095,7 @@ static int netlink_insert(struct sock *s
 	lock_sock(sk);
 
 	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	if (nlk_bound(nlk_sk(sk)))
 		goto err;
 
 	err = -ENOMEM;
@@ -1103,7 +1103,7 @@ static int netlink_insert(struct sock *s
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->rhash_portid = portid;
+	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1119,7 +1119,8 @@ static int netlink_insert(struct sock *s
 		goto err;
 	}
 
-	nlk_sk(sk)->portid = portid;
+	/* See nlk_bound(). */
+	smp_store_release(&nlk_sk(sk)->__bound, portid);
 
 err:
 	release_sock(sk);
@@ -1521,9 +1522,11 @@ static int netlink_bind(struct socket *s
 			return err;
 	}
 
-	if (nlk->portid)
+	/* Ensure nlk->portid is up-to-date. */
+	if (nlk_bound(nlk)) {
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
+	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1539,7 +1542,7 @@ static int netlink_bind(struct socket *s
 		}
 	}
 
-	if (!nlk->portid) {
+	if (!nlk_bound(nlk)) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1587,7 +1590,7 @@ static int netlink_connect(struct socket
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	if (!nlk_bound(nlk))
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2428,7 +2431,8 @@ static int netlink_sendmsg(struct socket
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	/* Ensure nlk->portid is up-to-date. */
+	if (!nlk_bound(nlk)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
@@ -3257,7 +3261,7 @@ static inline u32 netlink_hash(const voi
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -3,6 +3,7 @@
 
 #include <linux/rhashtable.h>
 #include <linux/atomic.h>
+#include <asm/barrier.h>
 #include <net/sock.h>
 
 #define NLGRPSZ(x)	(ALIGN(x, sizeof(unsigned long) * 8) / 8)
@@ -25,7 +26,6 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
-	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
@@ -36,6 +36,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			__bound;	/* always use nlk_bound() */
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
@@ -60,6 +61,23 @@ static inline struct netlink_sock *nlk_s
 	return container_of(sk, struct netlink_sock, sk);
 }
 
+/**
+ * nlk_bound - test whether a netlink_sock is bound to a port number
+ * @nlk: netlink_sock of interest
+ *
+ * Test whether @nlk is bound to a port number.  Can be called without any
+ * locks and guarantees no false positive - @nlk has a valid port number
+ * and is hashed on a true return.
+ */
+static inline bool nlk_bound(struct netlink_sock *nlk)
+{
+	/*
+	 * Paired with smp_store_release() in netlink_insert() to guarantee
+	 * the visibility of port number and hashing.
+	 */
+	return smp_load_acquire(&nlk->__bound);
+}
+
 struct netlink_table {
 	struct rhashtable	hash;
 	struct hlist_head	mc_list;

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-21 18:20                       ` Tejun Heo
@ 2015-09-22  3:38                         ` Herbert Xu
  2015-09-22 16:10                           ` Tejun Heo
  2015-09-24 19:11                           ` David Miller
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2015-09-22  3:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >  
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
> 
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >  
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
> 
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 303efb7..2c15fae 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1015,7 +1015,7 @@ static inline int netlink_compare(struct rhashtable_compare_arg *arg,
 	const struct netlink_compare_arg *x = arg->key;
 	const struct netlink_sock *nlk = ptr;
 
-	return nlk->rhash_portid != x->portid ||
+	return nlk->portid != x->portid ||
 	       !net_eq(sock_net(&nlk->sk), read_pnet(&x->pnet));
 }
 
@@ -1041,7 +1041,7 @@ static int __netlink_insert(struct netlink_table *table, struct sock *sk)
 {
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(sk), nlk_sk(sk)->portid);
 	return rhashtable_lookup_insert_key(&table->hash, &arg,
 					    &nlk_sk(sk)->node,
 					    netlink_rhashtable_params);
@@ -1094,8 +1094,8 @@ static int netlink_insert(struct sock *sk, u32 portid)
 
 	lock_sock(sk);
 
-	err = -EBUSY;
-	if (nlk_sk(sk)->portid)
+	err = nlk_sk(sk)->portid == portid ? 0 : -EBUSY;
+	if (nlk_sk(sk)->bound)
 		goto err;
 
 	err = -ENOMEM;
@@ -1103,7 +1103,7 @@ static int netlink_insert(struct sock *sk, u32 portid)
 	    unlikely(atomic_read(&table->hash.nelems) >= UINT_MAX))
 		goto err;
 
-	nlk_sk(sk)->rhash_portid = portid;
+	nlk_sk(sk)->portid = portid;
 	sock_hold(sk);
 
 	err = __netlink_insert(table, sk);
@@ -1119,7 +1119,9 @@ static int netlink_insert(struct sock *sk, u32 portid)
 		goto err;
 	}
 
-	nlk_sk(sk)->portid = portid;
+	/* We need to ensure that the socket is hashed and visible. */
+	smp_wmb();
+	nlk_sk(sk)->bound = portid;
 
 err:
 	release_sock(sk);
@@ -1505,6 +1507,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 	struct sockaddr_nl *nladdr = (struct sockaddr_nl *)addr;
 	int err;
 	long unsigned int groups = nladdr->nl_groups;
+	bool bound;
 
 	if (addr_len < sizeof(struct sockaddr_nl))
 		return -EINVAL;
@@ -1521,9 +1524,14 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	if (nlk->portid)
+	bound = nlk->bound;
+	if (bound) {
+		/* Ensure nlk->portid is up-to-date. */
+		smp_rmb();
+
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
+	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1539,7 +1547,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	if (!nlk->portid) {
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
 			netlink_autobind(sock);
@@ -1587,7 +1598,10 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	if (!nlk->portid)
+	/* No need for barriers here as we return to user-space without
+	 * using any of the bound attributes.
+	 */
+	if (!nlk->bound)
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2428,10 +2442,13 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->portid) {
+	if (!nlk->bound) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
+	} else {
+		/* Ensure nlk is hashed and visible. */
+		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped
@@ -3257,7 +3274,7 @@ static inline u32 netlink_hash(const void *data, u32 len, u32 seed)
 	const struct netlink_sock *nlk = data;
 	struct netlink_compare_arg arg;
 
-	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->rhash_portid);
+	netlink_compare_arg_init(&arg, sock_net(&nlk->sk), nlk->portid);
 	return jhash2((u32 *)&arg, netlink_compare_arg_len / sizeof(u32), seed);
 }
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index c96dfa3..e6aae40 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -25,7 +25,6 @@ struct netlink_ring {
 struct netlink_sock {
 	/* struct sock has to be the first member of netlink_sock */
 	struct sock		sk;
-	u32			rhash_portid;
 	u32			portid;
 	u32			dst_portid;
 	u32			dst_group;
@@ -36,6 +35,7 @@ struct netlink_sock {
 	unsigned long		state;
 	size_t			max_recvmsg_len;
 	wait_queue_head_t	wait;
+	bool			bound;
 	bool			cb_running;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22  3:38                         ` [PATCH v2] " Herbert Xu
@ 2015-09-22 16:10                           ` Tejun Heo
  2015-09-22 18:42                             ` Linus Torvalds
  2015-09-23  6:13                             ` Herbert Xu
  2015-09-24 19:11                           ` David Miller
  1 sibling, 2 replies; 55+ messages in thread
From: Tejun Heo @ 2015-09-22 16:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Tue, Sep 22, 2015 at 11:38:56AM +0800, Herbert Xu wrote:
> On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
> > store_release and load_acquire are different from the usual memory
> > barriers and can't be paired this way.  You have to pair store_release
> > and load_acquire.  Besides, it isn't a particularly good idea to
> 
> OK I've decided to drop the acquire/release helpers as they don't
> help us at all and simply pessimises the code by using full memory
> barriers (on some architectures) where only a write or read barrier
> is needed.

That's a pentium pro era errata.  Virtually no working machine is
affected by that anymore and nobody builds kernel with that option.
In most cases, store_release and load_acquire are cheaper as they're
more specific.  On x86, store_release and load_acquire boil down to
compiler reordering barriers.  You're running in the opposite
direction.

> > There's no reason to be overly smart here.  This isn't a crazy hot
> > path, write barriers tend to be very cheap, store_release more so.
> > Please just do smp_store_release() and note what it's paired with.
> 
> It's not about being overly smart.  It's about actually understanding
> what's going on with the code.  I've seen too many instances of
> people simply sprinkling synchronisation primitives around without
> any knowledge of what is happening underneath, which is just a recipe
> for creating hard-to-debug races.

I mean, read this thread.  It's one subtle breakage after another, one
confusion after another.  The problematic usages of memory barriers
are usually of two types.

1. Misunderstand what barriers do and fail to, most frequently, pair
   the barriers correctly.  This leads to things like lone smp_wmb()s
   which don't do anything but provide false sense of security.

2. The usage itself is correct but not properly documented and it's
   not clear what's being synchronized.  Because there's nothing
   inherently pairing the matching barrier pairs, w/o proper
   documentation, it can be very challenging to track down what is
   being synchronized making it difficult to tell this case from 1.
   Note that this is another reason smp_store_release() and
   smp_load_acquire() are just better.  Their specificity not only
   makes them lighter but also makes it a lot easier to track down
   what's going on.

So, when using barriers, we want to first pick the most specific
barrier pairs that suit the use case and clearly identify the matching
pairs and describe any subtlties if there's any.  Here, it's a
straight-forward writer to reader interlocking.

Once we meet the above criteria, we do want to be as simple as the use
case allows it to be.  e.g. if there are multiples readers,
encapsulate them and document them centrally.  If there are different
classes of readers and it's worthwhile to apply different
synchronization schemes to them, use separate helpers or clearly mark
and document exceptions with rationale.

> > I don't think you can skip load_acquire here just because this is the
> > second deref of the variable.  That doesn't change anything.  Race
> > condition could still happen between the first and second tests and
> > skipping the second would lead to the same kind of bug.
> 
> The reason this one is OK is because we do not use nlk->portid or
> try to get nlk from the hash table before we return to user-space.

What happens if somebody later adds code below that which happens to
use portid?  You're creating a booby trap and the patch isn't even
properly documenting what's going on.

> ---8<---
> The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> Fix autobind race condition that leads to zero port ID") created
> some new races that can occur due to inconcsistencies between the
> two port IDs.
> 
> Tejun is right that a barrier is unavoidable.  Therefore I am
> reverting to the original patch that used a boolean to indicate
> that a user netlink socket has been bound.
> 
> Barriers have been added where necessary to ensure that a valid
> portid and the hashed socket is visible.
> 
> I have also changed netlink_insert to only return EBUSY if the
> socket is bound to a portid different to the requested one.  This
> combined with only reading nlk->bound once in netlink_bind fixes
> a race where two threads that bind the socket at the same time
> with different port IDs may both succeed.
> 
> Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

This is a pretty misguided use of barriers.

Nacked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 16:10                           ` Tejun Heo
@ 2015-09-22 18:42                             ` Linus Torvalds
  2015-09-22 18:53                               ` Tejun Heo
  2015-09-23  6:13                             ` Herbert Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2015-09-22 18:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Herbert Xu, David Miller, Cong Wang, tom, kafai, kernel-team,
	Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

On Tue, Sep 22, 2015 at 9:10 AM, Tejun Heo <tj@kernel.org> wrote:
>
> That's a pentium pro era errata.  Virtually no working machine is
> affected by that anymore and nobody builds kernel with that option.
> In most cases, store_release and load_acquire are cheaper as they're
> more specific.  On x86, store_release and load_acquire boil down to
> compiler reordering barriers.  You're running in the opposite
> direction.

Well, to be fair, there are lots of machines where acquire/release is
actually quite expensive.

In general, the cheapest barrier there is (apart from the "no barrier
at all" or just "compiler barrier") is "smp_wmb()".  If an
architecture gets that one wrong, the architects were f*cking morons.
It should be a fundamentally cheap operation, since writes are
buffered and it should simply be a buffer barrier.

The acquire/release things are generally fairly cheap on modern
architectures. Not free (except on x86), but fairly low-cost. HOWEVER,
they are not at all free on some older architectures, including 32-bit
ARM.

smp_rmb() should generally be about the same cost as an acquire. It
can go either way.

So *if* the algorithm is amenable to smp_wmb()/smp_rmb() kind of
barriers, that's actually quite possibly better than acquire/release.

smp_mb() is expensive pretty much everywhere.

Looking forward, I suspect long-term acquire/release is what hardware
is going to be "reasonably good at", but as things are right now, you
can't necessarily rely on them being fast.

                   Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 18:42                             ` Linus Torvalds
@ 2015-09-22 18:53                               ` Tejun Heo
  2015-09-22 19:28                                 ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-22 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, David Miller, Cong Wang, tom, kafai, kernel-team,
	Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

Hello, Linus.

On Tue, Sep 22, 2015 at 11:42:33AM -0700, Linus Torvalds wrote:
...
> smp_rmb() should generally be about the same cost as an acquire. It
> can go either way.
> 
> So *if* the algorithm is amenable to smp_wmb()/smp_rmb() kind of
> barriers, that's actually quite possibly better than acquire/release.

I see.  The write path here is cold so the competition is between rmb
and acquire.  Unless some significant archs completely screwed it up,
acquire still seems like the better option.  It's essentially free on
x86 after all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 18:53                               ` Tejun Heo
@ 2015-09-22 19:28                                 ` Linus Torvalds
  2015-09-22 19:50                                   ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2015-09-22 19:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Herbert Xu, David Miller, Cong Wang, tom, kafai, kernel-team,
	Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

On Tue, Sep 22, 2015 at 11:53 AM, Tejun Heo <tj@kernel.org> wrote:
>
> I see.  The write path here is cold so the competition is between rmb
> and acquire.  Unless some significant archs completely screwed it up,
> acquire still seems like the better option.  It's essentially free on
> x86 after all.

Both acquire and smp_rmb() are equally free on x86.

It appears that we don't do the X86_PPRO_FENCE bug handling for
acquire, but I guess we should.

Or possibly we should start deprecating the insane X86_PPRO_FENCE
entirely. It's very costly, for dubious advantages.

                  Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 19:28                                 ` Linus Torvalds
@ 2015-09-22 19:50                                   ` Tejun Heo
  2015-09-22 20:03                                     ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-22 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Herbert Xu, David Miller, Cong Wang, tom, kafai, kernel-team,
	Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

Hello,

On Tue, Sep 22, 2015 at 12:28:45PM -0700, Linus Torvalds wrote:
> Both acquire and smp_rmb() are equally free on x86.

Was this always like this?  Ah, okay, from 2007.  Was thinking it's
still doing an actual rmb.  Talk about being confused.

> It appears that we don't do the X86_PPRO_FENCE bug handling for
> acquire, but I guess we should.

Hmmm?  We're doing that.  PPRO_FENCE makes both acquire and release
use smp_mb().

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 19:50                                   ` Tejun Heo
@ 2015-09-22 20:03                                     ` Linus Torvalds
  2015-09-22 20:36                                       ` Bjørn Mork
  0 siblings, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2015-09-22 20:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Herbert Xu, David Miller, Cong Wang, Tom Herbert, kafai,
	kernel-team, Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

On Tue, Sep 22, 2015 at 12:50 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Hmmm?  We're doing that.  PPRO_FENCE makes both acquire and release
> use smp_mb().

Oh, right you are. I just grepped for rmb, and missed it because as
you say, it's a full mb.

The PPRO_FENCE_BUG thing is rather questionable. I'm not sure it's
even documented, but I can't find the official PPro errata now. I
think I discussed it with Andy Glew back when Intel was starting to
document the memory ordering rules.

I'd be willing to get rid of it. It was apparently just an early
stepping or two.

Anybody with better google-fu than me who can find the official Intel
PPro errata?

                 Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 20:03                                     ` Linus Torvalds
@ 2015-09-22 20:36                                       ` Bjørn Mork
  2015-09-22 21:04                                         ` Linus Torvalds
  0 siblings, 1 reply; 55+ messages in thread
From: Bjørn Mork @ 2015-09-22 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Herbert Xu, David Miller, Cong Wang, Tom Herbert,
	kafai, kernel-team, Linux Kernel Mailing List,
	Network Development, Jiří Pírko, Nicolas Dichtel,
	Thomas Graf, Scott Feldman

Linus Torvalds <torvalds@linux-foundation.org> writes:

> The PPRO_FENCE_BUG thing is rather questionable. I'm not sure it's
> even documented, but I can't find the official PPro errata now. I
> think I discussed it with Andy Glew back when Intel was starting to
> document the memory ordering rules.
>
> I'd be willing to get rid of it. It was apparently just an early
> stepping or two.
>
> Anybody with better google-fu than me who can find the official Intel
> PPro errata?

http://download.intel.com/design/archives/processors/pro/docs/24268935.pdf

Says "NoFix" for erratas 66 and 92.


Bjørn

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 20:36                                       ` Bjørn Mork
@ 2015-09-22 21:04                                         ` Linus Torvalds
  0 siblings, 0 replies; 55+ messages in thread
From: Linus Torvalds @ 2015-09-22 21:04 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: Tejun Heo, Herbert Xu, David Miller, Cong Wang, Tom Herbert,
	kafai, kernel-team, Linux Kernel Mailing List,
	Network Development, Jiří Pírko, Nicolas Dichtel,
	Thomas Graf, Scott Feldman

On Tue, Sep 22, 2015 at 1:36 PM, Bjørn Mork <bjorn@mork.no> wrote:
>
> http://download.intel.com/design/archives/processors/pro/docs/24268935.pdf
>
> Says "NoFix" for erratas 66 and 92.

Yeah, 66 and 92 do look like they could cause the apparent ordering of
accesses to be violated. That said, both of them <i>seem</i> to be
"processor had exclusive access to line A, and gave it away but ended
up still reading now-stale data".

And that's not what we use "smp_wmb()" or "smp_rmb()" to protect
against. If we did a write and then wanted to do an ordered read, we'd
use smp_mb(), which always does that barrier.

So I don't know whether either of those really merit our PPRO
workaround. Cache coherency is hard.

There's also errata 41, which looks like it would be a bad situation.

                 Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22 16:10                           ` Tejun Heo
  2015-09-22 18:42                             ` Linus Torvalds
@ 2015-09-23  6:13                             ` Herbert Xu
  2015-09-23 15:54                               ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-23  6:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Tue, Sep 22, 2015 at 12:10:56PM -0400, Tejun Heo wrote:
>
> That's a pentium pro era errata.  Virtually no working machine is
> affected by that anymore and nobody builds kernel with that option.
> In most cases, store_release and load_acquire are cheaper as they're
> more specific.  On x86, store_release and load_acquire boil down to
> compiler reordering barriers.  You're running in the opposite
> direction.

It doesn't matter on x86 but the semantics of smp_load_acquire
and smp_store_release explicitly includes a full barrier which
is something that we don't need.

> I mean, read this thread.  It's one subtle breakage after another, one
> confusion after another.  The problematic usages of memory barriers
> are usually of two types.

smp_load_acquire/store_release isn't some kind of magic dust
that you can just spread around to fix races.  Whoever is writing
the code had better understood what the code is doing or they will
end up creating more bugs.

Having said that, I very much appreciate the work you have done
in reviewing my patches and pointing out holes in them.  Please
continue to do so and I will look at any real issues that you
discover.

> 1. Misunderstand what barriers do and fail to, most frequently, pair
>    the barriers correctly.  This leads to things like lone smp_wmb()s
>    which don't do anything but provide false sense of security.

smp_store_release can give you exactly the same kind of false
sense of security.

> 2. The usage itself is correct but not properly documented and it's
>    not clear what's being synchronized.  Because there's nothing
>    inherently pairing the matching barrier pairs, w/o proper
>    documentation, it can be very challenging to track down what is
>    being synchronized making it difficult to tell this case from 1.
>    Note that this is another reason smp_store_release() and
>    smp_load_acquire() are just better.  Their specificity not only
>    makes them lighter but also makes it a lot easier to track down
>    what's going on.

Well if there is anything unclear in my patch please point them out
and I will rewrite and/or add comments where necessary.

> > The reason this one is OK is because we do not use nlk->portid or
> > try to get nlk from the hash table before we return to user-space.
> 
> What happens if somebody later adds code below that which happens to
> use portid?  You're creating a booby trap and the patch isn't even
> properly documenting what's going on.

The same thing can still happen even if you use smp_load_acquire.
Someone may add a read prior to it or add a second smp_load_acquire
and then screw up the logic as we have seen in netlink_bind.

As I said whoever is touching these lockless code paths need to
understand what is going on.  There is no easy way out.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-23  6:13                             ` Herbert Xu
@ 2015-09-23 15:54                               ` Tejun Heo
  2015-09-24  2:30                                 ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-23 15:54 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Wed, Sep 23, 2015 at 02:13:42PM +0800, Herbert Xu wrote:
> It doesn't matter on x86 but the semantics of smp_load_acquire
> and smp_store_release explicitly includes a full barrier which
> is something that we don't need.

Yeah, I was confused there.  I was thinking smp_rmb() was still doing
rmb() on x86.  smp_rmb/wmb() is the better pick here.

> > I mean, read this thread.  It's one subtle breakage after another, one
> > confusion after another.  The problematic usages of memory barriers
> > are usually of two types.
> 
> smp_load_acquire/store_release isn't some kind of magic dust
> that you can just spread around to fix races.  Whoever is writing
> the code had better understood what the code is doing or they will
> end up creating more bugs.

Hmm... lemme try again.  When using barriers or RCU, it's desirable to
establish certain invariants because it usually is extremely easy to
miss corner cases.  It is helpful to have an abstraction, even if just
conceptual, where people can go "this thing is barrier / RCU protected
to guarantee XYZ".  Going more towards RCU example, this is why we
annotate variables as RCU protected to detect incorrect usages.  There
sure are exceptions but most are of the sort "this is write path and
protected by something else which is annotated differently".  Doing
things this way makes it a lot easier to get right.

Now, going back to barrier example.  If we take a similar approach,
there can be two cases.

1. Read of the boolean is barrier protected.  It's known that the
   condition that the boolean indicates is visible as true once the
   test succeeds.

2. Directly test the boolean inside write locking.  As the whole thing
   is protected by the same lock, we know that the indicated condition
   always matches what's visible.

In a lot of cases, separating out 2 doesn't matter all that much and
we sometimes skip it.  If we do that, either the protection should be
fairly obvious or, if there are multiple of those and the code path
isn't quite trivial, a different helper with extra lockdep annotation
can be used.

No matter what we do, please realize that we keep the invariant that
once the test evaulates to true the visible state matches what's
expected from such result.

What you're proposing introduces a third case in the above scenario
where how the test is performed becomes dependent on not only the
context of the test (which often can be annotated) but also the
following code does with the result of the test.  There are
exceptional cases where this makes sense but it's inherently hairy and
we only do things like that only if it's absoultely necessary and with
a lot of caution.  That isn't the case here.

You said that barriers aren't magic bullets.  Exactly, they aren't
anything special and whatever we do with them should be a reasonable
trade-off.  We've had incorrect and/or incomprehensible barrier usages
but that's not because people weren't going to 11 with scrutinizing
each deref and mixing barriered and non-barriered accesses.  If
anything, doing things like that without good enough rationale and
documentation (both are lacking here) is likely to lead to code base
which is difficult to comprehend and easy to get wrong.

> Having said that, I very much appreciate the work you have done
> in reviewing my patches and pointing out holes in them.  Please
> continue to do so and I will look at any real issues that you
> discover.

It isn't an accident that so many attempts in this thread were
errorneous.  You are taking the wrong approach and it's gonna cause
the same kind of failures in the future and there is no actual benefit
we're getting out of it.

> The same thing can still happen even if you use smp_load_acquire.
> Someone may add a read prior to it or add a second smp_load_acquire
> and then screw up the logic as we have seen in netlink_bind.

So, you're worrying about the following?

	portid = nlk->portid;
	if (nlk_bound(nlk)) {
		// use portid
	}

How is that even comparable?

> As I said whoever is touching these lockless code paths need to
> understand what is going on.  There is no easy way out.

There is a pretty wide gap between "no easy way out" and "lemme make
this as painful as possible so that people including myself get it
wrong most of the time for no good reason".

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-23 15:54                               ` Tejun Heo
@ 2015-09-24  2:30                                 ` Herbert Xu
  2015-09-24  2:46                                   ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-24  2:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Wed, Sep 23, 2015 at 11:54:40AM -0400, Tejun Heo wrote:
> 
> Hmm... lemme try again.  When using barriers or RCU, it's desirable to
> establish certain invariants because it usually is extremely easy to
> miss corner cases.  It is helpful to have an abstraction, even if just
> conceptual, where people can go "this thing is barrier / RCU protected
> to guarantee XYZ".  Going more towards RCU example, this is why we
> annotate variables as RCU protected to detect incorrect usages.  There
> sure are exceptions but most are of the sort "this is write path and
> protected by something else which is annotated differently".  Doing
> things this way makes it a lot easier to get right.

Well if someone provided helpers which

1) uses smp_wmb and smp_rmb instead of full barriers;
2) provides raw variants for the cases that barriers aren't needed

then I'm more than happy to use them.

Having reviewed the situation again I'm even more convincend
now that smp_load_acquire/smp_store_release aren't the appropriate
primitives for us.  They are meant for situations that are similar
to spin lock/unlock where you need to prevent all reads/writes from
floating above or below the load/store, respectively.

For our situation we only need write or read ordering, so they are
literally the wrong tool for the job and will only cause confusion
in future when someone tries to do a major rewrite of the code and
they will be scratching their head as to why we needed locking-like
semantics here.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  2:30                                 ` Herbert Xu
@ 2015-09-24  2:46                                   ` Tejun Heo
  2015-09-24  2:54                                     ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-24  2:46 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Thu, Sep 24, 2015 at 10:30:11AM +0800, Herbert Xu wrote:
> Well if someone provided helpers which
> 
> 1) uses smp_wmb and smp_rmb instead of full barriers;

This part is fine.

> 2) provides raw variants for the cases that barriers aren't needed

Hmm... It looks like I'm failing at communicating.  Lemme try again.
There are two situations where we do this.

1. When there are different locking contexts.  In this case, the write
   path is.  It's already protected by the spinlock so the barrier
   isn't necessary.

2. When the path is hot enough for the cost of smp_rmb() to matter and
   the specifics of individual deref allows for micro optimization and
   justifies the added overhead in terms of increased fragility,
   complexity and need for documentation.

In both cases, we want to make reasonable trade-offs like any other
choices we make.  We don't go off and run to one extreme or the other
just because barriers are involved.  One good measure to use is
whether the extra documentation necessary is justifiable.  In this
case, on each unprotected derefs, we want to explain why the
unprotected deref is okay and justified.

> then I'm more than happy to use them.
> 
> Having reviewed the situation again I'm even more convincend
> now that smp_load_acquire/smp_store_release aren't the appropriate
> primitives for us.  They are meant for situations that are similar
> to spin lock/unlock where you need to prevent all reads/writes from
> floating above or below the load/store, respectively.
>
> For our situation we only need write or read ordering, so they are
> literally the wrong tool for the job and will only cause confusion
> in future when someone tries to do a major rewrite of the code and
> they will be scratching their head as to why we needed locking-like
> semantics here.

store_release/load_acquire vs. wmb/rmb is a separate issue.  I no
longer have objections against using wmb/rmb pairs here although I do
wanna note that eventually I think release/acquire are likely to be
more prevalent but that's a separate discussion.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  2:46                                   ` Tejun Heo
@ 2015-09-24  2:54                                     ` Herbert Xu
  2015-09-24  3:06                                       ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-24  2:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Wed, Sep 23, 2015 at 10:46:08PM -0400, Tejun Heo wrote:
> 
> Hmm... It looks like I'm failing at communicating.  Lemme try again.
> There are two situations where we do this.
> 
> 1. When there are different locking contexts.  In this case, the write
>    path is.  It's already protected by the spinlock so the barrier
>    isn't necessary.
> 
> 2. When the path is hot enough for the cost of smp_rmb() to matter and
>    the specifics of individual deref allows for micro optimization and
>    justifies the added overhead in terms of increased fragility,
>    complexity and need for documentation.
> 
> In both cases, we want to make reasonable trade-offs like any other
> choices we make.  We don't go off and run to one extreme or the other
> just because barriers are involved.  One good measure to use is
> whether the extra documentation necessary is justifiable.  In this
> case, on each unprotected derefs, we want to explain why the
> unprotected deref is okay and justified.

What I am concerned about is the next guy who comes along and
does a rewrite like the one that introduced the netlink_bind
bug.  That person needs to fully understand what each primitive
is protecting against.

Using primitives where they're not needed can lead to misunderstandings
which may end up causing bugs.

Honestly I don't care whether you have a barrier there or not as
I only use x86.  But you very much should add a comment at least
saying that the barrier isn't needed for the cases where I left it
out so that future developers don't get confused.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  2:54                                     ` Herbert Xu
@ 2015-09-24  3:06                                       ` Tejun Heo
  2015-09-24  3:21                                         ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-24  3:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello,

On Thu, Sep 24, 2015 at 10:54:36AM +0800, Herbert Xu wrote:
> What I am concerned about is the next guy who comes along and
> does a rewrite like the one that introduced the netlink_bind
> bug.  That person needs to fully understand what each primitive
> is protecting against.
> 
> Using primitives where they're not needed can lead to misunderstandings
> which may end up causing bugs.

I think this is where we're not agreeing.  My point is that better
understanding and lower likelihood of bug doesn't equate specializing
each usage site.  That's a lot more likely to lead to unnecessary
cognition overhead and naturally errors.  There's no reason to require
such error-prone and specific understanding of each usage site when we
can have agreed-upon abstractions which yield invariants which are a
lot easier for people to wrap their heads around.

This isn't an isolated one-off barrier hack.  This is a
well-established pattern and sure there are cases we wanna deconstruct
that and make exceptions but that needs to be justifiable.  The
overhead gotta buy us something.  Here it just doesn't.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:06                                       ` Tejun Heo
@ 2015-09-24  3:21                                         ` Herbert Xu
  2015-09-24  3:29                                           ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-24  3:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Wed, Sep 23, 2015 at 11:06:09PM -0400, Tejun Heo wrote:
> 
> I think this is where we're not agreeing.  My point is that better
> understanding and lower likelihood of bug doesn't equate specializing
> each usage site.  That's a lot more likely to lead to unnecessary
> cognition overhead and naturally errors.  There's no reason to require
> such error-prone and specific understanding of each usage site when we
> can have agreed-upon abstractions which yield invariants which are a
> lot easier for people to wrap their heads around.

Well we'll have to agree to disagree on that one.  I have seen too
many instances over the years where people post patches that use
primitives such as RCU and think that they must be safe because
it compiles with no warnings (and probably even runs).

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:21                                         ` Herbert Xu
@ 2015-09-24  3:29                                           ` Tejun Heo
  2015-09-24  3:31                                             ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-24  3:29 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello,

On Thu, Sep 24, 2015 at 11:21:00AM +0800, Herbert Xu wrote:
> Well we'll have to agree to disagree on that one.  I have seen too
> many instances over the years where people post patches that use
> primitives such as RCU and think that they must be safe because
> it compiles with no warnings (and probably even runs).

So, while that also has been a common failure mode that we've been
seeing with barrier usages, what you're suggesting isn't the right
balance either.  It's error-prone in a different way as amply
exemplified in this very thread.  It ended up making what should have
been a straight-forward writer-reader interlocking into a maze in
which one can easily be lost.  I think you should be able to see that
after this thread.

Both misusages can be solved by understanding and sticking to
established patterns and making exceptions only when explicitly
justifiable and with ample explanation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:29                                           ` Tejun Heo
@ 2015-09-24  3:31                                             ` Herbert Xu
  2015-09-24  3:41                                               ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-24  3:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Wed, Sep 23, 2015 at 11:29:28PM -0400, Tejun Heo wrote:
> 
> So, while that also has been a common failure mode that we've been
> seeing with barrier usages, what you're suggesting isn't the right
> balance either.  It's error-prone in a different way as amply
> exemplified in this very thread.  It ended up making what should have
> been a straight-forward writer-reader interlocking into a maze in
> which one can easily be lost.  I think you should be able to see that
> after this thread.

No this isn't what happened.  My error was trying to see if there
is a way to do it without barriers.  In the end there wasn't.  This
has nothing to do with using primitives.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:31                                             ` Herbert Xu
@ 2015-09-24  3:41                                               ` Tejun Heo
  2015-09-24  3:42                                                 ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-24  3:41 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Thu, Sep 24, 2015 at 11:31:17AM +0800, Herbert Xu wrote:
> No this isn't what happened.  My error was trying to see if there
> is a way to do it without barriers.  In the end there wasn't.  This
> has nothing to do with using primitives.

Hmmm... yeah, you can say that, but it still was a failure to
recognize and apply the common pattern and what you're suggesting is
deviating for no good reason.  It demands a lot of cognitive overhead
for something which should be routine and makes the code a lot more
fragile as a result.  Things like this make barrier usages difficult
to understand and verify because it takes away a lot of ready-made
cognitive tools.  So, let's please stick to the known pattern.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:41                                               ` Tejun Heo
@ 2015-09-24  3:42                                                 ` Herbert Xu
  2015-09-24  3:43                                                   ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-24  3:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Wed, Sep 23, 2015 at 11:41:16PM -0400, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 11:31:17AM +0800, Herbert Xu wrote:
> > No this isn't what happened.  My error was trying to see if there
> > is a way to do it without barriers.  In the end there wasn't.  This
> > has nothing to do with using primitives.
> 
> Hmmm... yeah, you can say that, but it still was a failure to
> recognize and apply the common pattern and what you're suggesting is
> deviating for no good reason.  It demands a lot of cognitive overhead
> for something which should be routine and makes the code a lot more
> fragile as a result.  Things like this make barrier usages difficult
> to understand and verify because it takes away a lot of ready-made
> cognitive tools.  So, let's please stick to the known pattern.

Well I disagree so let's leave it at that.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:42                                                 ` Herbert Xu
@ 2015-09-24  3:43                                                   ` Tejun Heo
  2015-09-24  3:44                                                     ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-24  3:43 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Thu, Sep 24, 2015 at 11:42:14AM +0800, Herbert Xu wrote:
> Well I disagree so let's leave it at that.

Leaving things disagreed is fine but there's still a patch to commit
here, so I get that you're still dead against just applying the
pattern?

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24  3:43                                                   ` Tejun Heo
@ 2015-09-24  3:44                                                     ` Herbert Xu
  0 siblings, 0 replies; 55+ messages in thread
From: Herbert Xu @ 2015-09-24  3:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Wed, Sep 23, 2015 at 11:43:21PM -0400, Tejun Heo wrote:
> On Thu, Sep 24, 2015 at 11:42:14AM +0800, Herbert Xu wrote:
> > Well I disagree so let's leave it at that.
> 
> Leaving things disagreed is fine but there's still a patch to commit
> here, so I get that you're still dead against just applying the
> pattern?

Honestly I don't care anymore.  Feel free to do whatever you
want.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-22  3:38                         ` [PATCH v2] " Herbert Xu
  2015-09-22 16:10                           ` Tejun Heo
@ 2015-09-24 19:11                           ` David Miller
  2015-09-24 20:05                             ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: David Miller @ 2015-09-24 19:11 UTC (permalink / raw)
  To: herbert
  Cc: tj, cwang, tom, kafai, kernel-team, linux-kernel, netdev,
	torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 22 Sep 2015 11:38:56 +0800

> The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> Fix autobind race condition that leads to zero port ID") created
> some new races that can occur due to inconcsistencies between the
> two port IDs.
> 
> Tejun is right that a barrier is unavoidable.  Therefore I am
> reverting to the original patch that used a boolean to indicate
> that a user netlink socket has been bound.
> 
> Barriers have been added where necessary to ensure that a valid
> portid and the hashed socket is visible.
> 
> I have also changed netlink_insert to only return EBUSY if the
> socket is bound to a portid different to the requested one.  This
> combined with only reading nlk->bound once in netlink_bind fixes
> a race where two threads that bind the socket at the same time
> with different port IDs may both succeed.
> 
> Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
> Reported-by: Tejun Heo <tj@kernel.org>
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

I've decided to apply this and queue it up for -stable.

Thanks everyone.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2] netlink: Replace rhash_portid with bound
  2015-09-24 19:11                           ` David Miller
@ 2015-09-24 20:05                             ` Tejun Heo
  2015-09-25  1:43                               ` netlink: Add barrier to netlink_connect for theoretical case Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-24 20:05 UTC (permalink / raw)
  To: David Miller
  Cc: herbert, cwang, tom, kafai, kernel-team, linux-kernel, netdev,
	torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, David.

On Thu, Sep 24, 2015 at 12:11:42PM -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Tue, 22 Sep 2015 11:38:56 +0800
> 
> > The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
> > Fix autobind race condition that leads to zero port ID") created
> > some new races that can occur due to inconcsistencies between the
> > two port IDs.
...
> I've decided to apply this and queue it up for -stable.

This is mostly correct; however, if there are or can be in-kernel
users which create the client side of netlink socket, it isn't.  Let's
say such in-kernel user does kernel_connect() and then query the
assigned port number by kernel_getsockname().  That can just return
zero.  Maybe such scenario is not possible for some combination of
reasons but why leak this level of synchronization detail to the users
in the first place?  This should be terminated from the site where
such synchronization scheme is implemented.  This expands the scope of
correctness verification to all possible users of these functions.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* netlink: Add barrier to netlink_connect for theoretical case
  2015-09-24 20:05                             ` Tejun Heo
@ 2015-09-25  1:43                               ` Herbert Xu
  2015-09-25  3:24                                 ` Linus Torvalds
  2015-09-25 15:01                                 ` Tejun Heo
  0 siblings, 2 replies; 55+ messages in thread
From: Herbert Xu @ 2015-09-25  1:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Thu, Sep 24, 2015 at 04:05:10PM -0400, Tejun Heo wrote:
>
> > I've decided to apply this and queue it up for -stable.

Thanks Dave!

> This is mostly correct; however, if there are or can be in-kernel
> users which create the client side of netlink socket, it isn't.  Let's
> say such in-kernel user does kernel_connect() and then query the
> assigned port number by kernel_getsockname().  That can just return
> zero.  Maybe such scenario is not possible for some combination of
> reasons but why leak this level of synchronization detail to the users
> in the first place?  This should be terminated from the site where
> such synchronization scheme is implemented.  This expands the scope of
> correctness verification to all possible users of these functions.

Well had you said this in the first place I would've fixed it a
long time ago.  There aren't any in-kernel users right now and
even if there were they'd have to do a connect/bind/sendmsg on
the same socket in two threads at the same time.  But let's close
this theoretical hole:

---8<---
If a netlink_connect call is followed by a netlink_getname call
the portid returned may not be up-to-date.  This patch adds a
barrier for that case.

As all nlk->bound dereferences now have barriers this patch also
adds a netlink_bound helper to encapsulate the barrier, as was
suggested by Tejun.

Fixes: da314c9923fe ("netlink: Replace rhash_portid with bound")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2c15fae..02121e1 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -125,6 +125,15 @@ static inline u32 netlink_group_mask(u32 group)
 	return group ? 1 << (group - 1) : 0;
 }
 
+static inline bool netlink_bound(struct netlink_sock *nlk)
+{
+	/* Ensure nlk is hashed and visible. */
+	if (nlk->bound)
+		smp_rmb();
+
+	return nlk->bound;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -1524,14 +1533,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	bound = nlk->bound;
-	if (bound) {
-		/* Ensure nlk->portid is up-to-date. */
-		smp_rmb();
-
+	bound = netlink_bound(nlk);
+	if (bound)
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
-	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1547,9 +1552,6 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	/* No need for barriers here as we return to user-space without
-	 * using any of the bound attributes.
-	 */
 	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
@@ -1598,10 +1600,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	/* No need for barriers here as we return to user-space without
-	 * using any of the bound attributes.
-	 */
-	if (!nlk->bound)
+	if (!netlink_bound(nlk))
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2442,13 +2441,10 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->bound) {
+	if (!netlink_bound(nlk)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
-	} else {
-		/* Ensure nlk is hashed and visible. */
-		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: netlink: Add barrier to netlink_connect for theoretical case
  2015-09-25  1:43                               ` netlink: Add barrier to netlink_connect for theoretical case Herbert Xu
@ 2015-09-25  3:24                                 ` Linus Torvalds
  2015-09-25  3:39                                   ` Herbert Xu
  2015-09-25 15:01                                 ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Linus Torvalds @ 2015-09-25  3:24 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Tejun Heo, David Miller, Cong Wang, Tom Herbert, kafai,
	kernel-team, Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

On Thu, Sep 24, 2015 at 6:43 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Sep 24, 2015 at 04:05:10PM -0400, Tejun Heo wrote:
>
> +static inline bool netlink_bound(struct netlink_sock *nlk)
> +{
> +       /* Ensure nlk is hashed and visible. */
> +       if (nlk->bound)
> +               smp_rmb();
> +
> +       return nlk->bound;
> +}

The above looks very suspicious.

If "nlk->bound" isn't stable, then you might read 0 the first time,
not do the smp_rmb(), and then read 1 on the second access to
nlk->bound.

In other words, you just ended up returning 1 without actually doing
the mb, so there will be no serialization between the "bound" variable
and reading the portid afterwards.

That makes no sense.

And if nlk->bound *is* stable, then the smp_rmb() doesn't make any
sense that I can see.

So for the code to actually make sense, it should either do:

   int bound = nlk->bound;
   smp_rmb();
   return bound;

which is fine on x86, but might be expensive on other architectures
due to the unconditional rmb.

So you *could* write it with a conditional rmb, but then you need to
use a READ_ONCE(), to make sure that gcc really does the read exactly
once, because at that point the "rmb" no longer keeps gcc from playing
tricks. So

   int bound = READ_ONCE(nlk->bound);
   if (bound)
      smp_rmb();
   return bound;

could also be correct. Sadly, while "smp_rmb()" is a no-op on x86, it
*is* a barrier, so the above conditional smp_rmb() actually sucks on
x86, because I suspect that gcc will create a jump around an empty
asm. So the unconditional rmb is actually simpler better on at least
x86.

But the function as you wrote it does not make sense. When you do a
barrier, you really have to think about where the accesses are.

             Linus

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add barrier to netlink_connect for theoretical case
  2015-09-25  3:24                                 ` Linus Torvalds
@ 2015-09-25  3:39                                   ` Herbert Xu
  2015-09-25 15:09                                     ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-25  3:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, David Miller, Cong Wang, Tom Herbert, kafai,
	kernel-team, Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

On Thu, Sep 24, 2015 at 08:24:56PM -0700, Linus Torvalds wrote:
> 
> The above looks very suspicious.

You're right Linus.  I've added the READ_ONCE there.  The reason I
kept the conditional is because the helper is always called in a
context where the result is used as part of an if statement.  The
assembly actually looks sane, e.g., for netlink_bind:

    3a06:       41 0f b6 94 24 e8 02    movzbl 0x2e8(%r12),%edx
    3a0d:       00 00 
    3a0f:       84 d2                   test   %dl,%dl
    3a11:       0f 85 97 00 00 00       jne    3aae <netlink_bind+0x12e>
    3a17:       49 83 bc 24 98 03 00    cmpq   $0x0,0x398(%r12)
    3a1e:       00 00 

...

    3aae:       41 8b 84 24 a0 02 00    mov    0x2a0(%r12),%eax
    3ab5:       00 
    3ab6:       41 39 46 04             cmp    %eax,0x4(%r14)
    3aba:       0f 84 57 ff ff ff       je     3a17 <netlink_bind+0x97>
    3ac0:       b8 ea ff ff ff          mov    $0xffffffea,%eax
    3ac5:       eb d1                   jmp    3a98 <netlink_bind+0x118>

So there is no unnecessary jumping around.  I checked the other
two call sites and they look the same.  I'm on a fairly old compiler
though (4.7.2) so it is possible that newer gcc's may do silly things.

Thanks,

---8<---
If a netlink_connect call is followed by a netlink_getname call
the portid returned may not be up-to-date.  This patch adds a
barrier for that case.

As all nlk->bound dereferences now have barriers this patch also
adds a netlink_bound helper to encapsulate the barrier, as was
suggested by Tejun.  Also as suggested by Linus, the lockless
read of nlk->bound is now protected with READ_ONCE to ensure that
the compiler doesn't do double reads that may screw up our use
of the barrier.

Fixes: da314c9923fe ("netlink: Replace rhash_portid with bound")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2c15fae..dd0a294 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -125,6 +125,17 @@ static inline u32 netlink_group_mask(u32 group)
 	return group ? 1 << (group - 1) : 0;
 }
 
+static inline bool netlink_bound(struct netlink_sock *nlk)
+{
+	bool bound = READ_ONCE(nlk->bound);
+
+	/* Ensure nlk is hashed and visible. */
+	if (bound)
+		smp_rmb();
+
+	return bound;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -1524,14 +1535,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	bound = nlk->bound;
-	if (bound) {
-		/* Ensure nlk->portid is up-to-date. */
-		smp_rmb();
-
+	bound = netlink_bound(nlk);
+	if (bound)
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
-	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1547,9 +1554,6 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	/* No need for barriers here as we return to user-space without
-	 * using any of the bound attributes.
-	 */
 	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
@@ -1598,10 +1602,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
 	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
 		return -EPERM;
 
-	/* No need for barriers here as we return to user-space without
-	 * using any of the bound attributes.
-	 */
-	if (!nlk->bound)
+	if (!netlink_bound(nlk))
 		err = netlink_autobind(sock);
 
 	if (err == 0) {
@@ -2442,13 +2443,10 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->bound) {
+	if (!netlink_bound(nlk)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
-	} else {
-		/* Ensure nlk is hashed and visible. */
-		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped

-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: netlink: Add barrier to netlink_connect for theoretical case
  2015-09-25  1:43                               ` netlink: Add barrier to netlink_connect for theoretical case Herbert Xu
  2015-09-25  3:24                                 ` Linus Torvalds
@ 2015-09-25 15:01                                 ` Tejun Heo
  2015-09-26 13:16                                   ` netlink: Add netlink_bound helper and use it in netlink_getname Herbert Xu
  1 sibling, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-25 15:01 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Fri, Sep 25, 2015 at 09:43:27AM +0800, Herbert Xu wrote:
> Well had you said this in the first place I would've fixed it a
> long time ago.  There aren't any in-kernel users right now and
> even if there were they'd have to do a connect/bind/sendmsg on
> the same socket in two threads at the same time.  But let's close
> this theoretical hole:

I'm not even sure we guarantee memory barrier on kernel/user
crossings.  In practice, we probably have enough barriers (e.g. some
syscall traps imply barrier) but I can't think of a reason why we'd
guarantee the existence of barrier there.  As an extreme example,
imagine UML on an architecture with relaxed memory model.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add barrier to netlink_connect for theoretical case
  2015-09-25  3:39                                   ` Herbert Xu
@ 2015-09-25 15:09                                     ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2015-09-25 15:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Linus Torvalds, David Miller, Cong Wang, Tom Herbert, kafai,
	kernel-team, Linux Kernel Mailing List, Network Development,
	Jiří Pírko, Nicolas Dichtel, Thomas Graf,
	Scott Feldman

Hello, Herbert.

On Fri, Sep 25, 2015 at 11:39:57AM +0800, Herbert Xu wrote:
> +static inline bool netlink_bound(struct netlink_sock *nlk)
> +{
> +	bool bound = READ_ONCE(nlk->bound);
> +
> +	/* Ensure nlk is hashed and visible. */
> +	if (bound)
> +		smp_rmb();
> +
> +	return bound;
> +}

While I can't see anything wrong with the above, I'm not a fan of it
for whatever worth that may be.  I don't think it adds anything in
terms of readability or clarity of the code.  It does avoid smp_rmb()
when @bound is false but that's unlikely to be helfpul - where the
barrier is being avoided is a cold path.  This is largely a generic
characteristic because if where the barrier is being avoided is a hot
path, why wouldn't the code just grab a lock in that path instead of
using a gated barrier?  So, there's a reason why we don't see code
like the above commonly.  It doesn't buy us anything meaningful while
making the code more complicated and sometimes more fragile.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-25 15:01                                 ` Tejun Heo
@ 2015-09-26 13:16                                   ` Herbert Xu
  2015-09-26 18:09                                     ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-26 13:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Fri, Sep 25, 2015 at 11:01:13AM -0400, Tejun Heo wrote:
> 
> I'm not even sure we guarantee memory barrier on kernel/user
> crossings.  In practice, we probably have enough barriers (e.g. some
> syscall traps imply barrier) but I can't think of a reason why we'd
> guarantee the existence of barrier there.  As an extreme example,
> imagine UML on an architecture with relaxed memory model.

You misunderstood what I wrote.  I was not basing this on whether
user-space transitions contained a barrier, but on the fact that
the next syscall must recheck nlk->bound before using nlk->portid.

In fact thanks to your email I now realise that my fix to the
getsockname problem is wrong.  Instead of adding a barrier to
netlink_connect I should be adding a nlk->bound check to getname.

---8<---
netlink_getname must check nlk->bound before using nlk->portid
as otherwise nlk->portid may contain garbage.

This patch also adds a netlink_bound helper that encapsulate the
barrier, as was suggested by Tejun.  Also as suggested by Linus,
the lockless read of nlk->bound in netlink_bound is protected with
READ_ONCE to ensure that the compiler doesn't do double reads that
may screw up our use of the barrier.

Fixes: da314c9923fe ("netlink: Replace rhash_portid with bound")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2c15fae..c860464 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -125,6 +125,17 @@ static inline u32 netlink_group_mask(u32 group)
 	return group ? 1 << (group - 1) : 0;
 }
 
+static inline bool netlink_bound(struct netlink_sock *nlk)
+{
+	bool bound = READ_ONCE(nlk->bound);
+
+	/* Ensure nlk is hashed and visible. */
+	if (bound)
+		smp_rmb();
+
+	return bound;
+}
+
 int netlink_add_tap(struct netlink_tap *nt)
 {
 	if (unlikely(nt->dev->type != ARPHRD_NETLINK))
@@ -1524,14 +1535,10 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 			return err;
 	}
 
-	bound = nlk->bound;
-	if (bound) {
-		/* Ensure nlk->portid is up-to-date. */
-		smp_rmb();
-
+	bound = netlink_bound(nlk);
+	if (bound)
 		if (nladdr->nl_pid != nlk->portid)
 			return -EINVAL;
-	}
 
 	if (nlk->netlink_bind && groups) {
 		int group;
@@ -1547,9 +1554,6 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
 		}
 	}
 
-	/* No need for barriers here as we return to user-space without
-	 * using any of the bound attributes.
-	 */
 	if (!bound) {
 		err = nladdr->nl_pid ?
 			netlink_insert(sk, nladdr->nl_pid) :
@@ -1628,7 +1632,7 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
 		nladdr->nl_pid = nlk->dst_portid;
 		nladdr->nl_groups = netlink_group_mask(nlk->dst_group);
 	} else {
-		nladdr->nl_pid = nlk->portid;
+		nladdr->nl_pid = netlink_bound(nlk) ? nlk->portid : 0;
 		nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
 	}
 	return 0;
@@ -2442,13 +2446,10 @@ static int netlink_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		dst_group = nlk->dst_group;
 	}
 
-	if (!nlk->bound) {
+	if (!netlink_bound(nlk)) {
 		err = netlink_autobind(sock);
 		if (err)
 			goto out;
-	} else {
-		/* Ensure nlk is hashed and visible. */
-		smp_rmb();
 	}
 
 	/* It's a really convoluted way for userland to ask for mmaped
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 13:16                                   ` netlink: Add netlink_bound helper and use it in netlink_getname Herbert Xu
@ 2015-09-26 18:09                                     ` Tejun Heo
  2015-09-26 19:41                                       ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-26 18:09 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Sat, Sep 26, 2015 at 09:16:21PM +0800, Herbert Xu wrote:
> You misunderstood what I wrote.  I was not basing this on whether
> user-space transitions contained a barrier, but on the fact that
> the next syscall must recheck nlk->bound before using nlk->portid.

But that isn't what you wrote in the comment.

	/* No need for barriers here as we return to user-space without
	 * using any of the bound attributes.
	 */

> In fact thanks to your email I now realise that my fix to the
> getsockname problem is wrong.  Instead of adding a barrier to
> netlink_connect I should be adding a nlk->bound check to getname.

I don't know, man.  This thread almost feels surreal at this point.

> @@ -1628,7 +1632,7 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
>  		nladdr->nl_pid = nlk->dst_portid;
>  		nladdr->nl_groups = netlink_group_mask(nlk->dst_group);
>  	} else {
> -		nladdr->nl_pid = nlk->portid;
> +		nladdr->nl_pid = netlink_bound(nlk) ? nlk->portid : 0;
>  		nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
>  	}
>  	return 0;

So, this is really weird because netlink_getname() doens't participate
in the autobind race and thus it's perfectly fine for it to not worry
about whether ->bound is set or the memory barrier - whoever its
caller may be, the caller is of course responsible for ensuring that
the port is bound and visible if it expects to read back the number -
ie. if the caller doesn't know (in memory ordering sense) that
bind/connect/sendmsg succeeded, it of course can't expect to reliably
read back the port number.  getname never needed the barrier.  The
above is shifting synchronization from the source to its users.  This
is a bad thing to do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 18:09                                     ` Tejun Heo
@ 2015-09-26 19:41                                       ` Herbert Xu
  2015-09-26 19:45                                         ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-26 19:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Sat, Sep 26, 2015 at 02:09:03PM -0400, Tejun Heo wrote:
>
> > @@ -1628,7 +1632,7 @@ static int netlink_getname(struct socket *sock, struct sockaddr *addr,
> >  		nladdr->nl_pid = nlk->dst_portid;
> >  		nladdr->nl_groups = netlink_group_mask(nlk->dst_group);
> >  	} else {
> > -		nladdr->nl_pid = nlk->portid;
> > +		nladdr->nl_pid = netlink_bound(nlk) ? nlk->portid : 0;
> >  		nladdr->nl_groups = nlk->groups ? nlk->groups[0] : 0;
> >  	}
> >  	return 0;
> 
> So, this is really weird because netlink_getname() doens't participate
> in the autobind race and thus it's perfectly fine for it to not worry
> about whether ->bound is set or the memory barrier - whoever its
> caller may be, the caller is of course responsible for ensuring that
> the port is bound and visible if it expects to read back the number -
> ie. if the caller doesn't know (in memory ordering sense) that
> bind/connect/sendmsg succeeded, it of course can't expect to reliably
> read back the port number.  getname never needed the barrier.  The
> above is shifting synchronization from the source to its users.  This
> is a bad thing to do.

Thread 1			Thread 2
sendmsg				getsockname
	netlink_autobind		netlink_getname

Thread 2 should not have to do anything special to guarantee that
getsockname does not return garbage.  It must either be the bound
portid if the autobind completed in thread 1 and is visible or it
should return zero.

As it stands thread 2 may see a portid belonging to somebody else
if it catches the autobind in thread 1 trying different portids
while roving.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 19:41                                       ` Herbert Xu
@ 2015-09-26 19:45                                         ` Tejun Heo
  2015-09-26 19:49                                           ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-26 19:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello, Herbert.

On Sun, Sep 27, 2015 at 03:41:10AM +0800, Herbert Xu wrote:
> Thread 1			Thread 2
> sendmsg				getsockname
> 	netlink_autobind		netlink_getname
> 
> Thread 2 should not have to do anything special to guarantee that
> getsockname does not return garbage.  It must either be the bound
> portid if the autobind completed in thread 1 and is visible or it
> should return zero.
> 
> As it stands thread 2 may see a portid belonging to somebody else
> if it catches the autobind in thread 1 trying different portids
> while roving.

If the fact that thread 1 finished autobind isn't visible to thread 2,
it's valid for getsockname to return zero.  No ordering between the
two operations is defined.  If the fact that thread 1 finished
autobind is visible to thread 2, ordering is defined and because
ordering is transitive, by that very ordering, the port number is
visible to thread 2 too as long as thread 1 does proper barriering.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 19:45                                         ` Tejun Heo
@ 2015-09-26 19:49                                           ` Herbert Xu
  2015-09-26 19:52                                             ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-26 19:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Sat, Sep 26, 2015 at 03:45:54PM -0400, Tejun Heo wrote:
> Hello, Herbert.
> 
> On Sun, Sep 27, 2015 at 03:41:10AM +0800, Herbert Xu wrote:
> > Thread 1			Thread 2
> > sendmsg				getsockname
> > 	netlink_autobind		netlink_getname
> > 
> > Thread 2 should not have to do anything special to guarantee that
> > getsockname does not return garbage.  It must either be the bound
> > portid if the autobind completed in thread 1 and is visible or it
> > should return zero.
> > 
> > As it stands thread 2 may see a portid belonging to somebody else
> > if it catches the autobind in thread 1 trying different portids
> > while roving.
> 
> If the fact that thread 1 finished autobind isn't visible to thread 2,
> it's valid for getsockname to return zero.  No ordering between the
> two operations is defined.  If the fact that thread 1 finished
> autobind is visible to thread 2, ordering is defined and because
> ordering is transitive, by that very ordering, the port number is
> visible to thread 2 too as long as thread 1 does proper barriering.

If the autobind is not complete then netlink_getname must return
zero rather than some garbage portid that belongs to somebody
else's socket.  That's what we did before any of this lockless
code was introduced.

If you don't check nlk->bound then you may return garbage.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 19:49                                           ` Herbert Xu
@ 2015-09-26 19:52                                             ` Tejun Heo
  2015-09-26 19:55                                               ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-26 19:52 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello,

On Sun, Sep 27, 2015 at 03:49:16AM +0800, Herbert Xu wrote:
> If the autobind is not complete then netlink_getname must return
> zero rather than some garbage portid that belongs to somebody
> else's socket.  That's what we did before any of this lockless
> code was introduced.
> 
> If you don't check nlk->bound then you may return garbage.

Ah, yeah, you're right.  We need to check that there because it may
contain a garbage value.  I still think it'd better to use
netlink_bound() test in connect() too tho.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 19:52                                             ` Tejun Heo
@ 2015-09-26 19:55                                               ` Herbert Xu
  2015-09-26 20:05                                                 ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-26 19:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Sat, Sep 26, 2015 at 03:52:45PM -0400, Tejun Heo wrote:
> 
> Ah, yeah, you're right.  We need to check that there because it may
> contain a garbage value.  I still think it'd better to use
> netlink_bound() test in connect() too tho.

Well I disagree.  When I say that it returns to user-space I really
mean that the next time we use portid via the same call path that
triggered the connect we must be checking nlk->bound anyway.

Good luck finding more bugs in this code :)

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 19:55                                               ` Herbert Xu
@ 2015-09-26 20:05                                                 ` Tejun Heo
  2015-09-26 20:10                                                   ` Herbert Xu
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2015-09-26 20:05 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello,

On Sun, Sep 27, 2015 at 03:55:48AM +0800, Herbert Xu wrote:
> Well I disagree.  When I say that it returns to user-space I really
> mean that the next time we use portid via the same call path that
> triggered the connect we must be checking nlk->bound anyway.
> 
> Good luck finding more bugs in this code :)

Frankly, I don't understand what you've been trying to achieve.
You're actively disregarding best practices (like terminating
synchronization where it starts) and reach the target state by doing a
browian motion in the solution space.  Sure, if you do enough of that,
eventually you can arrive somewhere where it's not broken but it leads
to a lot more overhead for everyone involved - the author, reviewers
and later readers of the code and if you spread barrier usages like
this across the kernel, we'll end up with code base which is a lot
harder to verify and maintain.  I hope you stop doing things this way
but suppose that you're ignoring any conceptual arguments in this
thread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 20:05                                                 ` Tejun Heo
@ 2015-09-26 20:10                                                   ` Herbert Xu
  2015-09-26 20:17                                                     ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Herbert Xu @ 2015-09-26 20:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

On Sat, Sep 26, 2015 at 04:05:18PM -0400, Tejun Heo wrote:
> 
> Frankly, I don't understand what you've been trying to achieve.
> You're actively disregarding best practices (like terminating
> synchronization where it starts) and reach the target state by doing a
> browian motion in the solution space.  Sure, if you do enough of that,
> eventually you can arrive somewhere where it's not broken but it leads
> to a lot more overhead for everyone involved - the author, reviewers
> and later readers of the code and if you spread barrier usages like
> this across the kernel, we'll end up with code base which is a lot
> harder to verify and maintain.  I hope you stop doing things this way
> but suppose that you're ignoring any conceptual arguments in this
> thread.

Your point so far has been that if you add barriers or use primitives
everywhere then races magically disappear.

Well guess what the bug that you have discovered supposedly due to
a missing barrier in netlink_connect has nothing to do with the
barrier.  Instead it is caused by a logical error elsewhere that
would have gone unnoticed otherwise.

So I retain my position that blindly adding barriers do not make
bugs go away.  Instead you need to have real understanding of what
the code is doing and every spot where a barrier may be needed must
be audited manually.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: netlink: Add netlink_bound helper and use it in netlink_getname
  2015-09-26 20:10                                                   ` Herbert Xu
@ 2015-09-26 20:17                                                     ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2015-09-26 20:17 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, cwang, tom, kafai, kernel-team, linux-kernel,
	netdev, torvalds, jiri, nicolas.dichtel, tgraf, sfeldma

Hello,

On Sun, Sep 27, 2015 at 04:10:41AM +0800, Herbert Xu wrote:
> Well guess what the bug that you have discovered supposedly due to
> a missing barrier in netlink_connect has nothing to do with the
> barrier.  Instead it is caused by a logical error elsewhere that
> would have gone unnoticed otherwise.

It's a combination of two problems.  The garbage port number is a
logical error but there still is an ordering problem there between
->bound and ->portid.  We need to test ->bind there again because of
the garbage port problem.

> So I retain my position that blindly adding barriers do not make
> bugs go away.  Instead you need to have real understanding of what

That's a dishonest summary of what I've been saying.

> the code is doing and every spot where a barrier may be needed must
> be audited manually.

What I've been saying is that we do need to be careful and audit each
barrier usages but at the same time there are established patterns
that we can use to make the process significantly easier and more
reliable.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2015-09-26 20:18 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-17  2:29 Possible netlink autobind regression Tejun Heo
2015-09-17  3:08 ` Herbert Xu
2015-09-17  3:41   ` Herbert Xu
2015-09-17  5:02     ` Cong Wang
2015-09-17  5:15       ` Herbert Xu
2015-09-17 11:25         ` Thomas Graf
2015-09-17 11:30         ` Tejun Heo
2015-09-18  6:36           ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Herbert Xu
2015-09-18 11:16             ` [PATCH v4] " Herbert Xu
2015-09-21  5:55               ` David Miller
2015-09-21  6:06                 ` Herbert Xu
2015-09-21  6:11                   ` David Miller
2015-09-21 13:34                     ` netlink: Replace rhash_portid with bound Herbert Xu
2015-09-21 18:20                       ` Tejun Heo
2015-09-22  3:38                         ` [PATCH v2] " Herbert Xu
2015-09-22 16:10                           ` Tejun Heo
2015-09-22 18:42                             ` Linus Torvalds
2015-09-22 18:53                               ` Tejun Heo
2015-09-22 19:28                                 ` Linus Torvalds
2015-09-22 19:50                                   ` Tejun Heo
2015-09-22 20:03                                     ` Linus Torvalds
2015-09-22 20:36                                       ` Bjørn Mork
2015-09-22 21:04                                         ` Linus Torvalds
2015-09-23  6:13                             ` Herbert Xu
2015-09-23 15:54                               ` Tejun Heo
2015-09-24  2:30                                 ` Herbert Xu
2015-09-24  2:46                                   ` Tejun Heo
2015-09-24  2:54                                     ` Herbert Xu
2015-09-24  3:06                                       ` Tejun Heo
2015-09-24  3:21                                         ` Herbert Xu
2015-09-24  3:29                                           ` Tejun Heo
2015-09-24  3:31                                             ` Herbert Xu
2015-09-24  3:41                                               ` Tejun Heo
2015-09-24  3:42                                                 ` Herbert Xu
2015-09-24  3:43                                                   ` Tejun Heo
2015-09-24  3:44                                                     ` Herbert Xu
2015-09-24 19:11                           ` David Miller
2015-09-24 20:05                             ` Tejun Heo
2015-09-25  1:43                               ` netlink: Add barrier to netlink_connect for theoretical case Herbert Xu
2015-09-25  3:24                                 ` Linus Torvalds
2015-09-25  3:39                                   ` Herbert Xu
2015-09-25 15:09                                     ` Tejun Heo
2015-09-25 15:01                                 ` Tejun Heo
2015-09-26 13:16                                   ` netlink: Add netlink_bound helper and use it in netlink_getname Herbert Xu
2015-09-26 18:09                                     ` Tejun Heo
2015-09-26 19:41                                       ` Herbert Xu
2015-09-26 19:45                                         ` Tejun Heo
2015-09-26 19:49                                           ` Herbert Xu
2015-09-26 19:52                                             ` Tejun Heo
2015-09-26 19:55                                               ` Herbert Xu
2015-09-26 20:05                                                 ` Tejun Heo
2015-09-26 20:10                                                   ` Herbert Xu
2015-09-26 20:17                                                     ` Tejun Heo
2015-09-21 20:52                       ` [PATCH] netlink: Replace rhash_portid with load_acquire protected boolean Tejun Heo
2015-09-18 13:37             ` [PATCH v3] netlink: Fix autobind race condition that leads to zero port ID Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).