All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 0/3] net: unix: fix use-after-free
@ 2015-10-09  4:15 Jason Baron
  2015-10-09  4:16 ` [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll() Jason Baron
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Jason Baron @ 2015-10-09  4:15 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

Hi,

These patches are against mainline, I can re-base to net-next, please
let me know.

They have been tested against: https://lkml.org/lkml/2015/9/13/195,
which causes the use-after-free quite quickly and here:
https://lkml.org/lkml/2015/10/2/693.

Thanks,

-Jason

v4:
-set UNIX_NOSPACE only if the peer socket has receive space

v3:
-beef up memory barrier comments in 3/3 (Peter Zijlstra)
-clean up unix_dgram_writable() function in 3/3 (Joe Perches)

Jason Baron (3):
  net: unix: fix use-after-free in unix_dgram_poll()
  net: unix: Convert gc_flags to flags
  net: unix: optimize wakeups in unix_dgram_recvmsg()

 include/net/af_unix.h |   4 +-
 net/unix/af_unix.c    | 124 ++++++++++++++++++++++++++++++++++++++++----------
 net/unix/garbage.c    |  12 ++---
 3 files changed, 108 insertions(+), 32 deletions(-)

-- 
2.6.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
  2015-10-09  4:15 [PATCH v4 0/3] net: unix: fix use-after-free Jason Baron
@ 2015-10-09  4:16 ` Jason Baron
  2015-10-09 14:38   ` Hannes Frederic Sowa
  2015-10-09  4:16 ` [PATCH v4 2/3] net: unix: Convert gc_flags to flags Jason Baron
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-10-09  4:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
queue associated with the socket s that we are poll'ing against, but also calls
sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
if we call poll()/select()/epoll() for the socket s, there are then
a couple of code paths in which the remote peer socket p and its associated
peer_wait queue can be freed before poll()/select()/epoll() have a chance
to remove themselves from the remote peer socket.

The way that remote peer socket can be freed are:

1. If s calls connect() to a connect to a new socket other than p, it will
drop its reference on p, and thus a close() on p will free it.

2. If we call close on p(), then a subsequent sendmsg() from s, will drop
the final reference to p, allowing it to be freed.

Address this issue, by reverting unix_dgram_poll() to only register with
the wait queue associated with s and register a callback with the remote peer
socket on connect() that will wake up the wait queue associated with s. If
scenarios 1 or 2 occur above we then simply remove the callback from the
remote peer. This then presents the expected semantics to poll()/select()/
epoll().

I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().

Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
DGRAM sockets").

Tested-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 32 +++++++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 4a167b3..9698aff 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -62,6 +62,7 @@ struct unix_sock {
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
+	wait_queue_t		wait;
 };
 #define unix_sk(__sk) ((struct unix_sock *)__sk)
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..f789423 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -420,6 +420,9 @@ static void unix_release_sock(struct sock *sk, int embrion)
 	skpair = unix_peer(sk);
 
 	if (skpair != NULL) {
+		if (sk->sk_type != SOCK_STREAM)
+			remove_wait_queue(&unix_sk(skpair)->peer_wait,
+					  &u->wait);
 		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
 			unix_state_lock(skpair);
 			/* No more writes */
@@ -636,6 +639,16 @@ static struct proto unix_proto = {
  */
 static struct lock_class_key af_unix_sk_receive_queue_lock_key;
 
+static int peer_wake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct unix_sock *u;
+
+	u = container_of(wait, struct unix_sock, wait);
+	wake_up_interruptible_sync_poll(sk_sleep(&u->sk), key);
+
+	return 0;
+}
+
 static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 {
 	struct sock *sk = NULL;
@@ -664,6 +677,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern)
 	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->readlock); /* single task reading lock */
 	init_waitqueue_head(&u->peer_wait);
+	init_waitqueue_func_entry(&u->wait, peer_wake);
 	unix_insert_socket(unix_sockets_unbound(sk), sk);
 out:
 	if (sk == NULL)
@@ -1030,7 +1044,11 @@ restart:
 	 */
 	if (unix_peer(sk)) {
 		struct sock *old_peer = unix_peer(sk);
+
+		remove_wait_queue(&unix_sk(old_peer)->peer_wait,
+				  &unix_sk(sk)->wait);
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);
 
 		if (other != old_peer)
@@ -1038,8 +1056,12 @@ restart:
 		sock_put(old_peer);
 	} else {
 		unix_peer(sk) = other;
+		add_wait_queue(&unix_sk(other)->peer_wait, &unix_sk(sk)->wait);
 		unix_state_double_unlock(sk, other);
 	}
+	/* New remote may have created write space for us */
+	wake_up_interruptible_sync_poll(sk_sleep(sk),
+					POLLOUT | POLLWRNORM | POLLWRBAND);
 	return 0;
 
 out_unlock:
@@ -1194,6 +1216,8 @@ restart:
 
 	sock_hold(sk);
 	unix_peer(newsk)	= sk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(sk)->peer_wait, &unix_sk(newsk)->wait);
 	newsk->sk_state		= TCP_ESTABLISHED;
 	newsk->sk_type		= sk->sk_type;
 	init_peercred(newsk);
@@ -1220,6 +1244,8 @@ restart:
 
 	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
 	unix_peer(sk)	= newsk;
+	if (sk->sk_type == SOCK_SEQPACKET)
+		add_wait_queue(&unix_sk(newsk)->peer_wait, &unix_sk(sk)->wait);
 
 	unix_state_unlock(sk);
 
@@ -1254,6 +1280,10 @@ static int unix_socketpair(struct socket *socka, struct socket *sockb)
 	sock_hold(skb);
 	unix_peer(ska) = skb;
 	unix_peer(skb) = ska;
+	if (ska->sk_type != SOCK_STREAM) {
+		add_wait_queue(&unix_sk(ska)->peer_wait, &unix_sk(skb)->wait);
+		add_wait_queue(&unix_sk(skb)->peer_wait, &unix_sk(ska)->wait);
+	}
 	init_peercred(ska);
 	init_peercred(skb);
 
@@ -1565,6 +1595,7 @@ restart:
 		unix_state_lock(sk);
 		if (unix_peer(sk) == other) {
 			unix_peer(sk) = NULL;
+			remove_wait_queue(&unix_sk(other)->peer_wait, &u->wait);
 			unix_state_unlock(sk);
 
 			unix_dgram_disconnected(sk, other);
@@ -2441,7 +2472,6 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	other = unix_peer_get(sk);
 	if (other) {
 		if (unix_peer(other) != sk) {
-			sock_poll_wait(file, &unix_sk(other)->peer_wait, wait);
 			if (unix_recvq_full(other))
 				writable = 0;
 		}
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 2/3] net: unix: Convert gc_flags to flags
  2015-10-09  4:15 [PATCH v4 0/3] net: unix: fix use-after-free Jason Baron
  2015-10-09  4:16 ` [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll() Jason Baron
@ 2015-10-09  4:16 ` Jason Baron
  2015-10-09  4:16 ` [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg() Jason Baron
  2015-10-11 11:55   ` David Miller
  3 siblings, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-10-09  4:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

Convert gc_flags to flags in perparation for the subsequent patch, which will
make use of a flag bit for a non-gc purpose.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/net/af_unix.h |  2 +-
 net/unix/garbage.c    | 12 ++++++------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 9698aff..6a4a345 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -58,7 +58,7 @@ struct unix_sock {
 	atomic_long_t		inflight;
 	spinlock_t		lock;
 	unsigned char		recursion_level;
-	unsigned long		gc_flags;
+	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index a73a226..39794d9 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -179,7 +179,7 @@ static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 					 * have been added to the queues after
 					 * starting the garbage collection
 					 */
-					if (test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
+					if (test_bit(UNIX_GC_CANDIDATE, &u->flags)) {
 						hit = true;
 
 						func(u);
@@ -246,7 +246,7 @@ static void inc_inflight_move_tail(struct unix_sock *u)
 	 * of the list, so that it's checked even if it was already
 	 * passed over
 	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
+	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->flags))
 		list_move_tail(&u->link, &gc_candidates);
 }
 
@@ -305,8 +305,8 @@ void unix_gc(void)
 		BUG_ON(total_refs < inflight_refs);
 		if (total_refs == inflight_refs) {
 			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__set_bit(UNIX_GC_CANDIDATE, &u->flags);
+			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 		}
 	}
 
@@ -332,7 +332,7 @@ void unix_gc(void)
 
 		if (atomic_long_read(&u->inflight) > 0) {
 			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->flags);
 			scan_children(&u->sk, inc_inflight_move_tail, NULL);
 		}
 	}
@@ -343,7 +343,7 @@ void unix_gc(void)
 	 */
 	while (!list_empty(&not_cycle_list)) {
 		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
+		__clear_bit(UNIX_GC_CANDIDATE, &u->flags);
 		list_move_tail(&u->link, &gc_inflight_list);
 	}
 
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
  2015-10-09  4:15 [PATCH v4 0/3] net: unix: fix use-after-free Jason Baron
  2015-10-09  4:16 ` [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll() Jason Baron
  2015-10-09  4:16 ` [PATCH v4 2/3] net: unix: Convert gc_flags to flags Jason Baron
@ 2015-10-09  4:16 ` Jason Baron
  2015-10-09  4:29   ` kbuild test robot
  2015-10-11 11:55   ` David Miller
  3 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-10-09  4:16 UTC (permalink / raw)
  To: davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 92 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
 	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
+#define UNIX_NOSPACE		2
 	struct socket_wq	peer_wq;
 	wait_queue_t		wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..05fbd00 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
 	return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
 
 	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);
 
+	set_bit(UNIX_NOSPACE, &u->flags);
+	/* Ensure that we either see space in the peer sk_receive_queue via the
+	 * unix_recvq_full() check below, or we receive a wakeup when it
+	 * empties. Pairs with the mb in unix_dgram_recvmsg().
+	 */
+	smp_mb__after_atomic();
 	sched = !sock_flag(other, SOCK_DEAD) &&
 		!(other->sk_shutdown & RCV_SHUTDOWN) &&
 		unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
 	if (unix_peer(other) != sk && unix_recvq_full(other)) {
 		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
-
-		timeo = unix_wait_for_peer(other, timeo);
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+			/* Ensure that we either see space in the peer
+			 * sk_receive_queue via the unix_recvq_full() check
+			 * below, or we receive a wakeup when it empties. This
+			 * makes sure that epoll ET triggers correctly. Pairs
+			 * with the mb in unix_dgram_recvmsg().
+			 */
+			smp_mb__after_atomic();
+			if (unix_recvq_full(other)) {
+				err = -EAGAIN;
+				goto out_unlock;
+			}
+		} else {
+			timeo = unix_wait_for_peer(other, timeo);
 
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;
 
-		goto restart;
+			goto restart;
+		}
 	}
 
 	if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}
 
-	wake_up_interruptible_sync_poll(&u->peer_wait,
-					POLLOUT | POLLWRNORM | POLLWRBAND);
+	/* Ensure that waiters on our sk->sk_receive_queue draining that check
+	 * via unix_recvq_full() either see space in the queue or get a wakeup
+	 * below. sk->sk_receive_queue is reduece by the __skb_recv_datagram()
+	 * call above. Pairs with the mb in unix_dgram_sendmsg(),
+	 *unix_dgram_poll(), and unix_wait_for_peer().
+	 */
+	smp_mb();
+	if (test_bit(UNIX_NOSPACE, &u->flags)) {
+		clear_bit(UNIX_NOSPACE, &u->flags);
+		wake_up_interruptible_sync_poll(&u->peer_wait,
+						POLLOUT | POLLWRNORM |
+						POLLWRBAND);
+	}
 
 	if (msg->msg_name)
 		unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,25 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	return mask;
 }
 
+static bool unix_dgram_writable(struct sock *sk, struct sock *other,
+				bool *other_nospace)
+{
+	*other_full = false;
+
+	if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
+		*other_full = true;
+		return false;
+	}
+
+	return unix_writable(sk);
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 				    poll_table *wait)
 {
 	struct sock *sk = sock->sk, *other;
-	unsigned int mask, writable;
+	unsigned int mask;
+	bool other_nospace;
 
 	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
@@ -2468,20 +2509,23 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
 		return mask;
 
-	writable = unix_writable(sk);
 	other = unix_peer_get(sk);
-	if (other) {
-		if (unix_peer(other) != sk) {
-			if (unix_recvq_full(other))
-				writable = 0;
-		}
-		sock_put(other);
-	}
-
-	if (writable)
+	if (unix_dgram_writable(sk, other, &other_nospace)) {
 		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
-	else
+	} else {
 		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+		if (other_nospace)
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+		/* Ensure that we either see space in the peer sk_receive_queue
+		 * via the unix_recvq_full() check below, or we receive a wakeup
+		 * when it empties. Pairs with the mb in unix_dgram_recvmsg().
+		 */
+		smp_mb__after_atomic();
+		if (unix_dgram_writable(sk, other, &other_nospace))
+			mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+	}
+	if (other)
+		sock_put(other);
 
 	return mask;
 }
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
  2015-10-09  4:16 ` [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg() Jason Baron
@ 2015-10-09  4:29   ` kbuild test robot
  2015-10-09 15:12     ` Jason Baron
  0 siblings, 1 reply; 16+ messages in thread
From: kbuild test robot @ 2015-10-09  4:29 UTC (permalink / raw)
  To: Jason Baron
  Cc: kbuild-all, davem, netdev, linux-kernel, minipli, normalperson,
	eric.dumazet, rweikusat, viro, davidel, dave, olivier, pageexec,
	torvalds, peterz, joe

[-- Attachment #1: Type: text/plain, Size: 1082 bytes --]

Hi Jason,

[auto build test ERROR on v4.3-rc3 -- if it's inappropriate base, please ignore]

config: x86_64-randconfig-i0-201540 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/unix/af_unix.c: In function 'unix_dgram_writable':
>> net/unix/af_unix.c:2465:3: error: 'other_full' undeclared (first use in this function)
     *other_full = false;
      ^
   net/unix/af_unix.c:2465:3: note: each undeclared identifier is reported only once for each function it appears in

vim +/other_full +2465 net/unix/af_unix.c

  2459		return mask;
  2460	}
  2461	
  2462	static bool unix_dgram_writable(struct sock *sk, struct sock *other,
  2463					bool *other_nospace)
  2464	{
> 2465		*other_full = false;
  2466	
  2467		if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
  2468			*other_full = true;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 23229 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
  2015-10-09  4:16 ` [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll() Jason Baron
@ 2015-10-09 14:38   ` Hannes Frederic Sowa
  2015-10-11 13:30     ` Rainer Weikusat
  2015-10-12 19:41     ` Jason Baron
  0 siblings, 2 replies; 16+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-09 14:38 UTC (permalink / raw)
  To: Jason Baron, davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

Hi,

Jason Baron <jbaron@akamai.com> writes:

> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
> queue associated with the socket s that we are poll'ing against, but also calls
> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
> if we call poll()/select()/epoll() for the socket s, there are then
> a couple of code paths in which the remote peer socket p and its associated
> peer_wait queue can be freed before poll()/select()/epoll() have a chance
> to remove themselves from the remote peer socket.
>
> The way that remote peer socket can be freed are:
>
> 1. If s calls connect() to a connect to a new socket other than p, it will
> drop its reference on p, and thus a close() on p will free it.
>
> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
> the final reference to p, allowing it to be freed.
>
> Address this issue, by reverting unix_dgram_poll() to only register with
> the wait queue associated with s and register a callback with the remote peer
> socket on connect() that will wake up the wait queue associated with s. If
> scenarios 1 or 2 occur above we then simply remove the callback from the
> remote peer. This then presents the expected semantics to poll()/select()/
> epoll().
>
> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
>
> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
> DGRAM sockets").
>
> Tested-by: Mathias Krause <minipli@googlemail.com>
> Signed-off-by: Jason Baron <jbaron@akamai.com>

While I think this approach works, I haven't seen where the current code
leaks a reference. Assignment to unix_peer(sk) in general take spin_lock
and increment refcount. Are there bugs at the two places you referred
to?

Is an easier fix just to use atomic_inc_not_zero(&sk->sk_refcnt) in
unix_peer_get() which could also help other places?

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg()
  2015-10-09  4:29   ` kbuild test robot
@ 2015-10-09 15:12     ` Jason Baron
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Baron @ 2015-10-09 15:12 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, davem, netdev, linux-kernel, minipli, normalperson,
	eric.dumazet, rweikusat, viro, davidel, dave, olivier, pageexec,
	torvalds, peterz, joe

On 10/09/2015 12:29 AM, kbuild test robot wrote:
> Hi Jason,
> 
> [auto build test ERROR on v4.3-rc3 -- if it's inappropriate base, please ignore]
> 
> config: x86_64-randconfig-i0-201540 (attached as .config)
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>    net/unix/af_unix.c: In function 'unix_dgram_writable':
>>> net/unix/af_unix.c:2465:3: error: 'other_full' undeclared (first use in this function)
>      *other_full = false;
>       ^
>    net/unix/af_unix.c:2465:3: note: each undeclared identifier is reported only once for each function it appears in
> 


Forgot to refresh this patch before sending. The one that I tested with
is below.

Thanks,

-Jason




Now that connect() permanently registers a callback routine, we can induce
extra overhead in unix_dgram_recvmsg(), which unconditionally wakes up
its peer_wait queue on every receive. This patch makes the wakeup there
conditional on there being waiters.

Tested using: http://www.spinics.net/lists/netdev/msg145533.html

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/net/af_unix.h |  1 +
 net/unix/af_unix.c    | 92 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 6a4a345..cf21ffd 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -61,6 +61,7 @@ struct unix_sock {
 	unsigned long		flags;
 #define UNIX_GC_CANDIDATE	0
 #define UNIX_GC_MAYBE_CYCLE	1
+#define UNIX_NOSPACE		2
 	struct socket_wq	peer_wq;
 	wait_queue_t		wait;
 };
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index f789423..ac9bcd8 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -326,7 +326,7 @@ found:
 	return s;
 }
 
-static inline int unix_writable(struct sock *sk)
+static inline bool unix_writable(struct sock *sk)
 {
 	return (atomic_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
 }
@@ -1079,6 +1079,12 @@ static long unix_wait_for_peer(struct sock *other, long timeo)
 
 	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);
 
+	set_bit(UNIX_NOSPACE, &u->flags);
+	/* Ensure that we either see space in the peer sk_receive_queue via the
+	 * unix_recvq_full() check below, or we receive a wakeup when it
+	 * empties. Pairs with the mb in unix_dgram_recvmsg().
+	 */
+	smp_mb__after_atomic();
 	sched = !sock_flag(other, SOCK_DEAD) &&
 		!(other->sk_shutdown & RCV_SHUTDOWN) &&
 		unix_recvq_full(other);
@@ -1623,17 +1629,27 @@ restart:
 
 	if (unix_peer(other) != sk && unix_recvq_full(other)) {
 		if (!timeo) {
-			err = -EAGAIN;
-			goto out_unlock;
-		}
-
-		timeo = unix_wait_for_peer(other, timeo);
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+			/* Ensure that we either see space in the peer
+			 * sk_receive_queue via the unix_recvq_full() check
+			 * below, or we receive a wakeup when it empties. This
+			 * makes sure that epoll ET triggers correctly. Pairs
+			 * with the mb in unix_dgram_recvmsg().
+			 */
+			smp_mb__after_atomic();
+			if (unix_recvq_full(other)) {
+				err = -EAGAIN;
+				goto out_unlock;
+			}
+		} else {
+			timeo = unix_wait_for_peer(other, timeo);
 
-		err = sock_intr_errno(timeo);
-		if (signal_pending(current))
-			goto out_free;
+			err = sock_intr_errno(timeo);
+			if (signal_pending(current))
+				goto out_free;
 
-		goto restart;
+			goto restart;
+		}
 	}
 
 	if (sock_flag(other, SOCK_RCVTSTAMP))
@@ -1939,8 +1955,19 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
 		goto out_unlock;
 	}
 
-	wake_up_interruptible_sync_poll(&u->peer_wait,
-					POLLOUT | POLLWRNORM | POLLWRBAND);
+	/* Ensure that waiters on our sk->sk_receive_queue draining that check
+	 * via unix_recvq_full() either see space in the queue or get a wakeup
+	 * below. sk->sk_receive_queue is reduece by the __skb_recv_datagram()
+	 * call above. Pairs with the mb in unix_dgram_sendmsg(),
+	 *unix_dgram_poll(), and unix_wait_for_peer().
+	 */
+	smp_mb();
+	if (test_bit(UNIX_NOSPACE, &u->flags)) {
+		clear_bit(UNIX_NOSPACE, &u->flags);
+		wake_up_interruptible_sync_poll(&u->peer_wait,
+						POLLOUT | POLLWRNORM |
+						POLLWRBAND);
+	}
 
 	if (msg->msg_name)
 		unix_copy_addr(msg, skb->sk);
@@ -2432,11 +2459,25 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	return mask;
 }
 
+static bool unix_dgram_writable(struct sock *sk, struct sock *other,
+				bool *other_nospace)
+{
+	*other_nospace = false;
+
+	if (other && unix_peer(other) != sk && unix_recvq_full(other)) {
+		*other_nospace = true;
+		return false;
+	}
+
+	return unix_writable(sk);
+}
+
 static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 				    poll_table *wait)
 {
 	struct sock *sk = sock->sk, *other;
-	unsigned int mask, writable;
+	unsigned int mask;
+	bool other_nospace;
 
 	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
@@ -2468,20 +2509,23 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	if (!(poll_requested_events(wait) & (POLLWRBAND|POLLWRNORM|POLLOUT)))
 		return mask;
 
-	writable = unix_writable(sk);
 	other = unix_peer_get(sk);
-	if (other) {
-		if (unix_peer(other) != sk) {
-			if (unix_recvq_full(other))
-				writable = 0;
-		}
-		sock_put(other);
-	}
-
-	if (writable)
+	if (unix_dgram_writable(sk, other, &other_nospace)) {
 		mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
-	else
+	} else {
 		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+		if (other_nospace)
+			set_bit(UNIX_NOSPACE, &unix_sk(other)->flags);
+		/* Ensure that we either see space in the peer sk_receive_queue
+		 * via the unix_recvq_full() check below, or we receive a wakeup
+		 * when it empties. Pairs with the mb in unix_dgram_recvmsg().
+		 */
+		smp_mb__after_atomic();
+		if (unix_dgram_writable(sk, other, &other_nospace))
+			mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+	}
+	if (other)
+		sock_put(other);
 
 	return mask;
 }
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] net: unix: fix use-after-free
  2015-10-09  4:15 [PATCH v4 0/3] net: unix: fix use-after-free Jason Baron
@ 2015-10-11 11:55   ` David Miller
  2015-10-09  4:16 ` [PATCH v4 2/3] net: unix: Convert gc_flags to flags Jason Baron
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2015-10-11 11:55 UTC (permalink / raw)
  To: jbaron
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: Text/Plain; charset=iso-8859-7, Size: 954 bytes --]

From: Jason Baron <jbaron@akamai.com>
Date: Fri,  9 Oct 2015 00:15:59 -0400

> These patches are against mainline, I can re-base to net-next, please
> let me know.
> 
> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
> which causes the use-after-free quite quickly and here:
> https://lkml.org/lkml/2015/10/2/693.

I'd like to understand how patches that don't even compile can be
"tested"?

net/unix/af_unix.c: In function ¡unix_dgram_writable¢:
net/unix/af_unix.c:2480:3: error: ¡other_full¢ undeclared (first use in this function)
net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only once for each function it appears in

Could you explain how that works, I'm having a hard time understanding
this?

Also please address Hannes's feedback, thanks.
ÿôèº{.nÇ+‰·Ÿ®‰­†+%ŠËÿ±éݶ\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dʇڙë,j\a­¢f£¢·hšïêÿ‘êçz_è®\x03(­éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨è­Ú&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] net: unix: fix use-after-free
@ 2015-10-11 11:55   ` David Miller
  0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2015-10-11 11:55 UTC (permalink / raw)
  To: jbaron
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

From: Jason Baron <jbaron@akamai.com>
Date: Fri,  9 Oct 2015 00:15:59 -0400

> These patches are against mainline, I can re-base to net-next, please
> let me know.
> 
> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
> which causes the use-after-free quite quickly and here:
> https://lkml.org/lkml/2015/10/2/693.

I'd like to understand how patches that don't even compile can be
"tested"?

net/unix/af_unix.c: In function ‘unix_dgram_writable’:
net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this function)
net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only once for each function it appears in

Could you explain how that works, I'm having a hard time understanding
this?

Also please address Hannes's feedback, thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
  2015-10-09 14:38   ` Hannes Frederic Sowa
@ 2015-10-11 13:30     ` Rainer Weikusat
  2015-10-12 19:41     ` Jason Baron
  1 sibling, 0 replies; 16+ messages in thread
From: Rainer Weikusat @ 2015-10-11 13:30 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jason Baron, davem, netdev, linux-kernel, minipli, normalperson,
	eric.dumazet, rweikusat, viro, davidel, dave, olivier, pageexec,
	torvalds, peterz, joe

Hannes Frederic Sowa <hannes@stressinduktion.org> writes:
> Jason Baron <jbaron@akamai.com> writes:
>
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also calls
>> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
>> if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket p and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from the remote peer socket.
>>
>> The way that remote peer socket can be freed are:
>>
>> 1. If s calls connect() to a connect to a new socket other than p, it will
>> drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
>> the final reference to p, allowing it to be freed.
>>
>> Address this issue, by reverting unix_dgram_poll() to only register with
>> the wait queue associated with s and register a callback with the remote peer
>> socket on connect() that will wake up the wait queue associated with s. If
>> scenarios 1 or 2 occur above we then simply remove the callback from the
>> remote peer. This then presents the expected semantics to poll()/select()/
>> epoll().
>>
>> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
>> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
>>
>> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
>> DGRAM sockets").
>>
>> Tested-by: Mathias Krause <minipli@googlemail.com>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>
> While I think this approach works, I haven't seen where the current code
> leaks a reference.

It doesn't "leak a reference" (strictly). It possibly registers a wait queue with
whatever invoked the poll-routine which belongs to the peer socket of
the socket poll was called on. And the inherent problem with that is
that the lifetime of the peer socket is not necessarily the same as
the lifetime of the polled socket. If the polled socket is disconnected
from its peer while still being polled (or registered for being
polled), the former peer may be freed despite the polling code (of whatever
provenience) still references the peer_wait member of the unix socket
structure for this socket.

As pointed out in the original mail, two ways for this to happen is to
call connect on the polled socket or cause a unix_dgram_sendmsg call
on that after the peer socket was closed.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] net: unix: fix use-after-free
  2015-10-11 11:55   ` David Miller
  (?)
@ 2015-10-12 12:54   ` Rainer Weikusat
  2015-10-12 13:36     ` Eric Dumazet
  -1 siblings, 1 reply; 16+ messages in thread
From: Rainer Weikusat @ 2015-10-12 12:54 UTC (permalink / raw)
  To: David Miller
  Cc: jbaron, netdev, linux-kernel, minipli, normalperson,
	eric.dumazet, rweikusat, viro, davidel, dave, olivier, pageexec,
	torvalds, peterz, joe

David Miller <davem@davemloft.net> writes:
> From: Jason Baron <jbaron@akamai.com>
> Date: Fri,  9 Oct 2015 00:15:59 -0400
>
>> These patches are against mainline, I can re-base to net-next, please
>> let me know.
>> 
>> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
>> which causes the use-after-free quite quickly and here:
>> https://lkml.org/lkml/2015/10/2/693.
>
> I'd like to understand how patches that don't even compile can be
> "tested"?
>
> net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this function)
> net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only once for each function it appears in
>
> Could you explain how that works, I'm having a hard time understanding
> this?

This is basicallly a workaround for the problem that it's not possible
to tell epoll to let go of a certain wait queue: Instead of registering
the peer_wait queue via sock_poll_wait, a wait_queue_t under control of
the af_unix.c code is linked onto it which relays a wake up on the
peer_wait queue to the 'ordinary' wait queue associated with the polled
socket via custom wake function. But (at least the code I looked it) it
enqueues a unix socket on connect which has certain side effects (in
particular, /dev/log will have a seriously large wait queue of entirely
uninterested peers) and in many cases, this is simply not necessary, as
the additional peer_wait event is only interesting in case a peer of a
fan-in socket (like /dev/log) happens to be waiting for writeabilty via
poll/ select/ epoll/ ...

Since the wait queue handling code is now under control of the af_unix.c
code, it can remove itself from the peer_wait queue prior to dropping
its reference to a peer on disconnect or on detecting a dead peer in
unix_dgram_sendmsg.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] net: unix: fix use-after-free
  2015-10-12 12:54   ` Rainer Weikusat
@ 2015-10-12 13:36     ` Eric Dumazet
  0 siblings, 0 replies; 16+ messages in thread
From: Eric Dumazet @ 2015-10-12 13:36 UTC (permalink / raw)
  To: Rainer Weikusat
  Cc: David Miller, jbaron, netdev, linux-kernel, minipli,
	normalperson, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

On Mon, 2015-10-12 at 13:54 +0100, Rainer Weikusat wrote:
> David Miller <davem@davemloft.net> writes:
> > From: Jason Baron <jbaron@akamai.com>
> > Date: Fri,  9 Oct 2015 00:15:59 -0400
> >
> >> These patches are against mainline, I can re-base to net-next, please
> >> let me know.
> >> 
> >> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
> >> which causes the use-after-free quite quickly and here:
> >> https://lkml.org/lkml/2015/10/2/693.
> >
> > I'd like to understand how patches that don't even compile can be
> > "tested"?
> >
> > net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> > net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this function)
> > net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only once for each function it appears in
> >
> > Could you explain how that works, I'm having a hard time understanding
> > this?
> 
> This is basicallly a workaround for the problem that it's not possible
> to tell epoll to let go of a certain wait queue: Instead of registering
> the peer_wait queue via sock_poll_wait, a wait_queue_t under control of
> the af_unix.c code is linked onto it which relays a wake up on the
> peer_wait queue to the 'ordinary' wait queue associated with the polled
> socket via custom wake function. But (at least the code I looked it) it
> enqueues a unix socket on connect which has certain side effects (in
> particular, /dev/log will have a seriously large wait queue of entirely
> uninterested peers) and in many cases, this is simply not necessary, as
> the additional peer_wait event is only interesting in case a peer of a
> fan-in socket (like /dev/log) happens to be waiting for writeabilty via
> poll/ select/ epoll/ ...
> 
> Since the wait queue handling code is now under control of the af_unix.c
> code, it can remove itself from the peer_wait queue prior to dropping
> its reference to a peer on disconnect or on detecting a dead peer in
> unix_dgram_sendmsg.
> --

Okay, but David was asking how the patch was supposed to be tested, and
applied, if it does not compile.

A patch is not only showing the idea, but must be ready for inclusion.

Please ?



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
  2015-10-09 14:38   ` Hannes Frederic Sowa
  2015-10-11 13:30     ` Rainer Weikusat
@ 2015-10-12 19:41     ` Jason Baron
  2015-10-13 11:42       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-10-12 19:41 UTC (permalink / raw)
  To: Hannes Frederic Sowa, davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

On 10/09/2015 10:38 AM, Hannes Frederic Sowa wrote:
> Hi,
> 
> Jason Baron <jbaron@akamai.com> writes:
> 
>> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
>> queue associated with the socket s that we are poll'ing against, but also calls
>> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
>> if we call poll()/select()/epoll() for the socket s, there are then
>> a couple of code paths in which the remote peer socket p and its associated
>> peer_wait queue can be freed before poll()/select()/epoll() have a chance
>> to remove themselves from the remote peer socket.
>>
>> The way that remote peer socket can be freed are:
>>
>> 1. If s calls connect() to a connect to a new socket other than p, it will
>> drop its reference on p, and thus a close() on p will free it.
>>
>> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
>> the final reference to p, allowing it to be freed.
>>
>> Address this issue, by reverting unix_dgram_poll() to only register with
>> the wait queue associated with s and register a callback with the remote peer
>> socket on connect() that will wake up the wait queue associated with s. If
>> scenarios 1 or 2 occur above we then simply remove the callback from the
>> remote peer. This then presents the expected semantics to poll()/select()/
>> epoll().
>>
>> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
>> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
>>
>> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
>> DGRAM sockets").
>>
>> Tested-by: Mathias Krause <minipli@googlemail.com>
>> Signed-off-by: Jason Baron <jbaron@akamai.com>
> 
> While I think this approach works, I haven't seen where the current code
> leaks a reference. Assignment to unix_peer(sk) in general take spin_lock
> and increment refcount. Are there bugs at the two places you referred
> to?
> 
> Is an easier fix just to use atomic_inc_not_zero(&sk->sk_refcnt) in
> unix_peer_get() which could also help other places?
> 

Hi,

So we could potentially inc the refcnt on the remote peer such that the
remote peer does not free before the socket that has connected to it.
However, then the socket that has taken the reference against the peer
socket has to potentially record a number of remote sockets (all the ones
that it has connected to over its lifetime), and then drop all of their
refcnt's when it finally closes.

The reason for this is that with the current code when we do
poll()/select()/epoll() on a socket with a peer socket, those calls
take reference on the peer socket. Specifically, they record the remote
peer whead, such that they can remove their callbacks when they return.
So its not safe to just drop a reference on the remote peer when it
closes because their might be outstanding poll()/select()/epoll()
references pending.

Normally, poll()/select()/epoll() are waiting on a whead associated
directly with the fd/file that they are waiting for.

The other point here is that the way this patch structures things is
that when the socket connects to a new remote and hence disconnects from
an existing remote, POLLOUT events will continue to be correctly
delivered. That was not possible with the current structure of things
b/c there was no way to inform poll to re-register with the remote peer
whead. So, that means that the first test case here now works:

https://lkml.org/lkml/2015/10/4/154

Whereas with the old code test case would just hang for ever.

So yes there is a bit of code churn here, but I think it moves the
code-base in a direction that not only solves this issue, but corrects
additional poll() behaviors as well.

Thanks,

-Jason




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] net: unix: fix use-after-free
  2015-10-11 11:55   ` David Miller
  (?)
  (?)
@ 2015-10-12 19:50   ` Jason Baron
  2015-10-13  1:47     ` David Miller
  -1 siblings, 1 reply; 16+ messages in thread
From: Jason Baron @ 2015-10-12 19:50 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

On 10/11/2015 07:55 AM, David Miller wrote:
> From: Jason Baron <jbaron@akamai.com>
> Date: Fri,  9 Oct 2015 00:15:59 -0400
> 
>> These patches are against mainline, I can re-base to net-next, please
>> let me know.
>>
>> They have been tested against: https://lkml.org/lkml/2015/9/13/195,
>> which causes the use-after-free quite quickly and here:
>> https://lkml.org/lkml/2015/10/2/693.
> 

Hi,

> I'd like to understand how patches that don't even compile can be
> "tested"?
> 
> net/unix/af_unix.c: In function ‘unix_dgram_writable’:
> net/unix/af_unix.c:2480:3: error: ‘other_full’ undeclared (first use in this function)
> net/unix/af_unix.c:2480:3: note: each undeclared identifier is reported only once for each function it appears in
> 
> Could you explain how that works, I'm having a hard time understanding
> this?
> 

Traveling this week, so responses a bit delayed.

Yes, I screwed up the posting. I had some outstanding code in my
local tree to make it compile, but I failed to refresh my patch series
with this outstanding code before mailing it out. So what I tested/built
was not quite what I mailed out.

As soon as I noticed this issue in patch 3/3 I re-posted it here:

http://marc.info/?l=linux-netdev&m=144440355808472&w=2

in an attempt to avoid this confusion. I'm happy to re-post the series
or whatever makes things easiest for you.

> Also please address Hannes's feedback, thanks.
> 

I've replied directly to Hannes.

Thanks,

-Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 0/3] net: unix: fix use-after-free
  2015-10-12 19:50   ` Jason Baron
@ 2015-10-13  1:47     ` David Miller
  0 siblings, 0 replies; 16+ messages in thread
From: David Miller @ 2015-10-13  1:47 UTC (permalink / raw)
  To: jbaron
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

From: Jason Baron <jbaron@akamai.com>
Date: Mon, 12 Oct 2015 15:50:30 -0400

> I'm happy to re-post the series or whatever makes things easiest for
> you.

This is what you should always do, reposting a single patch within a
series is not sufficient.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll()
  2015-10-12 19:41     ` Jason Baron
@ 2015-10-13 11:42       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Frederic Sowa @ 2015-10-13 11:42 UTC (permalink / raw)
  To: Jason Baron, davem
  Cc: netdev, linux-kernel, minipli, normalperson, eric.dumazet,
	rweikusat, viro, davidel, dave, olivier, pageexec, torvalds,
	peterz, joe

Hello,

On Mon, Oct 12, 2015, at 21:41, Jason Baron wrote:
> On 10/09/2015 10:38 AM, Hannes Frederic Sowa wrote:
> > Hi,
> > 
> > Jason Baron <jbaron@akamai.com> writes:
> > 
> >> The unix_dgram_poll() routine calls sock_poll_wait() not only for the wait
> >> queue associated with the socket s that we are poll'ing against, but also calls
> >> sock_poll_wait() for a remote peer socket p, if it is connected. Thus,
> >> if we call poll()/select()/epoll() for the socket s, there are then
> >> a couple of code paths in which the remote peer socket p and its associated
> >> peer_wait queue can be freed before poll()/select()/epoll() have a chance
> >> to remove themselves from the remote peer socket.
> >>
> >> The way that remote peer socket can be freed are:
> >>
> >> 1. If s calls connect() to a connect to a new socket other than p, it will
> >> drop its reference on p, and thus a close() on p will free it.
> >>
> >> 2. If we call close on p(), then a subsequent sendmsg() from s, will drop
> >> the final reference to p, allowing it to be freed.
> >>
> >> Address this issue, by reverting unix_dgram_poll() to only register with
> >> the wait queue associated with s and register a callback with the remote peer
> >> socket on connect() that will wake up the wait queue associated with s. If
> >> scenarios 1 or 2 occur above we then simply remove the callback from the
> >> remote peer. This then presents the expected semantics to poll()/select()/
> >> epoll().
> >>
> >> I've implemented this for sock-type, SOCK_RAW, SOCK_DGRAM, and SOCK_SEQPACKET
> >> but not for SOCK_STREAM, since SOCK_STREAM does not use unix_dgram_poll().
> >>
> >> Introduced in commit ec0d215f9420 ("af_unix: fix 'poll for write'/connected
> >> DGRAM sockets").
> >>
> >> Tested-by: Mathias Krause <minipli@googlemail.com>
> >> Signed-off-by: Jason Baron <jbaron@akamai.com>
> > 
> > While I think this approach works, I haven't seen where the current code
> > leaks a reference. Assignment to unix_peer(sk) in general take spin_lock
> > and increment refcount. Are there bugs at the two places you referred
> > to?
> > 
> > Is an easier fix just to use atomic_inc_not_zero(&sk->sk_refcnt) in
> > unix_peer_get() which could also help other places?
> > 
> 
> Hi,
> 
> So we could potentially inc the refcnt on the remote peer such that the
> remote peer does not free before the socket that has connected to it.
> However, then the socket that has taken the reference against the peer
> socket has to potentially record a number of remote sockets (all the ones
> that it has connected to over its lifetime), and then drop all of their
> refcnt's when it finally closes.
> 
> The reason for this is that with the current code when we do
> poll()/select()/epoll() on a socket with a peer socket, those calls
> take reference on the peer socket. Specifically, they record the remote
> peer whead, such that they can remove their callbacks when they return.
> So its not safe to just drop a reference on the remote peer when it
> closes because their might be outstanding poll()/select()/epoll()
> references pending.

Thanks for the explanation, it was very helpful. The eventpoll
infrastructure seems not to be easily able to handle these kind of
socket cross references easily, I understand.

> Normally, poll()/select()/epoll() are waiting on a whead associated
> directly with the fd/file that they are waiting for.

Exactly. The reference count is implicit by the current process to
handle the filedescriptor and deregister the wait heads during program
tear-down or close. So a sock_poll_wait call to a foreign socket's wait
queue will confuse this subsystem.

> The other point here is that the way this patch structures things is
> that when the socket connects to a new remote and hence disconnects from
> an existing remote, POLLOUT events will continue to be correctly
> delivered. That was not possible with the current structure of things
> b/c there was no way to inform poll to re-register with the remote peer
> whead. So, that means that the first test case here now works:
> 
> https://lkml.org/lkml/2015/10/4/154
> 
> Whereas with the old code test case would just hang for ever.
> 
> So yes there is a bit of code churn here, but I think it moves the
> code-base in a direction that not only solves this issue, but corrects
> additional poll() behaviors as well.

Agreed, the new semantics make sense to me and are an improvement.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2015-10-13 11:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-09  4:15 [PATCH v4 0/3] net: unix: fix use-after-free Jason Baron
2015-10-09  4:16 ` [PATCH v4 1/3] net: unix: fix use-after-free in unix_dgram_poll() Jason Baron
2015-10-09 14:38   ` Hannes Frederic Sowa
2015-10-11 13:30     ` Rainer Weikusat
2015-10-12 19:41     ` Jason Baron
2015-10-13 11:42       ` Hannes Frederic Sowa
2015-10-09  4:16 ` [PATCH v4 2/3] net: unix: Convert gc_flags to flags Jason Baron
2015-10-09  4:16 ` [PATCH v4 3/3] net: unix: optimize wakeups in unix_dgram_recvmsg() Jason Baron
2015-10-09  4:29   ` kbuild test robot
2015-10-09 15:12     ` Jason Baron
2015-10-11 11:55 ` [PATCH v4 0/3] net: unix: fix use-after-free David Miller
2015-10-11 11:55   ` David Miller
2015-10-12 12:54   ` Rainer Weikusat
2015-10-12 13:36     ` Eric Dumazet
2015-10-12 19:50   ` Jason Baron
2015-10-13  1:47     ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.