bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf 0/2] sockmap fix for KASAN_VMALLOC and af_unix
@ 2023-10-16 19:08 John Fastabend
  2023-10-16 19:08 ` [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock John Fastabend
  2023-10-16 19:08 ` [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map John Fastabend
  0 siblings, 2 replies; 16+ messages in thread
From: John Fastabend @ 2023-10-16 19:08 UTC (permalink / raw)
  To: bpf, netdev; +Cc: yangyingliang, jakub, martin.lau, john.fastabend

The af_unix tests in sockmap_listen causes a splat from KASAN_VMALLOC.
Fix it here and include an extra test to catch case where both pairs
of the af_unix socket are included in a BPF sockmap.

John Fastabend (2):
  bpf: sockmap, af_unix sockets need to hold ref for pair sock
  bpf: sockmap, add af_unix test with both sockets in map

 include/linux/skmsg.h                         |  1 +
 include/net/af_unix.h                         |  1 +
 net/core/skmsg.c                              |  2 +
 net/unix/af_unix.c                            |  2 -
 net/unix/unix_bpf.c                           | 10 +++++
 .../selftests/bpf/prog_tests/sockmap_listen.c | 39 ++++++++++++++++---
 .../selftests/bpf/progs/test_sockmap_listen.c |  7 ++++
 7 files changed, 54 insertions(+), 8 deletions(-)

-- 
2.33.0


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-16 19:08 [PATCH bpf 0/2] sockmap fix for KASAN_VMALLOC and af_unix John Fastabend
@ 2023-10-16 19:08 ` John Fastabend
  2023-10-18 10:40   ` Jakub Sitnicki
  2023-11-06 12:35   ` Jakub Sitnicki
  2023-10-16 19:08 ` [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map John Fastabend
  1 sibling, 2 replies; 16+ messages in thread
From: John Fastabend @ 2023-10-16 19:08 UTC (permalink / raw)
  To: bpf, netdev; +Cc: yangyingliang, jakub, martin.lau, john.fastabend

AF_UNIX sockets are a paired socket. So sending on one of the pairs
will lookup the paired socket as part of the send operation. It is
possible however to put just one of the pairs in a BPF map. This
currently increments the refcnt on the sock in the sockmap to
ensure it is not free'd by the stack before sockmap cleans up its
state and stops any skbs being sent/recv'd to that socket.

But we missed a case. If the peer socket is closed it will be
free'd by the stack. However, the paired socket can still be
referenced from BPF sockmap side because we hold a reference
there. Then if we are sending traffic through BPF sockmap to
that socket it will try to dereference the free'd pair in its
send logic creating a use after free.  And following splat,

   [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
   [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
   [...]
   [59.905468] Call Trace:
   [59.905787]  <TASK>
   [59.906066]  dump_stack_lvl+0x130/0x1d0
   [59.908877]  print_report+0x16f/0x740
   [59.910629]  kasan_report+0x118/0x160
   [59.912576]  sk_wake_async+0x31/0x1b0
   [59.913554]  sock_def_readable+0x156/0x2a0
   [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
   [59.916398]  sock_sendmsg+0x20e/0x250
   [59.916854]  skb_send_sock+0x236/0xac0
   [59.920527]  sk_psock_backlog+0x287/0xaa0

To fix let BPF sockmap hold a refcnt on both the socket in the
sockmap and its paired socket.  It wasn't obvious how to contain
the fix to bpf_unix logic. The primarily problem with keeping this
logic in bpf_unix was: In the sock close() we could handle the
deref by having a close handler. But, when we are destroying the
psock through a map delete operation we wouldn't have gotten any
signal thorugh the proto struct other than it being replaced.
If we do the deref from the proto replace its too early because
we need to deref the skpair after the backlog worker has been
stopped.

Given all this it seems best to just cache it at the end of the
psock and eat 8B for the af_unix and vsock users.

Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |  1 +
 include/net/af_unix.h |  1 +
 net/core/skmsg.c      |  2 ++
 net/unix/af_unix.c    |  2 --
 net/unix/unix_bpf.c   | 10 ++++++++++
 5 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index c1637515a8a4..fbe628961cf8 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -106,6 +106,7 @@ struct sk_psock {
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
 	struct delayed_work		work;
+	struct sock			*skpair;
 	struct rcu_work			rwork;
 };
 
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 824c258143a3..49c4640027d8 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -75,6 +75,7 @@ struct unix_sock {
 };
 
 #define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk)
+#define unix_peer(sk) (unix_sk(sk)->peer)
 
 #define peer_wait peer_wq.wait
 
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6c31eefbd777..6236164b9bce 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -826,6 +826,8 @@ static void sk_psock_destroy(struct work_struct *work)
 
 	if (psock->sk_redir)
 		sock_put(psock->sk_redir);
+	if (psock->skpair)
+		sock_put(psock->skpair);
 	sock_put(psock->sk);
 	kfree(psock);
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..87dd723aacf9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -212,8 +212,6 @@ static inline bool unix_secdata_eq(struct scm_cookie *scm, struct sk_buff *skb)
 }
 #endif /* CONFIG_SECURITY_NETWORK */
 
-#define unix_peer(sk) (unix_sk(sk)->peer)
-
 static inline int unix_our_peer(struct sock *sk, struct sock *osk)
 {
 	return unix_peer(osk) == sk;
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index 2f9d8271c6ec..705eeed10be3 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -143,6 +143,8 @@ static void unix_stream_bpf_check_needs_rebuild(struct proto *ops)
 
 int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
+	struct sock *skpair;
+
 	if (sk->sk_type != SOCK_DGRAM)
 		return -EOPNOTSUPP;
 
@@ -152,6 +154,9 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 		return 0;
 	}
 
+	skpair = unix_peer(sk);
+	sock_hold(skpair);
+	psock->skpair = skpair;
 	unix_dgram_bpf_check_needs_rebuild(psock->sk_proto);
 	sock_replace_proto(sk, &unix_dgram_bpf_prot);
 	return 0;
@@ -159,12 +164,17 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
 
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 {
+	struct sock *skpair = unix_peer(sk);
+
 	if (restore) {
 		sk->sk_write_space = psock->saved_write_space;
 		sock_replace_proto(sk, psock->sk_proto);
 		return 0;
 	}
 
+	skpair = unix_peer(sk);
+	sock_hold(skpair);
+	psock->skpair = skpair;
 	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
 	sock_replace_proto(sk, &unix_stream_bpf_prot);
 	return 0;
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map
  2023-10-16 19:08 [PATCH bpf 0/2] sockmap fix for KASAN_VMALLOC and af_unix John Fastabend
  2023-10-16 19:08 ` [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock John Fastabend
@ 2023-10-16 19:08 ` John Fastabend
  2023-11-06 12:44   ` Jakub Sitnicki
  1 sibling, 1 reply; 16+ messages in thread
From: John Fastabend @ 2023-10-16 19:08 UTC (permalink / raw)
  To: bpf, netdev; +Cc: yangyingliang, jakub, martin.lau, john.fastabend

This adds a test where both pairs of a af_unix paired socket are put into
a BPF map. This ensures that when we tear down the af_unix pair we don't
have any issues on sockmap side with ordering and reference counting.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 .../selftests/bpf/prog_tests/sockmap_listen.c | 39 ++++++++++++++++---
 .../selftests/bpf/progs/test_sockmap_listen.c |  7 ++++
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
index 8df8cbb447f1..90e97907c1c1 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
@@ -1824,8 +1824,10 @@ static void inet_unix_skb_redir_to_connected(struct test_sockmap_listen *skel,
 	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
 }
 
-static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
-					int verd_mapfd, enum redir_mode mode)
+static void unix_inet_redir_to_connected(int family, int type,
+					int sock_mapfd, int nop_mapfd,
+					int verd_mapfd,
+					enum redir_mode mode)
 {
 	const char *log_prefix = redir_mode_str(mode);
 	int c0, c1, p0, p1;
@@ -1849,6 +1851,12 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
 	if (err)
 		goto close;
 
+	if (nop_mapfd >= 0) {
+		err = add_to_sockmap(nop_mapfd, c0, c1);
+		if (err)
+			goto close;
+	}
+
 	n = write(c1, "a", 1);
 	if (n < 0)
 		FAIL_ERRNO("%s: write", log_prefix);
@@ -1883,6 +1891,7 @@ static void unix_inet_skb_redir_to_connected(struct test_sockmap_listen *skel,
 					    struct bpf_map *inner_map, int family)
 {
 	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
+	int nop_map = bpf_map__fd(skel->maps.nop_map);
 	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
 	int sock_map = bpf_map__fd(inner_map);
 	int err;
@@ -1892,14 +1901,32 @@ static void unix_inet_skb_redir_to_connected(struct test_sockmap_listen *skel,
 		return;
 
 	skel->bss->test_ingress = false;
-	unix_inet_redir_to_connected(family, SOCK_DGRAM, sock_map, verdict_map,
+	unix_inet_redir_to_connected(family, SOCK_DGRAM,
+				     sock_map, -1, verdict_map,
+				     REDIR_EGRESS);
+	unix_inet_redir_to_connected(family, SOCK_DGRAM,
+				     sock_map, -1, verdict_map,
 				     REDIR_EGRESS);
-	unix_inet_redir_to_connected(family, SOCK_STREAM, sock_map, verdict_map,
+
+	unix_inet_redir_to_connected(family, SOCK_DGRAM,
+				     sock_map, nop_map, verdict_map,
+				     REDIR_EGRESS);
+	unix_inet_redir_to_connected(family, SOCK_STREAM,
+				     sock_map, nop_map, verdict_map,
 				     REDIR_EGRESS);
 	skel->bss->test_ingress = true;
-	unix_inet_redir_to_connected(family, SOCK_DGRAM, sock_map, verdict_map,
+	unix_inet_redir_to_connected(family, SOCK_DGRAM,
+				     sock_map, -1, verdict_map,
+				     REDIR_INGRESS);
+	unix_inet_redir_to_connected(family, SOCK_STREAM,
+				     sock_map, -1, verdict_map,
+				     REDIR_INGRESS);
+
+	unix_inet_redir_to_connected(family, SOCK_DGRAM,
+				     sock_map, nop_map, verdict_map,
 				     REDIR_INGRESS);
-	unix_inet_redir_to_connected(family, SOCK_STREAM, sock_map, verdict_map,
+	unix_inet_redir_to_connected(family, SOCK_STREAM,
+				     sock_map, nop_map, verdict_map,
 				     REDIR_INGRESS);
 
 	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
index 464d35bd57c7..b7250eb9c30c 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
@@ -14,6 +14,13 @@ struct {
 	__type(value, __u64);
 } sock_map SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 2);
+	__type(key, __u32);
+	__type(value, __u64);
+} nop_map SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_SOCKHASH);
 	__uint(max_entries, 2);
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-16 19:08 ` [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock John Fastabend
@ 2023-10-18 10:40   ` Jakub Sitnicki
  2023-10-24 21:39     ` John Fastabend
  2023-11-06 12:35   ` Jakub Sitnicki
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Sitnicki @ 2023-10-18 10:40 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau

On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
> AF_UNIX sockets are a paired socket. So sending on one of the pairs
> will lookup the paired socket as part of the send operation. It is
> possible however to put just one of the pairs in a BPF map. This
> currently increments the refcnt on the sock in the sockmap to
> ensure it is not free'd by the stack before sockmap cleans up its
> state and stops any skbs being sent/recv'd to that socket.
>
> But we missed a case. If the peer socket is closed it will be
> free'd by the stack. However, the paired socket can still be
> referenced from BPF sockmap side because we hold a reference
> there. Then if we are sending traffic through BPF sockmap to
> that socket it will try to dereference the free'd pair in its
> send logic creating a use after free.  And following splat,
>
>    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
>    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
>    [...]
>    [59.905468] Call Trace:
>    [59.905787]  <TASK>
>    [59.906066]  dump_stack_lvl+0x130/0x1d0
>    [59.908877]  print_report+0x16f/0x740
>    [59.910629]  kasan_report+0x118/0x160
>    [59.912576]  sk_wake_async+0x31/0x1b0
>    [59.913554]  sock_def_readable+0x156/0x2a0
>    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
>    [59.916398]  sock_sendmsg+0x20e/0x250
>    [59.916854]  skb_send_sock+0x236/0xac0
>    [59.920527]  sk_psock_backlog+0x287/0xaa0

Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
helper.

>
> To fix let BPF sockmap hold a refcnt on both the socket in the
> sockmap and its paired socket.  It wasn't obvious how to contain
> the fix to bpf_unix logic. The primarily problem with keeping this
> logic in bpf_unix was: In the sock close() we could handle the
> deref by having a close handler. But, when we are destroying the
> psock through a map delete operation we wouldn't have gotten any
> signal thorugh the proto struct other than it being replaced.
> If we do the deref from the proto replace its too early because
> we need to deref the skpair after the backlog worker has been
> stopped.
>
> Given all this it seems best to just cache it at the end of the
> psock and eat 8B for the af_unix and vsock users.
>
> Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-18 10:40   ` Jakub Sitnicki
@ 2023-10-24 21:39     ` John Fastabend
  2023-10-27 13:32       ` Jakub Sitnicki
  0 siblings, 1 reply; 16+ messages in thread
From: John Fastabend @ 2023-10-24 21:39 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau

Jakub Sitnicki wrote:
> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
> > AF_UNIX sockets are a paired socket. So sending on one of the pairs
> > will lookup the paired socket as part of the send operation. It is
> > possible however to put just one of the pairs in a BPF map. This
> > currently increments the refcnt on the sock in the sockmap to
> > ensure it is not free'd by the stack before sockmap cleans up its
> > state and stops any skbs being sent/recv'd to that socket.
> >
> > But we missed a case. If the peer socket is closed it will be
> > free'd by the stack. However, the paired socket can still be
> > referenced from BPF sockmap side because we hold a reference
> > there. Then if we are sending traffic through BPF sockmap to
> > that socket it will try to dereference the free'd pair in its
> > send logic creating a use after free.  And following splat,
> >
> >    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
> >    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
> >    [...]
> >    [59.905468] Call Trace:
> >    [59.905787]  <TASK>
> >    [59.906066]  dump_stack_lvl+0x130/0x1d0
> >    [59.908877]  print_report+0x16f/0x740
> >    [59.910629]  kasan_report+0x118/0x160
> >    [59.912576]  sk_wake_async+0x31/0x1b0
> >    [59.913554]  sock_def_readable+0x156/0x2a0
> >    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
> >    [59.916398]  sock_sendmsg+0x20e/0x250
> >    [59.916854]  skb_send_sock+0x236/0xac0
> >    [59.920527]  sk_psock_backlog+0x287/0xaa0
> 
> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
> helper.

It does by my read. In unix_stream_connect we have,

	sock_hold(sk);
	unix_peer(newsk)	= sk;
	newsk->sk_state		= TCP_ESTABLISHED;

where it assigns the peer sock. unix_dgram_connect() also calls
sock_hold() but through the path that does the socket lookup, such as
unix_find_other().

The problem I see is before the socket does the kfree on the
sock we need to be sure the backlog is canceled and the skb list
ingress_skb is purged. If we don't ensure this then the redirect
will 

My model is this,

         s1            c1
refcnt    1             1
connect   2             2
psock     3             3
send(s1) ...
close(s1) 2             1 <- close drops psock count also
close(c1) 0             0

The important bit here is the psock has a refcnt on the
underlying sock (psock->sk) and wont dec that until after
cancel_delayed_work_sync() completes. This ensures the
backlog wont try to sendmsg() on that sock after we free
it. We also check for SOCK_DEAD and abort to avoid sending
over a socket that has been marked DEAD.

So... After close(s1) the only thing keeping that sock
around is c1. Then we close(c1) that call path is

 unix_release
   close() 
   unix_release_sock()
     skpair = unix_peer(sk);
     ...
     sock_put(skpair);  <- trouble here

The release will call sock_put() on the pair socket and
dec it to 0 where it gets free'd through sk_free(). But
now the trouble is we haven't waited for cancel_delayed_work_sync()
on the c1 socket yet so backlog can still run. When it does
run it may try to send a pkg over socket s1. OK right up until
the sendmsg(s1, ...) does a peer lookup and derefs the peer
socket. The peer socket was free'd earlier so use after free. 

The question I had originally was this is odd, we are allowing
a sendmsg(s1) over a socket while its in unix_release(). We
used to take the sock lock from the backlog that was dropped
in the name of performance, but it creates these races.

Other fixes I considered. First adding sock lock back to
backlog. But that punishes the UDP and TCP cases that don't
have this problem. Set the SOCK_DEAD flag earlier or check
later but this just makes the race smaller doesn't really
eliminate it.

So this patch is what I came up with. 

> 
> >
> > To fix let BPF sockmap hold a refcnt on both the socket in the
> > sockmap and its paired socket.  It wasn't obvious how to contain
> > the fix to bpf_unix logic. The primarily problem with keeping this
> > logic in bpf_unix was: In the sock close() we could handle the
> > deref by having a close handler. But, when we are destroying the
> > psock through a map delete operation we wouldn't have gotten any
> > signal thorugh the proto struct other than it being replaced.
> > If we do the deref from the proto replace its too early because
> > we need to deref the skpair after the backlog worker has been
> > stopped.
> >
> > Given all this it seems best to just cache it at the end of the
> > psock and eat 8B for the af_unix and vsock users.
> >
> > Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> 
> [...]



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-24 21:39     ` John Fastabend
@ 2023-10-27 13:32       ` Jakub Sitnicki
  2023-10-27 17:38         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Sitnicki @ 2023-10-27 13:32 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau

On Tue, Oct 24, 2023 at 02:39 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
>> > AF_UNIX sockets are a paired socket. So sending on one of the pairs
>> > will lookup the paired socket as part of the send operation. It is
>> > possible however to put just one of the pairs in a BPF map. This
>> > currently increments the refcnt on the sock in the sockmap to
>> > ensure it is not free'd by the stack before sockmap cleans up its
>> > state and stops any skbs being sent/recv'd to that socket.
>> >
>> > But we missed a case. If the peer socket is closed it will be
>> > free'd by the stack. However, the paired socket can still be
>> > referenced from BPF sockmap side because we hold a reference
>> > there. Then if we are sending traffic through BPF sockmap to
>> > that socket it will try to dereference the free'd pair in its
>> > send logic creating a use after free.  And following splat,
>> >
>> >    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
>> >    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
>> >    [...]
>> >    [59.905468] Call Trace:
>> >    [59.905787]  <TASK>
>> >    [59.906066]  dump_stack_lvl+0x130/0x1d0
>> >    [59.908877]  print_report+0x16f/0x740
>> >    [59.910629]  kasan_report+0x118/0x160
>> >    [59.912576]  sk_wake_async+0x31/0x1b0
>> >    [59.913554]  sock_def_readable+0x156/0x2a0
>> >    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
>> >    [59.916398]  sock_sendmsg+0x20e/0x250
>> >    [59.916854]  skb_send_sock+0x236/0xac0
>> >    [59.920527]  sk_psock_backlog+0x287/0xaa0
>> 
>> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
>> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
>> helper.
>
> It does by my read. In unix_stream_connect we have,
>
> 	sock_hold(sk);
> 	unix_peer(newsk)	= sk;
> 	newsk->sk_state		= TCP_ESTABLISHED;
>
> where it assigns the peer sock. unix_dgram_connect() also calls
> sock_hold() but through the path that does the socket lookup, such as
> unix_find_other().
>
> The problem I see is before the socket does the kfree on the
> sock we need to be sure the backlog is canceled and the skb list
> ingress_skb is purged. If we don't ensure this then the redirect
> will 
>
> My model is this,
>
>          s1            c1
> refcnt    1             1
> connect   2             2
> psock     3             3
> send(s1) ...
> close(s1) 2             1 <- close drops psock count also
> close(c1) 0             0
>
> The important bit here is the psock has a refcnt on the
> underlying sock (psock->sk) and wont dec that until after
> cancel_delayed_work_sync() completes. This ensures the
> backlog wont try to sendmsg() on that sock after we free
> it. We also check for SOCK_DEAD and abort to avoid sending
> over a socket that has been marked DEAD.
>
> So... After close(s1) the only thing keeping that sock
> around is c1. Then we close(c1) that call path is
>
>  unix_release
>    close() 
>    unix_release_sock()
>      skpair = unix_peer(sk);
>      ...
>      sock_put(skpair);  <- trouble here
>
> The release will call sock_put() on the pair socket and
> dec it to 0 where it gets free'd through sk_free(). But
> now the trouble is we haven't waited for cancel_delayed_work_sync()
> on the c1 socket yet so backlog can still run. When it does
> run it may try to send a pkg over socket s1. OK right up until
> the sendmsg(s1, ...) does a peer lookup and derefs the peer
> socket. The peer socket was free'd earlier so use after free. 
>
> The question I had originally was this is odd, we are allowing
> a sendmsg(s1) over a socket while its in unix_release(). We
> used to take the sock lock from the backlog that was dropped
> in the name of performance, but it creates these races.
>
> Other fixes I considered. First adding sock lock back to
> backlog. But that punishes the UDP and TCP cases that don't
> have this problem. Set the SOCK_DEAD flag earlier or check
> later but this just makes the race smaller doesn't really
> eliminate it.
>
> So this patch is what I came up with. 

What I was getting at is that we could make it safe to call sendmsg on a
unix stream sock while its peer is being release. And not just for
sockmap. I expect io_uring might have the same problem. But I didn't
actually check yet.

For that we could keep a ref to peer for the duration of sendmsg call,
like unix dgram does. Then 'other' doesn't become a stale pointer before
we're done with it.

Bumping ref count on each sendmsg is not free, but maybe its
acceptable. Unix dgram sockets live with it.

With a patch like below, I'm no longer able to trigger an UAF splat.

WDYT?

---8<---

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3e8a04a13668..48cf19ea9294 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2198,7 +2198,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		goto out_err;
 	} else {
 		err = -ENOTCONN;
-		other = unix_peer(sk);
+		other = unix_peer_get(sk);
 		if (!other)
 			goto out_err;
 	}
@@ -2282,6 +2282,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 	}
 #endif
 
+	sock_put(other);
 	scm_destroy(&scm);
 
 	return sent;
@@ -2294,6 +2295,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		send_sig(SIGPIPE, current, 0);
 	err = -EPIPE;
 out_err:
+	if (other)
+		sock_put(other);
 	scm_destroy(&scm);
 	return sent ? : err;
 }

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-27 13:32       ` Jakub Sitnicki
@ 2023-10-27 17:38         ` Kuniyuki Iwashima
  2023-10-28  7:33           ` Jakub Sitnicki
  0 siblings, 1 reply; 16+ messages in thread
From: Kuniyuki Iwashima @ 2023-10-27 17:38 UTC (permalink / raw)
  To: jakub; +Cc: bpf, john.fastabend, martin.lau, netdev, yangyingliang, kuniyu

From: Jakub Sitnicki <jakub@cloudflare.com>
Date: Fri, 27 Oct 2023 15:32:15 +0200
> On Tue, Oct 24, 2023 at 02:39 PM -07, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
> >> > AF_UNIX sockets are a paired socket. So sending on one of the pairs
> >> > will lookup the paired socket as part of the send operation. It is
> >> > possible however to put just one of the pairs in a BPF map. This
> >> > currently increments the refcnt on the sock in the sockmap to
> >> > ensure it is not free'd by the stack before sockmap cleans up its
> >> > state and stops any skbs being sent/recv'd to that socket.
> >> >
> >> > But we missed a case. If the peer socket is closed it will be
> >> > free'd by the stack. However, the paired socket can still be
> >> > referenced from BPF sockmap side because we hold a reference
> >> > there. Then if we are sending traffic through BPF sockmap to
> >> > that socket it will try to dereference the free'd pair in its
> >> > send logic creating a use after free.  And following splat,
> >> >
> >> >    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
> >> >    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
> >> >    [...]
> >> >    [59.905468] Call Trace:
> >> >    [59.905787]  <TASK>
> >> >    [59.906066]  dump_stack_lvl+0x130/0x1d0
> >> >    [59.908877]  print_report+0x16f/0x740
> >> >    [59.910629]  kasan_report+0x118/0x160
> >> >    [59.912576]  sk_wake_async+0x31/0x1b0
> >> >    [59.913554]  sock_def_readable+0x156/0x2a0
> >> >    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
> >> >    [59.916398]  sock_sendmsg+0x20e/0x250
> >> >    [59.916854]  skb_send_sock+0x236/0xac0
> >> >    [59.920527]  sk_psock_backlog+0x287/0xaa0
> >> 
> >> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
> >> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
> >> helper.
> >
> > It does by my read. In unix_stream_connect we have,
> >
> > 	sock_hold(sk);
> > 	unix_peer(newsk)	= sk;
> > 	newsk->sk_state		= TCP_ESTABLISHED;
> >
> > where it assigns the peer sock. unix_dgram_connect() also calls
> > sock_hold() but through the path that does the socket lookup, such as
> > unix_find_other().
> >
> > The problem I see is before the socket does the kfree on the
> > sock we need to be sure the backlog is canceled and the skb list
> > ingress_skb is purged. If we don't ensure this then the redirect
> > will 
> >
> > My model is this,
> >
> >          s1            c1
> > refcnt    1             1
> > connect   2             2
> > psock     3             3
> > send(s1) ...
> > close(s1) 2             1 <- close drops psock count also
> > close(c1) 0             0
> >
> > The important bit here is the psock has a refcnt on the
> > underlying sock (psock->sk) and wont dec that until after
> > cancel_delayed_work_sync() completes. This ensures the
> > backlog wont try to sendmsg() on that sock after we free
> > it. We also check for SOCK_DEAD and abort to avoid sending
> > over a socket that has been marked DEAD.
> >
> > So... After close(s1) the only thing keeping that sock
> > around is c1. Then we close(c1) that call path is
> >
> >  unix_release
> >    close() 
> >    unix_release_sock()
> >      skpair = unix_peer(sk);
> >      ...
> >      sock_put(skpair);  <- trouble here
> >
> > The release will call sock_put() on the pair socket and
> > dec it to 0 where it gets free'd through sk_free(). But
> > now the trouble is we haven't waited for cancel_delayed_work_sync()
> > on the c1 socket yet so backlog can still run. When it does
> > run it may try to send a pkg over socket s1. OK right up until
> > the sendmsg(s1, ...) does a peer lookup and derefs the peer
> > socket. The peer socket was free'd earlier so use after free. 
> >
> > The question I had originally was this is odd, we are allowing
> > a sendmsg(s1) over a socket while its in unix_release(). We
> > used to take the sock lock from the backlog that was dropped
> > in the name of performance, but it creates these races.
> >
> > Other fixes I considered. First adding sock lock back to
> > backlog. But that punishes the UDP and TCP cases that don't
> > have this problem. Set the SOCK_DEAD flag earlier or check
> > later but this just makes the race smaller doesn't really
> > eliminate it.
> >
> > So this patch is what I came up with. 
> 
> What I was getting at is that we could make it safe to call sendmsg on a
> unix stream sock while its peer is being release. And not just for
> sockmap. I expect io_uring might have the same problem. But I didn't
> actually check yet.
> 
> For that we could keep a ref to peer for the duration of sendmsg call,
> like unix dgram does. Then 'other' doesn't become a stale pointer before
> we're done with it.
> 
> Bumping ref count on each sendmsg is not free, but maybe its
> acceptable. Unix dgram sockets live with it.

The reason why only dgram sk needs sock_hold() for each sendmsg() is
that dgram sk can send data without connect().  unix_peer_get() in
unix_dgram_sendmsg() is to reuse the same code when peer is not set.

unix_stream_sendmsg() already holds a necessary refcnt and need not
use sock_hold() there.

The user who touches a peer without lookup must hold refcnt beforehand.


> 
> With a patch like below, I'm no longer able to trigger an UAF splat.
> 
> WDYT?
> 
> ---8<---
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 3e8a04a13668..48cf19ea9294 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2198,7 +2198,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
>  		goto out_err;
>  	} else {
>  		err = -ENOTCONN;
> -		other = unix_peer(sk);
> +		other = unix_peer_get(sk);
>  		if (!other)
>  			goto out_err;
>  	}
> @@ -2282,6 +2282,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
>  	}
>  #endif
>  
> +	sock_put(other);
>  	scm_destroy(&scm);
>  
>  	return sent;
> @@ -2294,6 +2295,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
>  		send_sig(SIGPIPE, current, 0);
>  	err = -EPIPE;
>  out_err:
> +	if (other)
> +		sock_put(other);
>  	scm_destroy(&scm);
>  	return sent ? : err;
>  }

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-27 17:38         ` Kuniyuki Iwashima
@ 2023-10-28  7:33           ` Jakub Sitnicki
  2023-11-04  3:38             ` John Fastabend
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Sitnicki @ 2023-10-28  7:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: bpf, john.fastabend, martin.lau, netdev, yangyingliang

On Fri, Oct 27, 2023 at 10:38 AM -07, Kuniyuki Iwashima wrote:
> From: Jakub Sitnicki <jakub@cloudflare.com>
> Date: Fri, 27 Oct 2023 15:32:15 +0200
>> On Tue, Oct 24, 2023 at 02:39 PM -07, John Fastabend wrote:
>> > Jakub Sitnicki wrote:
>> >> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
>> >> > AF_UNIX sockets are a paired socket. So sending on one of the pairs
>> >> > will lookup the paired socket as part of the send operation. It is
>> >> > possible however to put just one of the pairs in a BPF map. This
>> >> > currently increments the refcnt on the sock in the sockmap to
>> >> > ensure it is not free'd by the stack before sockmap cleans up its
>> >> > state and stops any skbs being sent/recv'd to that socket.
>> >> >
>> >> > But we missed a case. If the peer socket is closed it will be
>> >> > free'd by the stack. However, the paired socket can still be
>> >> > referenced from BPF sockmap side because we hold a reference
>> >> > there. Then if we are sending traffic through BPF sockmap to
>> >> > that socket it will try to dereference the free'd pair in its
>> >> > send logic creating a use after free.  And following splat,
>> >> >
>> >> >    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
>> >> >    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
>> >> >    [...]
>> >> >    [59.905468] Call Trace:
>> >> >    [59.905787]  <TASK>
>> >> >    [59.906066]  dump_stack_lvl+0x130/0x1d0
>> >> >    [59.908877]  print_report+0x16f/0x740
>> >> >    [59.910629]  kasan_report+0x118/0x160
>> >> >    [59.912576]  sk_wake_async+0x31/0x1b0
>> >> >    [59.913554]  sock_def_readable+0x156/0x2a0
>> >> >    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
>> >> >    [59.916398]  sock_sendmsg+0x20e/0x250
>> >> >    [59.916854]  skb_send_sock+0x236/0xac0
>> >> >    [59.920527]  sk_psock_backlog+0x287/0xaa0
>> >> 
>> >> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
>> >> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
>> >> helper.
>> >
>> > It does by my read. In unix_stream_connect we have,
>> >
>> > 	sock_hold(sk);
>> > 	unix_peer(newsk)	= sk;
>> > 	newsk->sk_state		= TCP_ESTABLISHED;
>> >
>> > where it assigns the peer sock. unix_dgram_connect() also calls
>> > sock_hold() but through the path that does the socket lookup, such as
>> > unix_find_other().
>> >
>> > The problem I see is before the socket does the kfree on the
>> > sock we need to be sure the backlog is canceled and the skb list
>> > ingress_skb is purged. If we don't ensure this then the redirect
>> > will 
>> >
>> > My model is this,
>> >
>> >          s1            c1
>> > refcnt    1             1
>> > connect   2             2
>> > psock     3             3
>> > send(s1) ...
>> > close(s1) 2             1 <- close drops psock count also
>> > close(c1) 0             0
>> >
>> > The important bit here is the psock has a refcnt on the
>> > underlying sock (psock->sk) and wont dec that until after
>> > cancel_delayed_work_sync() completes. This ensures the
>> > backlog wont try to sendmsg() on that sock after we free
>> > it. We also check for SOCK_DEAD and abort to avoid sending
>> > over a socket that has been marked DEAD.
>> >
>> > So... After close(s1) the only thing keeping that sock
>> > around is c1. Then we close(c1) that call path is
>> >
>> >  unix_release
>> >    close() 
>> >    unix_release_sock()
>> >      skpair = unix_peer(sk);
>> >      ...
>> >      sock_put(skpair);  <- trouble here
>> >
>> > The release will call sock_put() on the pair socket and
>> > dec it to 0 where it gets free'd through sk_free(). But
>> > now the trouble is we haven't waited for cancel_delayed_work_sync()
>> > on the c1 socket yet so backlog can still run. When it does
>> > run it may try to send a pkg over socket s1. OK right up until
>> > the sendmsg(s1, ...) does a peer lookup and derefs the peer
>> > socket. The peer socket was free'd earlier so use after free. 
>> >
>> > The question I had originally was this is odd, we are allowing
>> > a sendmsg(s1) over a socket while its in unix_release(). We
>> > used to take the sock lock from the backlog that was dropped
>> > in the name of performance, but it creates these races.
>> >
>> > Other fixes I considered. First adding sock lock back to
>> > backlog. But that punishes the UDP and TCP cases that don't
>> > have this problem. Set the SOCK_DEAD flag earlier or check
>> > later but this just makes the race smaller doesn't really
>> > eliminate it.
>> >
>> > So this patch is what I came up with. 
>> 
>> What I was getting at is that we could make it safe to call sendmsg on a
>> unix stream sock while its peer is being release. And not just for
>> sockmap. I expect io_uring might have the same problem. But I didn't
>> actually check yet.
>> 
>> For that we could keep a ref to peer for the duration of sendmsg call,
>> like unix dgram does. Then 'other' doesn't become a stale pointer before
>> we're done with it.
>> 
>> Bumping ref count on each sendmsg is not free, but maybe its
>> acceptable. Unix dgram sockets live with it.
>
> The reason why only dgram sk needs sock_hold() for each sendmsg() is
> that dgram sk can send data without connect().  unix_peer_get() in
> unix_dgram_sendmsg() is to reuse the same code when peer is not set.
>
> unix_stream_sendmsg() already holds a necessary refcnt and need not
> use sock_hold() there.
>
> The user who touches a peer without lookup must hold refcnt beforehand.

Right. And this ownership scheme works well for unix stream because, as
John nicely explained, we serialize close() and sendmsg() ops with sock
lock.

Here, however, we have a case of deferred work which holds a ref to sock
but does not grab the sock lock. While it is doing its thing, the sk
gets closed/released and we drop the skpair ref. And bam, UAF.

If it wasn't for the reference cycle between sk and skpair, we could
defer the skpari ref drop until sk_destruct callback. But we can't.

If grabbing a ref on skpair on each sendmsg turns out to be not a viable
option, I didn't run any benchmarks so can't say what's the penatly
like, the next best thing is RCU.

If we can protect the skpair pointer with RCU, the memory it points to
will stay valid until unix_stream_sendmsg is done with it. I have not
given this approach a shot, so there might be something in the way.

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-28  7:33           ` Jakub Sitnicki
@ 2023-11-04  3:38             ` John Fastabend
  2023-11-06 10:15               ` Jakub Sitnicki
  0 siblings, 1 reply; 16+ messages in thread
From: John Fastabend @ 2023-11-04  3:38 UTC (permalink / raw)
  To: Jakub Sitnicki, Kuniyuki Iwashima
  Cc: bpf, john.fastabend, martin.lau, netdev, yangyingliang

Jakub Sitnicki wrote:
> On Fri, Oct 27, 2023 at 10:38 AM -07, Kuniyuki Iwashima wrote:
> > From: Jakub Sitnicki <jakub@cloudflare.com>
> > Date: Fri, 27 Oct 2023 15:32:15 +0200
> >> On Tue, Oct 24, 2023 at 02:39 PM -07, John Fastabend wrote:
> >> > Jakub Sitnicki wrote:
> >> >> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
> >> >> > AF_UNIX sockets are a paired socket. So sending on one of the pairs
> >> >> > will lookup the paired socket as part of the send operation. It is
> >> >> > possible however to put just one of the pairs in a BPF map. This
> >> >> > currently increments the refcnt on the sock in the sockmap to
> >> >> > ensure it is not free'd by the stack before sockmap cleans up its
> >> >> > state and stops any skbs being sent/recv'd to that socket.
> >> >> >
> >> >> > But we missed a case. If the peer socket is closed it will be
> >> >> > free'd by the stack. However, the paired socket can still be
> >> >> > referenced from BPF sockmap side because we hold a reference
> >> >> > there. Then if we are sending traffic through BPF sockmap to
> >> >> > that socket it will try to dereference the free'd pair in its
> >> >> > send logic creating a use after free.  And following splat,
> >> >> >
> >> >> >    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
> >> >> >    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
> >> >> >    [...]
> >> >> >    [59.905468] Call Trace:
> >> >> >    [59.905787]  <TASK>
> >> >> >    [59.906066]  dump_stack_lvl+0x130/0x1d0
> >> >> >    [59.908877]  print_report+0x16f/0x740
> >> >> >    [59.910629]  kasan_report+0x118/0x160
> >> >> >    [59.912576]  sk_wake_async+0x31/0x1b0
> >> >> >    [59.913554]  sock_def_readable+0x156/0x2a0
> >> >> >    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
> >> >> >    [59.916398]  sock_sendmsg+0x20e/0x250
> >> >> >    [59.916854]  skb_send_sock+0x236/0xac0
> >> >> >    [59.920527]  sk_psock_backlog+0x287/0xaa0
> >> >> 
> >> >> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
> >> >> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
> >> >> helper.
> >> >
> >> > It does by my read. In unix_stream_connect we have,
> >> >
> >> > 	sock_hold(sk);
> >> > 	unix_peer(newsk)	= sk;
> >> > 	newsk->sk_state		= TCP_ESTABLISHED;
> >> >
> >> > where it assigns the peer sock. unix_dgram_connect() also calls
> >> > sock_hold() but through the path that does the socket lookup, such as
> >> > unix_find_other().
> >> >
> >> > The problem I see is before the socket does the kfree on the
> >> > sock we need to be sure the backlog is canceled and the skb list
> >> > ingress_skb is purged. If we don't ensure this then the redirect
> >> > will 
> >> >
> >> > My model is this,
> >> >
> >> >          s1            c1
> >> > refcnt    1             1
> >> > connect   2             2
> >> > psock     3             3
> >> > send(s1) ...
> >> > close(s1) 2             1 <- close drops psock count also
> >> > close(c1) 0             0
> >> >
> >> > The important bit here is the psock has a refcnt on the
> >> > underlying sock (psock->sk) and wont dec that until after
> >> > cancel_delayed_work_sync() completes. This ensures the
> >> > backlog wont try to sendmsg() on that sock after we free
> >> > it. We also check for SOCK_DEAD and abort to avoid sending
> >> > over a socket that has been marked DEAD.
> >> >
> >> > So... After close(s1) the only thing keeping that sock
> >> > around is c1. Then we close(c1) that call path is
> >> >
> >> >  unix_release
> >> >    close() 
> >> >    unix_release_sock()
> >> >      skpair = unix_peer(sk);
> >> >      ...
> >> >      sock_put(skpair);  <- trouble here
> >> >
> >> > The release will call sock_put() on the pair socket and
> >> > dec it to 0 where it gets free'd through sk_free(). But
> >> > now the trouble is we haven't waited for cancel_delayed_work_sync()
> >> > on the c1 socket yet so backlog can still run. When it does
> >> > run it may try to send a pkg over socket s1. OK right up until
> >> > the sendmsg(s1, ...) does a peer lookup and derefs the peer
> >> > socket. The peer socket was free'd earlier so use after free. 
> >> >
> >> > The question I had originally was this is odd, we are allowing
> >> > a sendmsg(s1) over a socket while its in unix_release(). We
> >> > used to take the sock lock from the backlog that was dropped
> >> > in the name of performance, but it creates these races.
> >> >
> >> > Other fixes I considered. First adding sock lock back to
> >> > backlog. But that punishes the UDP and TCP cases that don't
> >> > have this problem. Set the SOCK_DEAD flag earlier or check
> >> > later but this just makes the race smaller doesn't really
> >> > eliminate it.
> >> >
> >> > So this patch is what I came up with. 
> >> 
> >> What I was getting at is that we could make it safe to call sendmsg on a
> >> unix stream sock while its peer is being release. And not just for
> >> sockmap. I expect io_uring might have the same problem. But I didn't
> >> actually check yet.
> >> 
> >> For that we could keep a ref to peer for the duration of sendmsg call,
> >> like unix dgram does. Then 'other' doesn't become a stale pointer before
> >> we're done with it.
> >> 
> >> Bumping ref count on each sendmsg is not free, but maybe its
> >> acceptable. Unix dgram sockets live with it.
> >
> > The reason why only dgram sk needs sock_hold() for each sendmsg() is
> > that dgram sk can send data without connect().  unix_peer_get() in
> > unix_dgram_sendmsg() is to reuse the same code when peer is not set.
> >
> > unix_stream_sendmsg() already holds a necessary refcnt and need not
> > use sock_hold() there.
> >
> > The user who touches a peer without lookup must hold refcnt beforehand.

Hi, we probably do need to get a fix for this. syzkaller hit it again
and anyways it likely will crash some real systems if folks try to use
it with enough systems.

> 
> Right. And this ownership scheme works well for unix stream because, as
> John nicely explained, we serialize close() and sendmsg() ops with sock
> lock.
> 
> Here, however, we have a case of deferred work which holds a ref to sock
> but does not grab the sock lock. While it is doing its thing, the sk
> gets closed/released and we drop the skpair ref. And bam, UAF.
> 
> If it wasn't for the reference cycle between sk and skpair, we could
> defer the skpari ref drop until sk_destruct callback. But we can't.
> 
> If grabbing a ref on skpair on each sendmsg turns out to be not a viable
> option, I didn't run any benchmarks so can't say what's the penatly
> like, the next best thing is RCU.

I think it really would be best to stay out of the presumably hotpath
here if we can.

I had considered marking the socket SOCK_RCU_FREE which should then
wait an rcu grace period. But then it wasn't clear to me that would
completely solve the race. The psock still needs to do the 
cancel_delayed_work_sync() and this is also done from the rcu call
back on the psock. Withtout the extra reference iirc the concern
was we would basically have two rcu callbacks running that have
an ordering requirement which I don't think is ensured.

> 
> If we can protect the skpair pointer with RCU, the memory it points to
> will stay valid until unix_stream_sendmsg is done with it. I have not
> given this approach a shot, so there might be something in the way.
> 
> [...]

Thanks,
John

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-11-04  3:38             ` John Fastabend
@ 2023-11-06 10:15               ` Jakub Sitnicki
  0 siblings, 0 replies; 16+ messages in thread
From: Jakub Sitnicki @ 2023-11-06 10:15 UTC (permalink / raw)
  To: John Fastabend; +Cc: Kuniyuki Iwashima, bpf, martin.lau, netdev, yangyingliang

On Fri, Nov 03, 2023 at 08:38 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Fri, Oct 27, 2023 at 10:38 AM -07, Kuniyuki Iwashima wrote:
>> > From: Jakub Sitnicki <jakub@cloudflare.com>
>> > Date: Fri, 27 Oct 2023 15:32:15 +0200
>> >> On Tue, Oct 24, 2023 at 02:39 PM -07, John Fastabend wrote:
>> >> > Jakub Sitnicki wrote:
>> >> >> On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
>> >> >> > AF_UNIX sockets are a paired socket. So sending on one of the pairs
>> >> >> > will lookup the paired socket as part of the send operation. It is
>> >> >> > possible however to put just one of the pairs in a BPF map. This
>> >> >> > currently increments the refcnt on the sock in the sockmap to
>> >> >> > ensure it is not free'd by the stack before sockmap cleans up its
>> >> >> > state and stops any skbs being sent/recv'd to that socket.
>> >> >> >
>> >> >> > But we missed a case. If the peer socket is closed it will be
>> >> >> > free'd by the stack. However, the paired socket can still be
>> >> >> > referenced from BPF sockmap side because we hold a reference
>> >> >> > there. Then if we are sending traffic through BPF sockmap to
>> >> >> > that socket it will try to dereference the free'd pair in its
>> >> >> > send logic creating a use after free.  And following splat,
>> >> >> >
>> >> >> >    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
>> >> >> >    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
>> >> >> >    [...]
>> >> >> >    [59.905468] Call Trace:
>> >> >> >    [59.905787]  <TASK>
>> >> >> >    [59.906066]  dump_stack_lvl+0x130/0x1d0
>> >> >> >    [59.908877]  print_report+0x16f/0x740
>> >> >> >    [59.910629]  kasan_report+0x118/0x160
>> >> >> >    [59.912576]  sk_wake_async+0x31/0x1b0
>> >> >> >    [59.913554]  sock_def_readable+0x156/0x2a0
>> >> >> >    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
>> >> >> >    [59.916398]  sock_sendmsg+0x20e/0x250
>> >> >> >    [59.916854]  skb_send_sock+0x236/0xac0
>> >> >> >    [59.920527]  sk_psock_backlog+0x287/0xaa0
>> >> >> 
>> >> >> Isn't the problem here that unix_stream_sendmsg doesn't grab a ref to
>> >> >> peer sock? Unlike unix_dgram_sendmsg which uses the unix_peer_get
>> >> >> helper.
>> >> >
>> >> > It does by my read. In unix_stream_connect we have,
>> >> >
>> >> > 	sock_hold(sk);
>> >> > 	unix_peer(newsk)	= sk;
>> >> > 	newsk->sk_state		= TCP_ESTABLISHED;
>> >> >
>> >> > where it assigns the peer sock. unix_dgram_connect() also calls
>> >> > sock_hold() but through the path that does the socket lookup, such as
>> >> > unix_find_other().
>> >> >
>> >> > The problem I see is before the socket does the kfree on the
>> >> > sock we need to be sure the backlog is canceled and the skb list
>> >> > ingress_skb is purged. If we don't ensure this then the redirect
>> >> > will 
>> >> >
>> >> > My model is this,
>> >> >
>> >> >          s1            c1
>> >> > refcnt    1             1
>> >> > connect   2             2
>> >> > psock     3             3
>> >> > send(s1) ...
>> >> > close(s1) 2             1 <- close drops psock count also
>> >> > close(c1) 0             0
>> >> >
>> >> > The important bit here is the psock has a refcnt on the
>> >> > underlying sock (psock->sk) and wont dec that until after
>> >> > cancel_delayed_work_sync() completes. This ensures the
>> >> > backlog wont try to sendmsg() on that sock after we free
>> >> > it. We also check for SOCK_DEAD and abort to avoid sending
>> >> > over a socket that has been marked DEAD.
>> >> >
>> >> > So... After close(s1) the only thing keeping that sock
>> >> > around is c1. Then we close(c1) that call path is
>> >> >
>> >> >  unix_release
>> >> >    close() 
>> >> >    unix_release_sock()
>> >> >      skpair = unix_peer(sk);
>> >> >      ...
>> >> >      sock_put(skpair);  <- trouble here
>> >> >
>> >> > The release will call sock_put() on the pair socket and
>> >> > dec it to 0 where it gets free'd through sk_free(). But
>> >> > now the trouble is we haven't waited for cancel_delayed_work_sync()
>> >> > on the c1 socket yet so backlog can still run. When it does
>> >> > run it may try to send a pkg over socket s1. OK right up until
>> >> > the sendmsg(s1, ...) does a peer lookup and derefs the peer
>> >> > socket. The peer socket was free'd earlier so use after free. 
>> >> >
>> >> > The question I had originally was this is odd, we are allowing
>> >> > a sendmsg(s1) over a socket while its in unix_release(). We
>> >> > used to take the sock lock from the backlog that was dropped
>> >> > in the name of performance, but it creates these races.
>> >> >
>> >> > Other fixes I considered. First adding sock lock back to
>> >> > backlog. But that punishes the UDP and TCP cases that don't
>> >> > have this problem. Set the SOCK_DEAD flag earlier or check
>> >> > later but this just makes the race smaller doesn't really
>> >> > eliminate it.
>> >> >
>> >> > So this patch is what I came up with. 
>> >> 
>> >> What I was getting at is that we could make it safe to call sendmsg on a
>> >> unix stream sock while its peer is being release. And not just for
>> >> sockmap. I expect io_uring might have the same problem. But I didn't
>> >> actually check yet.
>> >> 
>> >> For that we could keep a ref to peer for the duration of sendmsg call,
>> >> like unix dgram does. Then 'other' doesn't become a stale pointer before
>> >> we're done with it.
>> >> 
>> >> Bumping ref count on each sendmsg is not free, but maybe its
>> >> acceptable. Unix dgram sockets live with it.
>> >
>> > The reason why only dgram sk needs sock_hold() for each sendmsg() is
>> > that dgram sk can send data without connect().  unix_peer_get() in
>> > unix_dgram_sendmsg() is to reuse the same code when peer is not set.
>> >
>> > unix_stream_sendmsg() already holds a necessary refcnt and need not
>> > use sock_hold() there.
>> >
>> > The user who touches a peer without lookup must hold refcnt beforehand.
>
> Hi, we probably do need to get a fix for this. syzkaller hit it again
> and anyways it likely will crash some real systems if folks try to use
> it with enough systems.
>
>> 
>> Right. And this ownership scheme works well for unix stream because, as
>> John nicely explained, we serialize close() and sendmsg() ops with sock
>> lock.
>> 
>> Here, however, we have a case of deferred work which holds a ref to sock
>> but does not grab the sock lock. While it is doing its thing, the sk
>> gets closed/released and we drop the skpair ref. And bam, UAF.
>> 
>> If it wasn't for the reference cycle between sk and skpair, we could
>> defer the skpari ref drop until sk_destruct callback. But we can't.
>> 
>> If grabbing a ref on skpair on each sendmsg turns out to be not a viable
>> option, I didn't run any benchmarks so can't say what's the penatly
>> like, the next best thing is RCU.
>
> I think it really would be best to stay out of the presumably hotpath
> here if we can.
>
> I had considered marking the socket SOCK_RCU_FREE which should then
> wait an rcu grace period. But then it wasn't clear to me that would
> completely solve the race. The psock still needs to do the 
> cancel_delayed_work_sync() and this is also done from the rcu call
> back on the psock. Withtout the extra reference iirc the concern
> was we would basically have two rcu callbacks running that have
> an ordering requirement which I don't think is ensured.
>

I've checked io_uring and it keeps a ref to the related file for the
lifetime the I/O request. So I think we are good there and it's only
sockmap that is "bleeding".

I agree it would be best not to hamper unix_stream sendmsg performance.

At the same time, I think we can easily make the extra ref grab in
unix_stream_sendmsg optional, keeping the fast path fast (modulo 2x a
branch op).

I realize RCU would be a bigger, riskier overhaul. Perhaps not worth it,
if there are no benefits for "the regular" unix_stream users.

If we can avoid managing more state in psock, that would be my
preference. But I don't want to block any fixes that users are waiting
for.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-10-16 19:08 ` [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock John Fastabend
  2023-10-18 10:40   ` Jakub Sitnicki
@ 2023-11-06 12:35   ` Jakub Sitnicki
  2023-11-20 21:13     ` Martin KaFai Lau
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Sitnicki @ 2023-11-06 12:35 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau

On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
> AF_UNIX sockets are a paired socket. So sending on one of the pairs
> will lookup the paired socket as part of the send operation. It is
> possible however to put just one of the pairs in a BPF map. This
> currently increments the refcnt on the sock in the sockmap to
> ensure it is not free'd by the stack before sockmap cleans up its
> state and stops any skbs being sent/recv'd to that socket.
>
> But we missed a case. If the peer socket is closed it will be
> free'd by the stack. However, the paired socket can still be
> referenced from BPF sockmap side because we hold a reference
> there. Then if we are sending traffic through BPF sockmap to
> that socket it will try to dereference the free'd pair in its
> send logic creating a use after free.  And following splat,
>
>    [59.900375] BUG: KASAN: slab-use-after-free in sk_wake_async+0x31/0x1b0
>    [59.901211] Read of size 8 at addr ffff88811acbf060 by task kworker/1:2/954
>    [...]
>    [59.905468] Call Trace:
>    [59.905787]  <TASK>
>    [59.906066]  dump_stack_lvl+0x130/0x1d0
>    [59.908877]  print_report+0x16f/0x740
>    [59.910629]  kasan_report+0x118/0x160
>    [59.912576]  sk_wake_async+0x31/0x1b0
>    [59.913554]  sock_def_readable+0x156/0x2a0
>    [59.914060]  unix_stream_sendmsg+0x3f9/0x12a0
>    [59.916398]  sock_sendmsg+0x20e/0x250
>    [59.916854]  skb_send_sock+0x236/0xac0
>    [59.920527]  sk_psock_backlog+0x287/0xaa0
>
> To fix let BPF sockmap hold a refcnt on both the socket in the
> sockmap and its paired socket.  It wasn't obvious how to contain
> the fix to bpf_unix logic. The primarily problem with keeping this
> logic in bpf_unix was: In the sock close() we could handle the
> deref by having a close handler. But, when we are destroying the
> psock through a map delete operation we wouldn't have gotten any
> signal thorugh the proto struct other than it being replaced.
> If we do the deref from the proto replace its too early because
> we need to deref the skpair after the backlog worker has been
> stopped.
>
> Given all this it seems best to just cache it at the end of the
> psock and eat 8B for the af_unix and vsock users.
>
> Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h |  1 +
>  include/net/af_unix.h |  1 +
>  net/core/skmsg.c      |  2 ++
>  net/unix/af_unix.c    |  2 --
>  net/unix/unix_bpf.c   | 10 ++++++++++
>  5 files changed, 14 insertions(+), 2 deletions(-)
>

[...]

> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 3e8a04a13668..87dd723aacf9 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -212,8 +212,6 @@ static inline bool unix_secdata_eq(struct scm_cookie *scm, struct sk_buff *skb)
>  }
>  #endif /* CONFIG_SECURITY_NETWORK */
>  
> -#define unix_peer(sk) (unix_sk(sk)->peer)
> -
>  static inline int unix_our_peer(struct sock *sk, struct sock *osk)
>  {
>  	return unix_peer(osk) == sk;
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index 2f9d8271c6ec..705eeed10be3 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -143,6 +143,8 @@ static void unix_stream_bpf_check_needs_rebuild(struct proto *ops)
>  
>  int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
> +	struct sock *skpair;
> +
>  	if (sk->sk_type != SOCK_DGRAM)
>  		return -EOPNOTSUPP;
>  
> @@ -152,6 +154,9 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
>  		return 0;
>  	}
>  
> +	skpair = unix_peer(sk);
> +	sock_hold(skpair);
> +	psock->skpair = skpair;
>  	unix_dgram_bpf_check_needs_rebuild(psock->sk_proto);
>  	sock_replace_proto(sk, &unix_dgram_bpf_prot);
>  	return 0;

unix_dgram should not need this, since it grabs a ref on each sendmsg.

I'm not able to reproduce this bug for unix_dgram.

Have you seen any KASAN reports for unix_dgram from syzcaller?

> @@ -159,12 +164,17 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
>  
>  int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>  {
> +	struct sock *skpair = unix_peer(sk);
> +
>  	if (restore) {
>  		sk->sk_write_space = psock->saved_write_space;
>  		sock_replace_proto(sk, psock->sk_proto);
>  		return 0;
>  	}
>  
> +	skpair = unix_peer(sk);
> +	sock_hold(skpair);
> +	psock->skpair = skpair;
>  	unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
>  	sock_replace_proto(sk, &unix_stream_bpf_prot);
>  	return 0;


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map
  2023-10-16 19:08 ` [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map John Fastabend
@ 2023-11-06 12:44   ` Jakub Sitnicki
  2023-11-06 14:42     ` Jakub Sitnicki
  0 siblings, 1 reply; 16+ messages in thread
From: Jakub Sitnicki @ 2023-11-06 12:44 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau

On Mon, Oct 16, 2023 at 12:08 PM -07, John Fastabend wrote:
> This adds a test where both pairs of a af_unix paired socket are put into
> a BPF map. This ensures that when we tear down the af_unix pair we don't
> have any issues on sockmap side with ordering and reference counting.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  .../selftests/bpf/prog_tests/sockmap_listen.c | 39 ++++++++++++++++---
>  .../selftests/bpf/progs/test_sockmap_listen.c |  7 ++++
>  2 files changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> index 8df8cbb447f1..90e97907c1c1 100644
> --- a/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_listen.c
> @@ -1824,8 +1824,10 @@ static void inet_unix_skb_redir_to_connected(struct test_sockmap_listen *skel,
>  	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
>  }
>  
> -static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
> -					int verd_mapfd, enum redir_mode mode)
> +static void unix_inet_redir_to_connected(int family, int type,
> +					int sock_mapfd, int nop_mapfd,
> +					int verd_mapfd,
> +					enum redir_mode mode)
>  {
>  	const char *log_prefix = redir_mode_str(mode);
>  	int c0, c1, p0, p1;
> @@ -1849,6 +1851,12 @@ static void unix_inet_redir_to_connected(int family, int type, int sock_mapfd,
>  	if (err)
>  		goto close;
>  
> +	if (nop_mapfd >= 0) {
> +		err = add_to_sockmap(nop_mapfd, c0, c1);
> +		if (err)
> +			goto close;
> +	}
> +
>  	n = write(c1, "a", 1);
>  	if (n < 0)
>  		FAIL_ERRNO("%s: write", log_prefix);
> @@ -1883,6 +1891,7 @@ static void unix_inet_skb_redir_to_connected(struct test_sockmap_listen *skel,
>  					    struct bpf_map *inner_map, int family)
>  {
>  	int verdict = bpf_program__fd(skel->progs.prog_skb_verdict);
> +	int nop_map = bpf_map__fd(skel->maps.nop_map);
>  	int verdict_map = bpf_map__fd(skel->maps.verdict_map);
>  	int sock_map = bpf_map__fd(inner_map);
>  	int err;
> @@ -1892,14 +1901,32 @@ static void unix_inet_skb_redir_to_connected(struct test_sockmap_listen *skel,
>  		return;
>  
>  	skel->bss->test_ingress = false;
> -	unix_inet_redir_to_connected(family, SOCK_DGRAM, sock_map, verdict_map,
> +	unix_inet_redir_to_connected(family, SOCK_DGRAM,
> +				     sock_map, -1, verdict_map,
> +				     REDIR_EGRESS);
> +	unix_inet_redir_to_connected(family, SOCK_DGRAM,
> +				     sock_map, -1, verdict_map,
>  				     REDIR_EGRESS);
> -	unix_inet_redir_to_connected(family, SOCK_STREAM, sock_map, verdict_map,
> +
> +	unix_inet_redir_to_connected(family, SOCK_DGRAM,
> +				     sock_map, nop_map, verdict_map,
> +				     REDIR_EGRESS);
> +	unix_inet_redir_to_connected(family, SOCK_STREAM,
> +				     sock_map, nop_map, verdict_map,
>  				     REDIR_EGRESS);
>  	skel->bss->test_ingress = true;
> -	unix_inet_redir_to_connected(family, SOCK_DGRAM, sock_map, verdict_map,
> +	unix_inet_redir_to_connected(family, SOCK_DGRAM,
> +				     sock_map, -1, verdict_map,
> +				     REDIR_INGRESS);
> +	unix_inet_redir_to_connected(family, SOCK_STREAM,
> +				     sock_map, -1, verdict_map,
> +				     REDIR_INGRESS);
> +
> +	unix_inet_redir_to_connected(family, SOCK_DGRAM,
> +				     sock_map, nop_map, verdict_map,
>  				     REDIR_INGRESS);
> -	unix_inet_redir_to_connected(family, SOCK_STREAM, sock_map, verdict_map,
> +	unix_inet_redir_to_connected(family, SOCK_STREAM,
> +				     sock_map, nop_map, verdict_map,
>  				     REDIR_INGRESS);
>  
>  	xbpf_prog_detach2(verdict, sock_map, BPF_SK_SKB_VERDICT);
> diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
> index 464d35bd57c7..b7250eb9c30c 100644
> --- a/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
> +++ b/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
> @@ -14,6 +14,13 @@ struct {
>  	__type(value, __u64);
>  } sock_map SEC(".maps");
>  
> +struct {
> +	__uint(type, BPF_MAP_TYPE_SOCKMAP);
> +	__uint(max_entries, 2);
> +	__type(key, __u32);
> +	__type(value, __u64);
> +} nop_map SEC(".maps");
> +
>  struct {
>  	__uint(type, BPF_MAP_TYPE_SOCKHASH);
>  	__uint(max_entries, 2);

So... we have a bug in unix_inet_redir_to_connected() - it happily
ignores the passed socket type, which is currently hardcoded to
SOCK_DGRAM :-)

Which means these tests don't exercise unix_stream paths where the added
psock->skpair is actually needed.

But I'm able to reproduce the bug by running the VSOCK redir test:

bash-5.2# ./test_progs -n 212/79
[   23.232282] ==================================================================
[   23.232634] BUG: KASAN: slab-use-after-free in sock_def_readable+0xe3/0x400
[   23.232942] Read of size 8 at addr ffff8881013f9860 by task kworker/1:2/220
[   23.233253]
[   23.233326] CPU: 1 PID: 220 Comm: kworker/1:2 Tainted: G           OE      6.6.0 #30
[   23.233697] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[   23.234074] Workqueue: events sk_psock_backlog
[   23.234271] Call Trace:
[   23.234381]  <TASK>
[   23.234477]  dump_stack_lvl+0x4a/0x90
[   23.234640]  print_address_description.constprop.0+0x33/0x400
[   23.234888]  ? preempt_count_sub+0x13/0xc0
[   23.235071]  print_report+0xb6/0x260
[   23.235228]  ? kasan_complete_mode_report_info+0x7c/0x1f0
[   23.235462]  kasan_report+0xd0/0x110
[   23.235619]  ? sock_def_readable+0xe3/0x400
[   23.235801]  ? sock_def_readable+0xe3/0x400
[   23.235989]  kasan_check_range+0xf7/0x1b0
[   23.236164]  __kasan_check_read+0x11/0x20
[   23.236340]  sock_def_readable+0xe3/0x400
[   23.236514]  unix_stream_sendmsg+0x3c5/0x7d0
[   23.236704]  ? queue_oob+0x300/0x300
[   23.236865]  sock_sendmsg+0x229/0x250
[   23.237030]  ? sock_write_iter+0x320/0x320
[   23.237211]  ? __this_cpu_preempt_check+0x13/0x20
[   23.237416]  ? lock_acquire+0x191/0x410
[   23.237607]  ? lock_sync+0x110/0x110
[   23.237766]  ? lock_is_held_type+0xd0/0x130
[   23.237948]  ? __asan_storeN+0x12/0x20
[   23.238115]  __skb_send_sock+0x4fa/0x670
[   23.238288]  ? preempt_count_sub+0x13/0xc0
[   23.238494]  ? sendmsg_locked+0x90/0x90
[   23.238721]  ? sendmsg_unlocked+0x40/0x40
[   23.238975]  ? __lock_acquire+0x765/0xf00
[   23.239252]  ? __this_cpu_preempt_check+0x13/0x20
[   23.239570]  ? lock_acquire+0x191/0x410
[   23.239831]  skb_send_sock+0x10/0x20
[   23.240079]  sk_psock_backlog+0x141/0x5e0
[   23.240340]  ? __this_cpu_preempt_check+0x13/0x20
[   23.240638]  process_one_work+0x49d/0x970
[   23.240900]  ? drain_workqueue+0x1c0/0x1c0
[   23.241173]  ? assign_work+0xe1/0x120
[   23.241404]  worker_thread+0x380/0x680
[   23.241660]  ? trace_hardirqs_on+0x22/0x100
[   23.241933]  ? preempt_count_sub+0x13/0xc0
[   23.242213]  ? process_one_work+0x970/0x970
[   23.242491]  kthread+0x1ba/0x200
[   23.242687]  ? kthread+0xfe/0x200
[   23.242890]  ? kthread_complete_and_exit+0x20/0x20
[   23.243193]  ret_from_fork+0x35/0x60
[   23.243418]  ? kthread_complete_and_exit+0x20/0x20
[   23.243718]  ret_from_fork_asm+0x11/0x20
[   23.243995]  </TASK>
[   23.244145]
[   23.244227] Allocated by task 227:
[   23.244446]  kasan_save_stack+0x26/0x50
[   23.244709]  kasan_set_track+0x25/0x40
[   23.244951]  kasan_save_alloc_info+0x1e/0x30
[   23.245220]  __kasan_slab_alloc+0x72/0x80
[   23.245491]  kmem_cache_alloc+0x182/0x360
[   23.245758]  sk_prot_alloc+0x43/0x160
[   23.246007]  sk_alloc+0x2c/0x3a0
[   23.246216]  unix_create1+0x86/0x440
[   23.246462]  unix_create+0x7d/0x100
[   23.246701]  __sock_create+0x1bc/0x420
[   23.246960]  __sys_socketpair+0x1ac/0x3a0
[   23.247237]  __x64_sys_socketpair+0x4f/0x60
[   23.247521]  do_syscall_64+0x38/0x90
[   23.247769]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   23.248113]
[   23.248225] Freed by task 227:
[   23.248444]  kasan_save_stack+0x26/0x50
[   23.248707]  kasan_set_track+0x25/0x40
[   23.248963]  kasan_save_free_info+0x2b/0x50
[   23.249249]  ____kasan_slab_free+0x154/0x1c0
[   23.249541]  __kasan_slab_free+0x12/0x20
[   23.249810]  kmem_cache_free+0x1e7/0x480
[   23.250084]  __sk_destruct+0x270/0x3f0
[   23.250342]  sk_destruct+0x78/0x90
[   23.250577]  __sk_free+0x51/0x160
[   23.250807]  sk_free+0x45/0x70
[   23.251025]  unix_release_sock+0x5cc/0x700
[   23.251301]  unix_release+0x50/0x70
[   23.251536]  __sock_release+0x5f/0x120
[   23.251754]  sock_close+0x13/0x20
[   23.252109]  __fput+0x1f3/0x470
[   23.252451]  __fput_sync+0x2f/0x40
[   23.252811]  __x64_sys_close+0x51/0x90
[   23.253169]  do_syscall_64+0x38/0x90
[   23.253480]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[   23.253940]
[   23.254097] The buggy address belongs to the object at ffff8881013f9800
[   23.254097]  which belongs to the cache UNIX-STREAM of size 1920
[   23.254936] The buggy address is located 96 bytes inside of
[   23.254936]  freed 1920-byte region [ffff8881013f9800, ffff8881013f9f80)
[   23.255731]
[   23.255844] The buggy address belongs to the physical page:
[   23.256225] page:00000000c005ecb3 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8881013f8000 pfn:0x1013f8
[   23.256905] head:00000000c005ecb3 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[   23.257382] flags: 0x2fffe000000840(slab|head|node=0|zone=2|lastcpupid=0x7fff)
[   23.257791] page_type: 0xffffffff()
[   23.257988] raw: 002fffe000000840 ffff888100961a40 dead000000000122 0000000000000000
[   23.258418] raw: ffff8881013f8000 000000008010000e 00000001ffffffff 0000000000000000
[   23.258817] page dumped because: kasan: bad access detected
[   23.259131]
[   23.259205] Memory state around the buggy address:
[   23.259469]  ffff8881013f9700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   23.259871]  ffff8881013f9780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   23.260290] >ffff8881013f9800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   23.260704]                                                        ^
[   23.261056]  ffff8881013f9880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   23.261469]  ffff8881013f9900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   23.261854] ==================================================================
[   23.262453] Disabling lock debugging due to kernel taint
#212/79  sockmap_listen/sockmap VSOCK test_vsock_redir:OK
#212     sockmap_listen:OK
Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
bash-5.2#

If I modify the test to use (AF_UNIX, SOCK_DGRAM) instead of
SOCK_STREAM, the bug no longer reproduces.

Which confirms my thinking that unix_dgram_sendmsg is safe to use from
sockmap because it grabs a ref to skpair.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map
  2023-11-06 12:44   ` Jakub Sitnicki
@ 2023-11-06 14:42     ` Jakub Sitnicki
  0 siblings, 0 replies; 16+ messages in thread
From: Jakub Sitnicki @ 2023-11-06 14:42 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau

On Mon, Nov 06, 2023 at 01:44 PM +01, Jakub Sitnicki wrote:
> But I'm able to reproduce the bug by running the VSOCK redir test:

I should clarify - the bug reproduces only without the patch #1 applied.

The fix is good. It just has some unneeded bits (unix_dgram), IMO.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-11-06 12:35   ` Jakub Sitnicki
@ 2023-11-20 21:13     ` Martin KaFai Lau
  2023-11-21 20:40       ` John Fastabend
  0 siblings, 1 reply; 16+ messages in thread
From: Martin KaFai Lau @ 2023-11-20 21:13 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, yangyingliang, martin.lau, Jakub Sitnicki

On 11/6/23 4:35 AM, Jakub Sitnicki wrote:
>> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
>> index 2f9d8271c6ec..705eeed10be3 100644
>> --- a/net/unix/unix_bpf.c
>> +++ b/net/unix/unix_bpf.c
>> @@ -143,6 +143,8 @@ static void unix_stream_bpf_check_needs_rebuild(struct proto *ops)
>>   
>>   int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
>>   {
>> +	struct sock *skpair;
>> +
>>   	if (sk->sk_type != SOCK_DGRAM)
>>   		return -EOPNOTSUPP;
>>   
>> @@ -152,6 +154,9 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
>>   		return 0;
>>   	}
>>   
>> +	skpair = unix_peer(sk);
>> +	sock_hold(skpair);
>> +	psock->skpair = skpair;
>>   	unix_dgram_bpf_check_needs_rebuild(psock->sk_proto);
>>   	sock_replace_proto(sk, &unix_dgram_bpf_prot);
>>   	return 0;
> unix_dgram should not need this, since it grabs a ref on each sendmsg.

John, could you address this comment and respin v2?

The unix_inet_redir_to_connected() seems needing a fix in patch 2 also as 
pointed out by JakubS.

Thanks.

> 
> I'm not able to reproduce this bug for unix_dgram.
> 
> Have you seen any KASAN reports for unix_dgram from syzcaller?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-11-20 21:13     ` Martin KaFai Lau
@ 2023-11-21 20:40       ` John Fastabend
  2023-11-22 19:26         ` John Fastabend
  0 siblings, 1 reply; 16+ messages in thread
From: John Fastabend @ 2023-11-21 20:40 UTC (permalink / raw)
  To: Martin KaFai Lau, John Fastabend
  Cc: bpf, netdev, yangyingliang, martin.lau, Jakub Sitnicki

Martin KaFai Lau wrote:
> On 11/6/23 4:35 AM, Jakub Sitnicki wrote:
> >> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> >> index 2f9d8271c6ec..705eeed10be3 100644
> >> --- a/net/unix/unix_bpf.c
> >> +++ b/net/unix/unix_bpf.c
> >> @@ -143,6 +143,8 @@ static void unix_stream_bpf_check_needs_rebuild(struct proto *ops)
> >>   
> >>   int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> >>   {
> >> +	struct sock *skpair;
> >> +
> >>   	if (sk->sk_type != SOCK_DGRAM)
> >>   		return -EOPNOTSUPP;
> >>   
> >> @@ -152,6 +154,9 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
> >>   		return 0;
> >>   	}
> >>   
> >> +	skpair = unix_peer(sk);
> >> +	sock_hold(skpair);
> >> +	psock->skpair = skpair;
> >>   	unix_dgram_bpf_check_needs_rebuild(psock->sk_proto);
> >>   	sock_replace_proto(sk, &unix_dgram_bpf_prot);
> >>   	return 0;
> > unix_dgram should not need this, since it grabs a ref on each sendmsg.
> 
> John, could you address this comment and respin v2?

Respinning now just letting some tests run for a bit and I'll kick it out.

Thanks.

> 
> The unix_inet_redir_to_connected() seems needing a fix in patch 2 also as 
> pointed out by JakubS.
> 
> Thanks.
> 
> > 
> > I'm not able to reproduce this bug for unix_dgram.
> > 
> > Have you seen any KASAN reports for unix_dgram from syzcaller?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock
  2023-11-21 20:40       ` John Fastabend
@ 2023-11-22 19:26         ` John Fastabend
  0 siblings, 0 replies; 16+ messages in thread
From: John Fastabend @ 2023-11-22 19:26 UTC (permalink / raw)
  To: John Fastabend, Martin KaFai Lau, John Fastabend
  Cc: bpf, netdev, yangyingliang, martin.lau, Jakub Sitnicki

John Fastabend wrote:
> Martin KaFai Lau wrote:
> > On 11/6/23 4:35 AM, Jakub Sitnicki wrote:
> > >> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> > >> index 2f9d8271c6ec..705eeed10be3 100644
> > >> --- a/net/unix/unix_bpf.c
> > >> +++ b/net/unix/unix_bpf.c
> > >> @@ -143,6 +143,8 @@ static void unix_stream_bpf_check_needs_rebuild(struct proto *ops)
> > >>   
> > >>   int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> > >>   {
> > >> +	struct sock *skpair;
> > >> +
> > >>   	if (sk->sk_type != SOCK_DGRAM)
> > >>   		return -EOPNOTSUPP;
> > >>   
> > >> @@ -152,6 +154,9 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
> > >>   		return 0;
> > >>   	}
> > >>   
> > >> +	skpair = unix_peer(sk);
> > >> +	sock_hold(skpair);
> > >> +	psock->skpair = skpair;
> > >>   	unix_dgram_bpf_check_needs_rebuild(psock->sk_proto);
> > >>   	sock_replace_proto(sk, &unix_dgram_bpf_prot);
> > >>   	return 0;
> > > unix_dgram should not need this, since it grabs a ref on each sendmsg.
> > 
> > John, could you address this comment and respin v2?
> 
> Respinning now just letting some tests run for a bit and I'll kick it out.


v2 on the list. Unfortunately the simple fix to the selftests to test STREAM
and DGRAM types caused a test failure. I look at it Monday unless someone
beats me to it.

> 
> Thanks.
> 
> > 
> > The unix_inet_redir_to_connected() seems needing a fix in patch 2 also as 
> > pointed out by JakubS.
> > 
> > Thanks.
> > 
> > > 
> > > I'm not able to reproduce this bug for unix_dgram.
> > > 
> > > Have you seen any KASAN reports for unix_dgram from syzcaller?



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-11-22 19:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-16 19:08 [PATCH bpf 0/2] sockmap fix for KASAN_VMALLOC and af_unix John Fastabend
2023-10-16 19:08 ` [PATCH bpf 1/2] bpf: sockmap, af_unix sockets need to hold ref for pair sock John Fastabend
2023-10-18 10:40   ` Jakub Sitnicki
2023-10-24 21:39     ` John Fastabend
2023-10-27 13:32       ` Jakub Sitnicki
2023-10-27 17:38         ` Kuniyuki Iwashima
2023-10-28  7:33           ` Jakub Sitnicki
2023-11-04  3:38             ` John Fastabend
2023-11-06 10:15               ` Jakub Sitnicki
2023-11-06 12:35   ` Jakub Sitnicki
2023-11-20 21:13     ` Martin KaFai Lau
2023-11-21 20:40       ` John Fastabend
2023-11-22 19:26         ` John Fastabend
2023-10-16 19:08 ` [PATCH bpf 2/2] bpf: sockmap, add af_unix test with both sockets in map John Fastabend
2023-11-06 12:44   ` Jakub Sitnicki
2023-11-06 14:42     ` Jakub Sitnicki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).