bpf.vger.kernel.org archive mirror
* [PATCH bpf-next 0/2] Keep reference on socket file while wait send memory
@ 2022-08-15  2:33 Liu Jian
  2022-08-15  2:33 ` [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory Liu Jian
  2022-08-15  2:33 ` [PATCH bpf-next 2/2] selftests/bpf: Add wait send memory test for sockmap redirect Liu Jian
  0 siblings, 2 replies; 13+ messages in thread
From: Liu Jian @ 2022-08-15  2:33 UTC (permalink / raw)
  To: john.fastabend, jakub, edumazet, davem, yoshfuji, dsahern, kuba,
	pabeni, andrii, mykolal, ast, daniel, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, shuah, bpf
  Cc: liujian56

Keep reference on socket file while wait send memory

Liu Jian (2):
  sk_msg: Keep reference on socket file while wait_memory
  selftests/bpf: Add wait send memory test for sockmap redirect

 net/ipv4/tcp_bpf.c                         |  8 +++++
 tools/testing/selftests/bpf/test_sockmap.c | 42 ++++++++++++++++++++++
 2 files changed, 50 insertions(+)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-15  2:33 [PATCH bpf-next 0/2] Keep reference on socket file while wait send memory Liu Jian
@ 2022-08-15  2:33 ` Liu Jian
  2022-08-17  0:54   ` John Fastabend
  2022-08-19  8:39   ` Jakub Sitnicki
  2022-08-15  2:33 ` [PATCH bpf-next 2/2] selftests/bpf: Add wait send memory test for sockmap redirect Liu Jian
  1 sibling, 2 replies; 13+ messages in thread
From: Liu Jian @ 2022-08-15  2:33 UTC (permalink / raw)
  To: john.fastabend, jakub, edumazet, davem, yoshfuji, dsahern, kuba,
	pabeni, andrii, mykolal, ast, daniel, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, shuah, bpf
  Cc: liujian56

Fix the below NULL pointer dereference:

[   14.471200] Call Trace:
[   14.471562]  <TASK>
[   14.471882]  lock_acquire+0x245/0x2e0
[   14.472416]  ? remove_wait_queue+0x12/0x50
[   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
[   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
[   14.474318]  ? remove_wait_queue+0x12/0x50
[   14.474907]  remove_wait_queue+0x12/0x50
[   14.475480]  sk_stream_wait_memory+0x20d/0x340
[   14.476127]  ? do_wait_intr_irq+0x80/0x80
[   14.476704]  do_tcp_sendpages+0x287/0x600
[   14.477283]  tcp_bpf_push+0xab/0x260
[   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
[   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
[   14.479096]  tcp_bpf_send_verdict+0x105/0x470
[   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
[   14.480311]  sock_sendmsg+0x2d/0x40
[   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
[   14.481390]  ? copy_msghdr_from_user+0x62/0x80
[   14.482048]  ___sys_sendmsg+0x78/0xb0
[   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
[   14.483215]  ? __do_fault+0x2a/0x1a0
[   14.483738]  ? do_fault+0x15e/0x5d0
[   14.484246]  ? __handle_mm_fault+0x56b/0x1040
[   14.484874]  ? lock_is_held_type+0xdf/0x130
[   14.485474]  ? find_held_lock+0x2d/0x90
[   14.486046]  ? __sys_sendmsg+0x41/0x70
[   14.486587]  __sys_sendmsg+0x41/0x70
[   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
[   14.487822]  do_syscall_64+0x34/0x80
[   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

The test scenario is as follows:
thread1                               thread2
-----------                           ---------------
 tcp_bpf_sendmsg
  tcp_bpf_send_verdict
   tcp_bpf_sendmsg_redir              sock_close
    tcp_bpf_push_locked                 __sock_release
     tcp_bpf_push                         //inet_release
      do_tcp_sendpages                    sock->ops->release
       sk_stream_wait_memory          	   // tcp_close
          sk_wait_event                      sk->sk_prot->close
           release_sock(__sk);
            ***

                                                lock_sock(sk);
                                                  __tcp_close
                                                    sock_orphan(sk)
                                                      sk->sk_wq  = NULL
                                                release_sock
            ****
           lock_sock(__sk);
          remove_wait_queue(sk_sleep(sk), &wait);
             sk_sleep(sk)
             //NULL pointer dereference
             &rcu_dereference_raw(sk->sk_wq)->wait

While thread1 is waiting for memory, the socket is released together with
its wait queue because thread2 has closed it. This happens because
tcp_bpf_send_verdict did not increase the f_count of
psock->sk_redir->sk_socket->file in thread1.

Avoid this by keeping a reference to the socket file while the redirect
socket waits for send memory. Refer to [1].

[1] https://lore.kernel.org/netdev/20190211090949.18560-1-jakub@cloudflare.com/

Signed-off-by: Liu Jian <liujian56@huawei.com>
---
 net/ipv4/tcp_bpf.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index a1626afe87a1..201375829367 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -125,9 +125,17 @@ static int tcp_bpf_push_locked(struct sock *sk, struct sk_msg *msg,
 {
 	int ret;
 
+	/* Hold on to socket wait queue. */
+	if (sk->sk_socket && sk->sk_socket->file)
+		get_file(sk->sk_socket->file);
+
 	lock_sock(sk);
 	ret = tcp_bpf_push(sk, msg, apply_bytes, flags, uncharge);
 	release_sock(sk);
+
+	if (sk->sk_socket && sk->sk_socket->file)
+		fput(sk->sk_socket->file);
+
 	return ret;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 2/2] selftests/bpf: Add wait send memory test for sockmap redirect
  2022-08-15  2:33 [PATCH bpf-next 0/2] Keep reference on socket file while wait send memory Liu Jian
  2022-08-15  2:33 ` [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory Liu Jian
@ 2022-08-15  2:33 ` Liu Jian
  1 sibling, 0 replies; 13+ messages in thread
From: Liu Jian @ 2022-08-15  2:33 UTC (permalink / raw)
  To: john.fastabend, jakub, edumazet, davem, yoshfuji, dsahern, kuba,
	pabeni, andrii, mykolal, ast, daniel, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, shuah, bpf
  Cc: liujian56

Add a test for the case where a sockmap redirect socket has to wait for
send memory.

Signed-off-by: Liu Jian <liujian56@huawei.com>
---
 tools/testing/selftests/bpf/test_sockmap.c | 42 ++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 0fbaccdc8861..95b9b45ad028 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -138,6 +138,7 @@ struct sockmap_options {
 	bool data_test;
 	bool drop_expected;
 	bool check_recved_len;
+	bool tx_wait_mem;
 	int iov_count;
 	int iov_length;
 	int rate;
@@ -578,6 +579,10 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 			sent = sendmsg(fd, &msg, flags);
 
 			if (!drop && sent < 0) {
+				if (opt->tx_wait_mem && errno == EACCES) {
+					errno = 0;
+					goto out_errno;
+				}
 				perror("sendmsg loop error");
 				goto out_errno;
 			} else if (drop && sent >= 0) {
@@ -644,6 +649,15 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt,
 				goto out_errno;
 			}
 
+			if (opt->tx_wait_mem) {
+				FD_ZERO(&w);
+				FD_SET(fd, &w);
+				slct = select(max_fd + 1, NULL, NULL, &w, &timeout);
+				errno = 0;
+				close(fd);
+				goto out_errno;
+			}
+
 			errno = 0;
 			if (peek_flag) {
 				flags |= MSG_PEEK;
@@ -752,6 +766,22 @@ static int sendmsg_test(struct sockmap_options *opt)
 			return err;
 	}
 
+	if (opt->tx_wait_mem) {
+		struct timeval timeout;
+		int rxtx_buf_len = 1024;
+
+		timeout.tv_sec = 3;
+		timeout.tv_usec = 0;
+
+		err = setsockopt(c2, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(struct timeval));
+		err |= setsockopt(c2, SOL_SOCKET, SO_SNDBUFFORCE, &rxtx_buf_len, sizeof(int));
+		err |= setsockopt(p2, SOL_SOCKET, SO_RCVBUFFORCE, &rxtx_buf_len, sizeof(int));
+		if (err) {
+			perror("setsockopt failed()");
+			return errno;
+		}
+	}
+
 	rxpid = fork();
 	if (rxpid == 0) {
 		if (txmsg_pop || txmsg_start_pop)
@@ -788,6 +818,9 @@ static int sendmsg_test(struct sockmap_options *opt)
 		return errno;
 	}
 
+	if (opt->tx_wait_mem)
+		close(c2);
+
 	txpid = fork();
 	if (txpid == 0) {
 		if (opt->sendpage)
@@ -1452,6 +1485,14 @@ static void test_txmsg_redir(int cgrp, struct sockmap_options *opt)
 	test_send(opt, cgrp);
 }
 
+static void test_txmsg_redir_wait_sndmem(int cgrp, struct sockmap_options *opt)
+{
+	txmsg_redir = 1;
+	opt->tx_wait_mem = true;
+	test_send_large(opt, cgrp);
+	opt->tx_wait_mem = false;
+}
+
 static void test_txmsg_drop(int cgrp, struct sockmap_options *opt)
 {
 	txmsg_drop = 1;
@@ -1800,6 +1841,7 @@ static int populate_progs(char *bpf_file)
 struct _test test[] = {
 	{"txmsg test passthrough", test_txmsg_pass},
 	{"txmsg test redirect", test_txmsg_redir},
+	{"txmsg test redirect wait send mem", test_txmsg_redir_wait_sndmem},
 	{"txmsg test drop", test_txmsg_drop},
 	{"txmsg test ingress redirect", test_txmsg_ingress_redir},
 	{"txmsg test skb", test_txmsg_skb},
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-15  2:33 ` [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory Liu Jian
@ 2022-08-17  0:54   ` John Fastabend
  2022-08-17  2:11     ` liujian (CE)
  2022-08-19  8:39   ` Jakub Sitnicki
  1 sibling, 1 reply; 13+ messages in thread
From: John Fastabend @ 2022-08-17  0:54 UTC (permalink / raw)
  To: Liu Jian, john.fastabend, jakub, edumazet, davem, yoshfuji,
	dsahern, kuba, pabeni, andrii, mykolal, ast, daniel, martin.lau,
	song, yhs, kpsingh, sdf, haoluo, jolsa, shuah, bpf
  Cc: liujian56

Liu Jian wrote:
> Fix the below NULL pointer dereference:
> 
> [   14.471200] Call Trace:
> [   14.471562]  <TASK>
> [   14.471882]  lock_acquire+0x245/0x2e0
> [   14.472416]  ? remove_wait_queue+0x12/0x50
> [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
> [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
> [   14.474318]  ? remove_wait_queue+0x12/0x50
> [   14.474907]  remove_wait_queue+0x12/0x50
> [   14.475480]  sk_stream_wait_memory+0x20d/0x340
> [   14.476127]  ? do_wait_intr_irq+0x80/0x80
> [   14.476704]  do_tcp_sendpages+0x287/0x600
> [   14.477283]  tcp_bpf_push+0xab/0x260
> [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
> [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
> [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
> [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
> [   14.480311]  sock_sendmsg+0x2d/0x40
> [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
> [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
> [   14.482048]  ___sys_sendmsg+0x78/0xb0
> [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
> [   14.483215]  ? __do_fault+0x2a/0x1a0
> [   14.483738]  ? do_fault+0x15e/0x5d0
> [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
> [   14.484874]  ? lock_is_held_type+0xdf/0x130
> [   14.485474]  ? find_held_lock+0x2d/0x90
> [   14.486046]  ? __sys_sendmsg+0x41/0x70
> [   14.486587]  __sys_sendmsg+0x41/0x70
> [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
> [   14.487822]  do_syscall_64+0x34/0x80
> [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
> The test scenario is as follows:
> thread1                               thread2
> -----------                           ---------------
>  tcp_bpf_sendmsg
>   tcp_bpf_send_verdict
>    tcp_bpf_sendmsg_redir              sock_close
>     tcp_bpf_push_locked                 __sock_release
>      tcp_bpf_push                         //inet_release
>       do_tcp_sendpages                    sock->ops->release
>        sk_stream_wait_memory          	   // tcp_close
>           sk_wait_event                      sk->sk_prot->close
>            release_sock(__sk);
>             ***
> 
>                                                 lock_sock(sk);
>                                                   __tcp_close
>                                                     sock_orphan(sk)
>                                                       sk->sk_wq  = NULL
>                                                 release_sock
>             ****
>            lock_sock(__sk);
>           remove_wait_queue(sk_sleep(sk), &wait);
>              sk_sleep(sk)
>              //NULL pointer dereference
>              &rcu_dereference_raw(sk->sk_wq)->wait
> 
> While thread1 is waiting for memory, the socket is released together with
> its wait queue because thread2 has closed it. This happens because
> tcp_bpf_send_verdict did not increase the f_count of
> psock->sk_redir->sk_socket->file in thread1.
> 
> Avoid this by keeping a reference to the socket file while the redirect
> socket waits for send memory. Refer to [1].
> 
> [1] https://lore.kernel.org/netdev/20190211090949.18560-1-jakub@cloudflare.com/
> 
> Signed-off-by: Liu Jian <liujian56@huawei.com>

Thanks for the detailed commit message; it's necessary to understand
the problem without spending hours deciphering it myself.

When I looked at [1], we solved a similar problem by using
the MSG_DONTWAIT flag so that the error was pushed back to
the sender.

Can we do the same thing here? The nice bit is that the error
would get all the way back to the sending socket, so userspace
could decide how to handle it. Did I miss something?
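
Roughly, I was imagining something like the sketch below (untested; it
assumes the flags argument keeps propagating from tcp_bpf_push_locked
down to do_tcp_sendpages, so the redirect socket never sleeps waiting
for send memory):

static int tcp_bpf_push_locked(struct sock *sk, struct sk_msg *msg,
			       u32 apply_bytes, int flags, bool uncharge)
{
	int ret;

	lock_sock(sk);
	/* Never sleep on the redirect socket's send memory; push the
	 * error back to the sender instead of blocking here.
	 */
	ret = tcp_bpf_push(sk, msg, apply_bytes, flags | MSG_DONTWAIT,
			   uncharge);
	release_sock(sk);
	return ret;
}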

> ---
>  net/ipv4/tcp_bpf.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
> index a1626afe87a1..201375829367 100644
> --- a/net/ipv4/tcp_bpf.c
> +++ b/net/ipv4/tcp_bpf.c
> @@ -125,9 +125,17 @@ static int tcp_bpf_push_locked(struct sock *sk, struct sk_msg *msg,
>  {
>  	int ret;
>  
> +	/* Hold on to socket wait queue. */
> +	if (sk->sk_socket && sk->sk_socket->file)
> +		get_file(sk->sk_socket->file);
> +
>  	lock_sock(sk);
>  	ret = tcp_bpf_push(sk, msg, apply_bytes, flags, uncharge);
>  	release_sock(sk);
> +
> +	if (sk->sk_socket && sk->sk_socket->file)
> +		fput(sk->sk_socket->file);
> +
>  	return ret;
>  }
>  
> -- 
> 2.17.1
> 



^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-17  0:54   ` John Fastabend
@ 2022-08-17  2:11     ` liujian (CE)
  2022-08-17 18:33       ` John Fastabend
  0 siblings, 1 reply; 13+ messages in thread
From: liujian (CE) @ 2022-08-17  2:11 UTC (permalink / raw)
  To: John Fastabend, jakub, edumazet, davem, yoshfuji, dsahern, kuba,
	pabeni, andrii, mykolal, ast, daniel, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, shuah, bpf



> -----Original Message-----
> From: John Fastabend [mailto:john.fastabend@gmail.com]
> Sent: Wednesday, August 17, 2022 8:55 AM
> To: liujian (CE) <liujian56@huawei.com>; john.fastabend@gmail.com;
> jakub@cloudflare.com; edumazet@google.com; davem@davemloft.net;
> yoshfuji@linux-ipv6.org; dsahern@kernel.org; kuba@kernel.org;
> pabeni@redhat.com; andrii@kernel.org; mykolal@fb.com; ast@kernel.org;
> daniel@iogearbox.net; martin.lau@linux.dev; song@kernel.org; yhs@fb.com;
> kpsingh@kernel.org; sdf@google.com; haoluo@google.com;
> jolsa@kernel.org; shuah@kernel.org; bpf@vger.kernel.org
> Cc: liujian (CE) <liujian56@huawei.com>
> Subject: RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file
> while wait_memory
> 
> Liu Jian wrote:
> > Fix the below NULL pointer dereference:
> >
> > [   14.471200] Call Trace:
> > [   14.471562]  <TASK>
> > [   14.471882]  lock_acquire+0x245/0x2e0
> > [   14.472416]  ? remove_wait_queue+0x12/0x50
> > [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
> > [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
> > [   14.474318]  ? remove_wait_queue+0x12/0x50
> > [   14.474907]  remove_wait_queue+0x12/0x50
> > [   14.475480]  sk_stream_wait_memory+0x20d/0x340
> > [   14.476127]  ? do_wait_intr_irq+0x80/0x80
> > [   14.476704]  do_tcp_sendpages+0x287/0x600
> > [   14.477283]  tcp_bpf_push+0xab/0x260
> > [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
> > [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
> > [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
> > [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
> > [   14.480311]  sock_sendmsg+0x2d/0x40
> > [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
> > [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
> > [   14.482048]  ___sys_sendmsg+0x78/0xb0
> > [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
> > [   14.483215]  ? __do_fault+0x2a/0x1a0
> > [   14.483738]  ? do_fault+0x15e/0x5d0
> > [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
> > [   14.484874]  ? lock_is_held_type+0xdf/0x130
> > [   14.485474]  ? find_held_lock+0x2d/0x90
> > [   14.486046]  ? __sys_sendmsg+0x41/0x70
> > [   14.486587]  __sys_sendmsg+0x41/0x70
> > [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
> > [   14.487822]  do_syscall_64+0x34/0x80
> > [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > The test scenario is as follows:
> > thread1                               thread2
> > -----------                           ---------------
> >  tcp_bpf_sendmsg
> >   tcp_bpf_send_verdict
> >    tcp_bpf_sendmsg_redir              sock_close
> >     tcp_bpf_push_locked                 __sock_release
> >      tcp_bpf_push                         //inet_release
> >       do_tcp_sendpages                    sock->ops->release
> >        sk_stream_wait_memory          	   // tcp_close
> >           sk_wait_event                      sk->sk_prot->close
> >            release_sock(__sk);
> >             ***
> >
> >                                                 lock_sock(sk);
> >                                                   __tcp_close
> >                                                     sock_orphan(sk)
> >                                                       sk->sk_wq  = NULL
> >                                                 release_sock
> >             ****
> >            lock_sock(__sk);
> >           remove_wait_queue(sk_sleep(sk), &wait);
> >              sk_sleep(sk)
> >              //NULL pointer dereference
> >              &rcu_dereference_raw(sk->sk_wq)->wait
> >
> > While thread1 is waiting for memory, the socket is released together
> > with its wait queue because thread2 has closed it. This happens because
> > tcp_bpf_send_verdict did not increase the f_count of
> > psock->sk_redir->sk_socket->file in thread1.
> >
> > Avoid this by keeping a reference to the socket file while the redirect
> > socket waits for send memory. Refer to [1].
> >
> > [1]
> > https://lore.kernel.org/netdev/20190211090949.18560-1-jakub@cloudflare
> > .com/
> >
> > Signed-off-by: Liu Jian <liujian56@huawei.com>
> 
> Thanks for the detailed commit message; it's necessary to understand the
> problem without spending hours deciphering it myself.
> 
> When I looked at [1], we solved a similar problem by using the
> MSG_DONTWAIT flag so that the error was pushed back to the sender.
> 
> Can we do the same thing here? The nice bit is that the error would get
> all the way back to the sending socket, so userspace could decide how to
> handle it. Did I miss something?
> 
[1] works in the sk_psock_backlog function, where the failure cannot be
observed by the userspace app. But here, the problem is that the app wants
this to be a blocking system call. If the MSG_DONTWAIT flag is forcibly
added, this changes the behavior of the call.
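
For example (a purely hypothetical userspace snippet), an app written
against the blocking semantics may have no EAGAIN handling at all:

#include <sys/socket.h>

/* The app expects sendmsg() on a blocking socket to wait until send
 * memory is available. If the kernel forced MSG_DONTWAIT internally,
 * this call could start failing with errno == EAGAIN, which the app
 * was never written to handle.
 */
static ssize_t send_one(int fd, const struct msghdr *msg)
{
	return sendmsg(fd, msg, 0);	/* may block; EAGAIN unexpected */
}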

> > ---
> >  net/ipv4/tcp_bpf.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index
> > a1626afe87a1..201375829367 100644
> > --- a/net/ipv4/tcp_bpf.c
> > +++ b/net/ipv4/tcp_bpf.c
> > @@ -125,9 +125,17 @@ static int tcp_bpf_push_locked(struct sock *sk,
> > struct sk_msg *msg,  {
> >  	int ret;
> >
> > +	/* Hold on to socket wait queue. */
> > +	if (sk->sk_socket && sk->sk_socket->file)
> > +		get_file(sk->sk_socket->file);
> > +
> >  	lock_sock(sk);
> >  	ret = tcp_bpf_push(sk, msg, apply_bytes, flags, uncharge);
> >  	release_sock(sk);
> > +
> > +	if (sk->sk_socket && sk->sk_socket->file)
> > +		fput(sk->sk_socket->file);
> > +
> >  	return ret;
> >  }
> >
> > --
> > 2.17.1
> >
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-17  2:11     ` liujian (CE)
@ 2022-08-17 18:33       ` John Fastabend
  0 siblings, 0 replies; 13+ messages in thread
From: John Fastabend @ 2022-08-17 18:33 UTC (permalink / raw)
  To: liujian (CE),
	John Fastabend, jakub, edumazet, davem, yoshfuji, dsahern, kuba,
	pabeni, andrii, mykolal, ast, daniel, martin.lau, song, yhs,
	kpsingh, sdf, haoluo, jolsa, shuah, bpf

liujian (CE) wrote:
> 
> 
> > -----Original Message-----
> > From: John Fastabend [mailto:john.fastabend@gmail.com]
> > Sent: Wednesday, August 17, 2022 8:55 AM
> > To: liujian (CE) <liujian56@huawei.com>; john.fastabend@gmail.com;
> > jakub@cloudflare.com; edumazet@google.com; davem@davemloft.net;
> > yoshfuji@linux-ipv6.org; dsahern@kernel.org; kuba@kernel.org;
> > pabeni@redhat.com; andrii@kernel.org; mykolal@fb.com; ast@kernel.org;
> > daniel@iogearbox.net; martin.lau@linux.dev; song@kernel.org; yhs@fb.com;
> > kpsingh@kernel.org; sdf@google.com; haoluo@google.com;
> > jolsa@kernel.org; shuah@kernel.org; bpf@vger.kernel.org
> > Cc: liujian (CE) <liujian56@huawei.com>
> > Subject: RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file
> > while wait_memory
> > 
> > Liu Jian wrote:
> > > Fix the below NULL pointer dereference:
> > >
> > > [   14.471200] Call Trace:
> > > [   14.471562]  <TASK>
> > > [   14.471882]  lock_acquire+0x245/0x2e0
> > > [   14.472416]  ? remove_wait_queue+0x12/0x50
> > > [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
> > > [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
> > > [   14.474318]  ? remove_wait_queue+0x12/0x50
> > > [   14.474907]  remove_wait_queue+0x12/0x50
> > > [   14.475480]  sk_stream_wait_memory+0x20d/0x340
> > > [   14.476127]  ? do_wait_intr_irq+0x80/0x80
> > > [   14.476704]  do_tcp_sendpages+0x287/0x600
> > > [   14.477283]  tcp_bpf_push+0xab/0x260
> > > [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
> > > [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
> > > [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
> > > [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
> > > [   14.480311]  sock_sendmsg+0x2d/0x40
> > > [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
> > > [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
> > > [   14.482048]  ___sys_sendmsg+0x78/0xb0
> > > [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
> > > [   14.483215]  ? __do_fault+0x2a/0x1a0
> > > [   14.483738]  ? do_fault+0x15e/0x5d0
> > > [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
> > > [   14.484874]  ? lock_is_held_type+0xdf/0x130
> > > [   14.485474]  ? find_held_lock+0x2d/0x90
> > > [   14.486046]  ? __sys_sendmsg+0x41/0x70
> > > [   14.486587]  __sys_sendmsg+0x41/0x70
> > > [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
> > > [   14.487822]  do_syscall_64+0x34/0x80
> > > [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > >
> > > The test scenario is as follows:
> > > thread1                               thread2
> > > -----------                           ---------------
> > >  tcp_bpf_sendmsg
> > >   tcp_bpf_send_verdict
> > >    tcp_bpf_sendmsg_redir              sock_close
> > >     tcp_bpf_push_locked                 __sock_release
> > >      tcp_bpf_push                         //inet_release
> > >       do_tcp_sendpages                    sock->ops->release
> > >        sk_stream_wait_memory          	   // tcp_close
> > >           sk_wait_event                      sk->sk_prot->close
> > >            release_sock(__sk);
> > >             ***
> > >
> > >                                                 lock_sock(sk);
> > >                                                   __tcp_close
> > >                                                     sock_orphan(sk)
> > >                                                       sk->sk_wq  = NULL
> > >                                                 release_sock
> > >             ****
> > >            lock_sock(__sk);
> > >           remove_wait_queue(sk_sleep(sk), &wait);
> > >              sk_sleep(sk)
> > >              //NULL pointer dereference
> > >              &rcu_dereference_raw(sk->sk_wq)->wait
> > >
> > > While thread1 is waiting for memory, the socket is released together
> > > with its wait queue because thread2 has closed it. This happens
> > > because tcp_bpf_send_verdict did not increase the f_count of
> > > psock->sk_redir->sk_socket->file in thread1.
> > >
> > > Avoid this by keeping a reference to the socket file while the
> > > redirect socket waits for send memory. Refer to [1].
> > >
> > > [1]
> > > https://lore.kernel.org/netdev/20190211090949.18560-1-jakub@cloudflare
> > > .com/
> > >
> > > Signed-off-by: Liu Jian <liujian56@huawei.com>
> > 
> > Thanks for the detailed commit message; it's necessary to understand the
> > problem without spending hours deciphering it myself.
> > 
> > When I looked at [1], we solved a similar problem by using the
> > MSG_DONTWAIT flag so that the error was pushed back to the sender.
> > 
> > Can we do the same thing here? The nice bit is that the error would get
> > all the way back to the sending socket, so userspace could decide how
> > to handle it. Did I miss something?
> > 
> [1] works in the sk_psock_backlog function, where the failure cannot be
> observed by the userspace app. But here, the problem is that the app
> wants this to be a blocking system call. If the MSG_DONTWAIT flag is
> forcibly added, this changes the behavior of the call.
> 

Ah right. As another option, we could push it to the sk_psock_backlog,
similar to what sk_psock_verdict_apply() does with
sk_psock_skb_redirect(). The problem is that we don't have an skb here,
and instead of the stream wait logic we would be using the backlog logic,
which might create some subtle change. Seems a bit intrusive to me.

I don't have any better ideas off-hand, even though reaching into the
file as the patch above does is not ideal.

Maybe Jakub has some thoughts?

Thanks!
John

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-15  2:33 ` [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory Liu Jian
  2022-08-17  0:54   ` John Fastabend
@ 2022-08-19  8:39   ` Jakub Sitnicki
  2022-08-19 10:01     ` liujian (CE)
  1 sibling, 1 reply; 13+ messages in thread
From: Jakub Sitnicki @ 2022-08-19  8:39 UTC (permalink / raw)
  To: Liu Jian, john.fastabend, edumazet
  Cc: davem, yoshfuji, dsahern, kuba, pabeni, andrii, mykolal, ast,
	daniel, martin.lau, song, yhs, kpsingh, sdf, haoluo, jolsa,
	shuah, bpf

On Mon, Aug 15, 2022 at 10:33 AM +08, Liu Jian wrote:
> Fix the below NULL pointer dereference:
>
> [   14.471200] Call Trace:
> [   14.471562]  <TASK>
> [   14.471882]  lock_acquire+0x245/0x2e0
> [   14.472416]  ? remove_wait_queue+0x12/0x50
> [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
> [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
> [   14.474318]  ? remove_wait_queue+0x12/0x50
> [   14.474907]  remove_wait_queue+0x12/0x50
> [   14.475480]  sk_stream_wait_memory+0x20d/0x340
> [   14.476127]  ? do_wait_intr_irq+0x80/0x80
> [   14.476704]  do_tcp_sendpages+0x287/0x600
> [   14.477283]  tcp_bpf_push+0xab/0x260
> [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
> [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
> [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
> [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
> [   14.480311]  sock_sendmsg+0x2d/0x40
> [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
> [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
> [   14.482048]  ___sys_sendmsg+0x78/0xb0
> [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
> [   14.483215]  ? __do_fault+0x2a/0x1a0
> [   14.483738]  ? do_fault+0x15e/0x5d0
> [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
> [   14.484874]  ? lock_is_held_type+0xdf/0x130
> [   14.485474]  ? find_held_lock+0x2d/0x90
> [   14.486046]  ? __sys_sendmsg+0x41/0x70
> [   14.486587]  __sys_sendmsg+0x41/0x70
> [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
> [   14.487822]  do_syscall_64+0x34/0x80
> [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> The test scene as following flow:
> thread1                               thread2
> -----------                           ---------------
>  tcp_bpf_sendmsg
>   tcp_bpf_send_verdict
>    tcp_bpf_sendmsg_redir              sock_close
>     tcp_bpf_push_locked                 __sock_release
>      tcp_bpf_push                         //inet_release
>       do_tcp_sendpages                    sock->ops->release
>        sk_stream_wait_memory          	   // tcp_close
>           sk_wait_event                      sk->sk_prot->close
>            release_sock(__sk);
>             ***
>
>                                                 lock_sock(sk);
>                                                   __tcp_close
>                                                     sock_orphan(sk)
>                                                       sk->sk_wq  = NULL
>                                                 release_sock
>             ****
>            lock_sock(__sk);
>           remove_wait_queue(sk_sleep(sk), &wait);
>              sk_sleep(sk)
>              //NULL pointer dereference
>              &rcu_dereference_raw(sk->sk_wq)->wait
>
> While waiting for memory in thread1, the socket is released with its wait
> queue because thread2 has closed it. This caused by tcp_bpf_send_verdict
> didn't increase the f_count of psock->sk_redir->sk_socket->file in thread1.

I'm not sure about this approach. Keeping a closed sock file alive, just
so we can wake up from sleep, seems like wasted effort.

__tcp_close sets sk->sk_shutdown = RCV_SHUTDOWN | SEND_SHUTDOWN. So we
will return from sk_stream_wait_memory via the do_error path.

SEND_SHUTDOWN might be set because socket got closed and orphaned - dead
and detached from its file, like in this case.

So, IMHO, we should check if SOCK_DEAD flag is set on wakeup due to
SEND_SHUTDOWN in sk_stream_wait_memory, before accessing the wait queue.
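
For reference, sock_orphan() sets SOCK_DEAD under the same callback lock
that clears the wait queue pointer (paraphrasing include/net/sock.h), so
the flag reliably signals that sk_wq is gone:

static inline void sock_orphan(struct sock *sk)
{
	write_lock_bh(&sk->sk_callback_lock);
	sock_set_flag(sk, SOCK_DEAD);
	sk_set_socket(sk, NULL);
	sk->sk_wq = NULL;
	write_unlock_bh(&sk->sk_callback_lock);
}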

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-19  8:39   ` Jakub Sitnicki
@ 2022-08-19 10:01     ` liujian (CE)
  2022-08-19 10:34       ` Jakub Sitnicki
  0 siblings, 1 reply; 13+ messages in thread
From: liujian (CE) @ 2022-08-19 10:01 UTC (permalink / raw)
  To: Jakub Sitnicki, john.fastabend, edumazet
  Cc: davem, yoshfuji, dsahern, kuba, pabeni, andrii, mykolal, ast,
	daniel, martin.lau, song, yhs, kpsingh, sdf, haoluo, jolsa,
	shuah, bpf



> -----Original Message-----
> From: Jakub Sitnicki [mailto:jakub@cloudflare.com]
> Sent: Friday, August 19, 2022 4:39 PM
> To: liujian (CE) <liujian56@huawei.com>; john.fastabend@gmail.com;
> edumazet@google.com
> Cc: davem@davemloft.net; yoshfuji@linux-ipv6.org; dsahern@kernel.org;
> kuba@kernel.org; pabeni@redhat.com; andrii@kernel.org; mykolal@fb.com;
> ast@kernel.org; daniel@iogearbox.net; martin.lau@linux.dev;
> song@kernel.org; yhs@fb.com; kpsingh@kernel.org; sdf@google.com;
> haoluo@google.com; jolsa@kernel.org; shuah@kernel.org;
> bpf@vger.kernel.org
> Subject: Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file
> while wait_memory
> 
> On Mon, Aug 15, 2022 at 10:33 AM +08, Liu Jian wrote:
> > Fix the below NULL pointer dereference:
> >
> > [   14.471200] Call Trace:
> > [   14.471562]  <TASK>
> > [   14.471882]  lock_acquire+0x245/0x2e0
> > [   14.472416]  ? remove_wait_queue+0x12/0x50
> > [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
> > [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
> > [   14.474318]  ? remove_wait_queue+0x12/0x50
> > [   14.474907]  remove_wait_queue+0x12/0x50
> > [   14.475480]  sk_stream_wait_memory+0x20d/0x340
> > [   14.476127]  ? do_wait_intr_irq+0x80/0x80
> > [   14.476704]  do_tcp_sendpages+0x287/0x600
> > [   14.477283]  tcp_bpf_push+0xab/0x260
> > [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
> > [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
> > [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
> > [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
> > [   14.480311]  sock_sendmsg+0x2d/0x40
> > [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
> > [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
> > [   14.482048]  ___sys_sendmsg+0x78/0xb0
> > [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
> > [   14.483215]  ? __do_fault+0x2a/0x1a0
> > [   14.483738]  ? do_fault+0x15e/0x5d0
> > [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
> > [   14.484874]  ? lock_is_held_type+0xdf/0x130
> > [   14.485474]  ? find_held_lock+0x2d/0x90
> > [   14.486046]  ? __sys_sendmsg+0x41/0x70
> > [   14.486587]  __sys_sendmsg+0x41/0x70
> > [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
> > [   14.487822]  do_syscall_64+0x34/0x80
> > [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >
> > The test scenario is as follows:
> > thread1                               thread2
> > -----------                           ---------------
> >  tcp_bpf_sendmsg
> >   tcp_bpf_send_verdict
> >    tcp_bpf_sendmsg_redir              sock_close
> >     tcp_bpf_push_locked                 __sock_release
> >      tcp_bpf_push                         //inet_release
> >       do_tcp_sendpages                    sock->ops->release
> >        sk_stream_wait_memory          	   // tcp_close
> >           sk_wait_event                      sk->sk_prot->close
> >            release_sock(__sk);
> >             ***
> >
> >                                                 lock_sock(sk);
> >                                                   __tcp_close
> >                                                     sock_orphan(sk)
> >                                                       sk->sk_wq  = NULL
> >                                                 release_sock
> >             ****
> >            lock_sock(__sk);
> >           remove_wait_queue(sk_sleep(sk), &wait);
> >              sk_sleep(sk)
> >              //NULL pointer dereference
> >              &rcu_dereference_raw(sk->sk_wq)->wait
> >
> > While thread1 is waiting for memory, the socket is released together
> > with its wait queue because thread2 has closed it. This happens because
> > tcp_bpf_send_verdict did not increase the f_count of
> > psock->sk_redir->sk_socket->file in thread1.
> 
> I'm not sure about this approach. Keeping a closed sock file alive, just so we
> can wake up from sleep, seems like wasted effort.
> 
> __tcp_close sets sk->sk_shutdown = RCV_SHUTDOWN | SEND_SHUTDOWN.
> So we will return from sk_stream_wait_memory via the do_error path.
> 
> SEND_SHUTDOWN might be set because socket got closed and orphaned -
> dead and detached from its file, like in this case.
> 
> So, IMHO, we should check if SOCK_DEAD flag is set on wakeup due to
> SEND_SHUTDOWN in sk_stream_wait_memory, before accessing the wait
> queue.
> 
> [...]
With Jakub's approach, this problem can be solved.

diff --git a/include/net/sock.h b/include/net/sock.h
index a7273b289188..a3dab7140f1e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1998,6 +1998,8 @@ static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 static inline wait_queue_head_t *sk_sleep(struct sock *sk)
 {
        BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
+       if (sock_flag(sk, SOCK_DEAD))
+               return NULL;
        return &rcu_dereference_raw(sk->sk_wq)->wait;
 }
 /* Detach socket from process context.
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 9860bb9a847c..da1be17d0b19 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -51,6 +51,8 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
 {
        unsigned long flags;
 
+       if (wq_head == NULL)
+               return;
        spin_lock_irqsave(&wq_head->lock, flags);
        __remove_wait_queue(wq_head, wq_entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-19 10:01     ` liujian (CE)
@ 2022-08-19 10:34       ` Jakub Sitnicki
  2022-08-20  3:01         ` liujian (CE)
  2022-08-22 15:52         ` Eric Dumazet
  0 siblings, 2 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2022-08-19 10:34 UTC (permalink / raw)
  To: liujian (CE)
  Cc: john.fastabend, edumazet, davem, yoshfuji, dsahern, kuba, pabeni,
	andrii, mykolal, ast, daniel, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, shuah, bpf


On Fri, Aug 19, 2022 at 10:01 AM GMT, liujian (CE) wrote:
>> -----Original Message-----
>> From: Jakub Sitnicki [mailto:jakub@cloudflare.com]
>> Sent: Friday, August 19, 2022 4:39 PM
>> To: liujian (CE) <liujian56@huawei.com>; john.fastabend@gmail.com;
>> edumazet@google.com
>> Cc: davem@davemloft.net; yoshfuji@linux-ipv6.org; dsahern@kernel.org;
>> kuba@kernel.org; pabeni@redhat.com; andrii@kernel.org; mykolal@fb.com;
>> ast@kernel.org; daniel@iogearbox.net; martin.lau@linux.dev;
>> song@kernel.org; yhs@fb.com; kpsingh@kernel.org; sdf@google.com;
>> haoluo@google.com; jolsa@kernel.org; shuah@kernel.org;
>> bpf@vger.kernel.org
>> Subject: Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file
>> while wait_memory
>> 
>> On Mon, Aug 15, 2022 at 10:33 AM +08, Liu Jian wrote:
>> > Fix the below NULL pointer dereference:
>> >
>> > [   14.471200] Call Trace:
>> > [   14.471562]  <TASK>
>> > [   14.471882]  lock_acquire+0x245/0x2e0
>> > [   14.472416]  ? remove_wait_queue+0x12/0x50
>> > [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
>> > [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
>> > [   14.474318]  ? remove_wait_queue+0x12/0x50
>> > [   14.474907]  remove_wait_queue+0x12/0x50
>> > [   14.475480]  sk_stream_wait_memory+0x20d/0x340
>> > [   14.476127]  ? do_wait_intr_irq+0x80/0x80
>> > [   14.476704]  do_tcp_sendpages+0x287/0x600
>> > [   14.477283]  tcp_bpf_push+0xab/0x260
>> > [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
>> > [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
>> > [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
>> > [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
>> > [   14.480311]  sock_sendmsg+0x2d/0x40
>> > [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
>> > [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
>> > [   14.482048]  ___sys_sendmsg+0x78/0xb0
>> > [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
>> > [   14.483215]  ? __do_fault+0x2a/0x1a0
>> > [   14.483738]  ? do_fault+0x15e/0x5d0
>> > [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
>> > [   14.484874]  ? lock_is_held_type+0xdf/0x130
>> > [   14.485474]  ? find_held_lock+0x2d/0x90
>> > [   14.486046]  ? __sys_sendmsg+0x41/0x70
>> > [   14.486587]  __sys_sendmsg+0x41/0x70
>> > [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
>> > [   14.487822]  do_syscall_64+0x34/0x80
>> > [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> >
>> > The test scenario is as follows:
>> > thread1                               thread2
>> > -----------                           ---------------
>> >  tcp_bpf_sendmsg
>> >   tcp_bpf_send_verdict
>> >    tcp_bpf_sendmsg_redir              sock_close
>> >     tcp_bpf_push_locked                 __sock_release
>> >      tcp_bpf_push                         //inet_release
>> >       do_tcp_sendpages                    sock->ops->release
>> >        sk_stream_wait_memory          	   // tcp_close
>> >           sk_wait_event                      sk->sk_prot->close
>> >            release_sock(__sk);
>> >             ***
>> >
>> >                                                 lock_sock(sk);
>> >                                                   __tcp_close
>> >                                                     sock_orphan(sk)
>> >                                                       sk->sk_wq  = NULL
>> >                                                 release_sock
>> >             ****
>> >            lock_sock(__sk);
>> >           remove_wait_queue(sk_sleep(sk), &wait);
>> >              sk_sleep(sk)
>> >              //NULL pointer dereference
>> >              &rcu_dereference_raw(sk->sk_wq)->wait
>> >
>> > While thread1 is waiting for memory, the socket is released together
>> > with its wait queue because thread2 has closed it. This happens
>> > because tcp_bpf_send_verdict did not increase the f_count of
>> > psock->sk_redir->sk_socket->file in thread1.
>> 
>> I'm not sure about this approach. Keeping a closed sock file alive, just so we
>> can wake up from sleep, seems like wasted effort.
>> 
>> __tcp_close sets sk->sk_shutdown = RCV_SHUTDOWN | SEND_SHUTDOWN.
>> So we will return from sk_stream_wait_memory via the do_error path.
>> 
>> SEND_SHUTDOWN might be set because socket got closed and orphaned -
>> dead and detached from its file, like in this case.
>> 
>> So, IMHO, we should check if SOCK_DEAD flag is set on wakeup due to
>> SEND_SHUTDOWN in sk_stream_wait_memory, before accessing the wait
>> queue.
>> 
>> [...]
> With Jakub's approach, this problem can be solved.
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index a7273b289188..a3dab7140f1e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1998,6 +1998,8 @@ static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>  static inline wait_queue_head_t *sk_sleep(struct sock *sk)
>  {
>         BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
> +       if (sock_flag(sk, SOCK_DEAD))
> +               return NULL;
>         return &rcu_dereference_raw(sk->sk_wq)->wait;
>  }
>  /* Detach socket from process context.
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 9860bb9a847c..da1be17d0b19 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -51,6 +51,8 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
>  {
>         unsigned long flags;
>  
> +       if (wq_head == NULL)
> +               return;
>         spin_lock_irqsave(&wq_head->lock, flags);
>         __remove_wait_queue(wq_head, wq_entry);
>         spin_unlock_irqrestore(&wq_head->lock, flags);

I don't know if we want to change the contract for sk_sleep() and
remove_wait_queue() so that they accept dead sockets or NULLs.

How about just:

diff --git a/net/core/stream.c b/net/core/stream.c
index ccc083cdef23..1105057ce00a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -159,7 +159,8 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
                *timeo_p = current_timeo;
        }
 out:
-       remove_wait_queue(sk_sleep(sk), &wait);
+       if (!sock_flag(sk, SOCK_DEAD))
+               remove_wait_queue(sk_sleep(sk), &wait);
        return err;

 do_error:
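
If I read sk_stream_wait_memory right, this is safe: when SOCK_DEAD is
set, sock_orphan() has already detached sk_wq, and the waiter still exits
through the do_error path because __tcp_close set SEND_SHUTDOWN.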





^ permalink raw reply related	[flat|nested] 13+ messages in thread

* RE: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-19 10:34       ` Jakub Sitnicki
@ 2022-08-20  3:01         ` liujian (CE)
  2022-08-22 14:32           ` Jakub Sitnicki
  2022-08-22 15:52         ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: liujian (CE) @ 2022-08-20  3:01 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: john.fastabend, edumazet, davem, yoshfuji, dsahern, kuba, pabeni,
	andrii, mykolal, ast, daniel, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, shuah, bpf



> -----Original Message-----
> From: Jakub Sitnicki [mailto:jakub@cloudflare.com]
> Sent: Friday, August 19, 2022 6:35 PM
> To: liujian (CE) <liujian56@huawei.com>
> Cc: john.fastabend@gmail.com; edumazet@google.com;
> davem@davemloft.net; yoshfuji@linux-ipv6.org; dsahern@kernel.org;
> kuba@kernel.org; pabeni@redhat.com; andrii@kernel.org; mykolal@fb.com;
> ast@kernel.org; daniel@iogearbox.net; martin.lau@linux.dev;
> song@kernel.org; yhs@fb.com; kpsingh@kernel.org; sdf@google.com;
> haoluo@google.com; jolsa@kernel.org; shuah@kernel.org;
> bpf@vger.kernel.org
> Subject: Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file
> while wait_memory
> 
> 
> On Fri, Aug 19, 2022 at 10:01 AM GMT, liujian (CE) wrote:
> >> -----Original Message-----
> >> From: Jakub Sitnicki [mailto:jakub@cloudflare.com]
> >> Sent: Friday, August 19, 2022 4:39 PM
> >> To: liujian (CE) <liujian56@huawei.com>; john.fastabend@gmail.com;
> >> edumazet@google.com
> >> Cc: davem@davemloft.net; yoshfuji@linux-ipv6.org; dsahern@kernel.org;
> >> kuba@kernel.org; pabeni@redhat.com; andrii@kernel.org;
> >> mykolal@fb.com; ast@kernel.org; daniel@iogearbox.net;
> >> martin.lau@linux.dev; song@kernel.org; yhs@fb.com;
> >> kpsingh@kernel.org; sdf@google.com; haoluo@google.com;
> >> jolsa@kernel.org; shuah@kernel.org; bpf@vger.kernel.org
> >> Subject: Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket
> >> file while wait_memory
> >>
> >> On Mon, Aug 15, 2022 at 10:33 AM +08, Liu Jian wrote:
> >> > Fix the below NULL pointer dereference:
> >> >
> >> > [   14.471200] Call Trace:
> >> > [   14.471562]  <TASK>
> >> > [   14.471882]  lock_acquire+0x245/0x2e0
> >> > [   14.472416]  ? remove_wait_queue+0x12/0x50
> >> > [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
> >> > [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
> >> > [   14.474318]  ? remove_wait_queue+0x12/0x50
> >> > [   14.474907]  remove_wait_queue+0x12/0x50
> >> > [   14.475480]  sk_stream_wait_memory+0x20d/0x340
> >> > [   14.476127]  ? do_wait_intr_irq+0x80/0x80
> >> > [   14.476704]  do_tcp_sendpages+0x287/0x600
> >> > [   14.477283]  tcp_bpf_push+0xab/0x260
> >> > [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
> >> > [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
> >> > [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
> >> > [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
> >> > [   14.480311]  sock_sendmsg+0x2d/0x40
> >> > [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
> >> > [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
> >> > [   14.482048]  ___sys_sendmsg+0x78/0xb0
> >> > [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
> >> > [   14.483215]  ? __do_fault+0x2a/0x1a0
> >> > [   14.483738]  ? do_fault+0x15e/0x5d0
> >> > [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
> >> > [   14.484874]  ? lock_is_held_type+0xdf/0x130
> >> > [   14.485474]  ? find_held_lock+0x2d/0x90
> >> > [   14.486046]  ? __sys_sendmsg+0x41/0x70
> >> > [   14.486587]  __sys_sendmsg+0x41/0x70
> >> > [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
> >> > [   14.487822]  do_syscall_64+0x34/0x80
> >> > [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> >> >
> >> > The test scenario is as follows:
> >> > thread1                               thread2
> >> > -----------                           ---------------
> >> >  tcp_bpf_sendmsg
> >> >   tcp_bpf_send_verdict
> >> >    tcp_bpf_sendmsg_redir              sock_close
> >> >     tcp_bpf_push_locked                 __sock_release
> >> >      tcp_bpf_push                         //inet_release
> >> >       do_tcp_sendpages                    sock->ops->release
> >> >        sk_stream_wait_memory          	   // tcp_close
> >> >           sk_wait_event                      sk->sk_prot->close
> >> >            release_sock(__sk);
> >> >             ***
> >> >
> >> >                                                 lock_sock(sk);
> >> >                                                   __tcp_close
> >> >                                                     sock_orphan(sk)
> >> >                                                       sk->sk_wq  = NULL
> >> >                                                 release_sock
> >> >             ****
> >> >            lock_sock(__sk);
> >> >           remove_wait_queue(sk_sleep(sk), &wait);
> >> >              sk_sleep(sk)
> >> >              //NULL pointer dereference
> >> >              &rcu_dereference_raw(sk->sk_wq)->wait
> >> >
> >> > While thread1 is waiting for memory, the socket is released
> >> > together with its wait queue because thread2 has closed it. This
> >> > happens because tcp_bpf_send_verdict did not increase the f_count
> >> > of psock->sk_redir->sk_socket->file in thread1.
> >>
> >> I'm not sure about this approach. Keeping a closed sock file alive,
> >> just so we can wake up from sleep, seems like wasted effort.
> >>
> >> __tcp_close sets sk->sk_shutdown = RCV_SHUTDOWN |
> SEND_SHUTDOWN.
> >> So we will return from sk_stream_wait_memory via the do_error path.
> >>
> >> SEND_SHUTDOWN might be set because socket got closed and orphaned
> -
> >> dead and detached from its file, like in this case.
> >>
> >> So, IMHO, we should check if SOCK_DEAD flag is set on wakeup due to
> >> SEND_SHUTDOWN in sk_stream_wait_memory, before accessing the
> wait
> >> queue.
> >>
> >> [...]
> > With Jakub's approach, this problem can be solved.
> >
> > diff --git a/include/net/sock.h b/include/net/sock.h index
> > a7273b289188..a3dab7140f1e 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1998,6 +1998,8 @@ static inline void sk_set_socket(struct sock
> > *sk, struct socket *sock)  static inline wait_queue_head_t
> > *sk_sleep(struct sock *sk)  {
> >         BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
> > +       if (sock_flag(sk, SOCK_DEAD))
> > +               return NULL;
> >         return &rcu_dereference_raw(sk->sk_wq)->wait;
> >  }
> >  /* Detach socket from process context.
> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c index
> > 9860bb9a847c..da1be17d0b19 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -51,6 +51,8 @@ void remove_wait_queue(struct wait_queue_head
> > *wq_head, struct wait_queue_entry  {
> >         unsigned long flags;
> >
> > +       if (wq_head == NULL)
> > +               return;
> >         spin_lock_irqsave(&wq_head->lock, flags);
> >         __remove_wait_queue(wq_head, wq_entry);
> >         spin_unlock_irqrestore(&wq_head->lock, flags);
> 
> I don't know if we want to change the contract for sk_sleep() and
> remove_wait_queue() so that they accept dead sockets or NULLs.
> 
> How about just:

It is all OK with me, thank you. Could you provide a formal patch?

Tested-by: Liu Jian <liujian56@huawei.com>
> 
> diff --git a/net/core/stream.c b/net/core/stream.c index
> ccc083cdef23..1105057ce00a 100644
> --- a/net/core/stream.c
> +++ b/net/core/stream.c
> @@ -159,7 +159,8 @@ int sk_stream_wait_memory(struct sock *sk, long
> *timeo_p)
>                 *timeo_p = current_timeo;
>         }
>  out:
> -       remove_wait_queue(sk_sleep(sk), &wait);
> +       if (!sock_flag(sk, SOCK_DEAD))
> +               remove_wait_queue(sk_sleep(sk), &wait);
>         return err;
> 
>  do_error:
> 
> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-20  3:01         ` liujian (CE)
@ 2022-08-22 14:32           ` Jakub Sitnicki
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2022-08-22 14:32 UTC (permalink / raw)
  To: liujian (CE)
  Cc: john.fastabend, edumazet, davem, yoshfuji, dsahern, kuba, pabeni,
	andrii, mykolal, ast, daniel, martin.lau, song, yhs, kpsingh,
	sdf, haoluo, jolsa, shuah, bpf


On Sat, Aug 20, 2022 at 03:01 AM GMT, liujian (CE) wrote:
>> -----Original Message-----
>> From: Jakub Sitnicki [mailto:jakub@cloudflare.com]
>> Sent: Friday, August 19, 2022 6:35 PM
>> To: liujian (CE) <liujian56@huawei.com>
>> Cc: john.fastabend@gmail.com; edumazet@google.com;
>> davem@davemloft.net; yoshfuji@linux-ipv6.org; dsahern@kernel.org;
>> kuba@kernel.org; pabeni@redhat.com; andrii@kernel.org; mykolal@fb.com;
>> ast@kernel.org; daniel@iogearbox.net; martin.lau@linux.dev;
>> song@kernel.org; yhs@fb.com; kpsingh@kernel.org; sdf@google.com;
>> haoluo@google.com; jolsa@kernel.org; shuah@kernel.org;
>> bpf@vger.kernel.org
>> Subject: Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file
>> while wait_memory
>> 
>> 
>> On Fri, Aug 19, 2022 at 10:01 AM GMT, liujian (CE) wrote:
>> >> -----Original Message-----
>> >> From: Jakub Sitnicki [mailto:jakub@cloudflare.com]
>> >> Sent: Friday, August 19, 2022 4:39 PM
>> >> To: liujian (CE) <liujian56@huawei.com>; john.fastabend@gmail.com;
>> >> edumazet@google.com
>> >> Cc: davem@davemloft.net; yoshfuji@linux-ipv6.org; dsahern@kernel.org;
>> >> kuba@kernel.org; pabeni@redhat.com; andrii@kernel.org;
>> >> mykolal@fb.com; ast@kernel.org; daniel@iogearbox.net;
>> >> martin.lau@linux.dev; song@kernel.org; yhs@fb.com;
>> >> kpsingh@kernel.org; sdf@google.com; haoluo@google.com;
>> >> jolsa@kernel.org; shuah@kernel.org; bpf@vger.kernel.org
>> >> Subject: Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket
>> >> file while wait_memory
>> >>
>> >> On Mon, Aug 15, 2022 at 10:33 AM +08, Liu Jian wrote:
>> >> > Fix the below NULL pointer dereference:
>> >> >
>> >> > [   14.471200] Call Trace:
>> >> > [   14.471562]  <TASK>
>> >> > [   14.471882]  lock_acquire+0x245/0x2e0
>> >> > [   14.472416]  ? remove_wait_queue+0x12/0x50
>> >> > [   14.473014]  ? _raw_spin_lock_irqsave+0x17/0x50
>> >> > [   14.473681]  _raw_spin_lock_irqsave+0x3d/0x50
>> >> > [   14.474318]  ? remove_wait_queue+0x12/0x50
>> >> > [   14.474907]  remove_wait_queue+0x12/0x50
>> >> > [   14.475480]  sk_stream_wait_memory+0x20d/0x340
>> >> > [   14.476127]  ? do_wait_intr_irq+0x80/0x80
>> >> > [   14.476704]  do_tcp_sendpages+0x287/0x600
>> >> > [   14.477283]  tcp_bpf_push+0xab/0x260
>> >> > [   14.477817]  tcp_bpf_sendmsg_redir+0x297/0x500
>> >> > [   14.478461]  ? __local_bh_enable_ip+0x77/0xe0
>> >> > [   14.479096]  tcp_bpf_send_verdict+0x105/0x470
>> >> > [   14.479729]  tcp_bpf_sendmsg+0x318/0x4f0
>> >> > [   14.480311]  sock_sendmsg+0x2d/0x40
>> >> > [   14.480822]  ____sys_sendmsg+0x1b4/0x1c0
>> >> > [   14.481390]  ? copy_msghdr_from_user+0x62/0x80
>> >> > [   14.482048]  ___sys_sendmsg+0x78/0xb0
>> >> > [   14.482580]  ? vmf_insert_pfn_prot+0x91/0x150
>> >> > [   14.483215]  ? __do_fault+0x2a/0x1a0
>> >> > [   14.483738]  ? do_fault+0x15e/0x5d0
>> >> > [   14.484246]  ? __handle_mm_fault+0x56b/0x1040
>> >> > [   14.484874]  ? lock_is_held_type+0xdf/0x130
>> >> > [   14.485474]  ? find_held_lock+0x2d/0x90
>> >> > [   14.486046]  ? __sys_sendmsg+0x41/0x70
>> >> > [   14.486587]  __sys_sendmsg+0x41/0x70
>> >> > [   14.487105]  ? intel_pmu_drain_pebs_core+0x350/0x350
>> >> > [   14.487822]  do_syscall_64+0x34/0x80
>> >> > [   14.488345]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>> >> >
>> >> > The test scenario is as follows:
>> >> > thread1                               thread2
>> >> > -----------                           ---------------
>> >> >  tcp_bpf_sendmsg
>> >> >   tcp_bpf_send_verdict
>> >> >    tcp_bpf_sendmsg_redir              sock_close
>> >> >     tcp_bpf_push_locked                 __sock_release
>> >> >      tcp_bpf_push                         //inet_release
>> >> >       do_tcp_sendpages                    sock->ops->release
>> >> >        sk_stream_wait_memory          	   // tcp_close
>> >> >           sk_wait_event                      sk->sk_prot->close
>> >> >            release_sock(__sk);
>> >> >             ***
>> >> >
>> >> >                                                 lock_sock(sk);
>> >> >                                                   __tcp_close
>> >> >                                                     sock_orphan(sk)
>> >> >                                                       sk->sk_wq  = NULL
>> >> >                                                 release_sock
>> >> >             ****
>> >> >            lock_sock(__sk);
>> >> >           remove_wait_queue(sk_sleep(sk), &wait);
>> >> >              sk_sleep(sk)
>> >> >              //NULL pointer dereference
>> >> >              &rcu_dereference_raw(sk->sk_wq)->wait
>> >> >
>> >> > While thread1 is waiting for memory, the socket is released
>> >> > together with its wait queue because thread2 has closed it. This
>> >> > happens because tcp_bpf_send_verdict did not increase the f_count
>> >> > of psock->sk_redir->sk_socket->file in thread1.
>> >>
>> >> I'm not sure about this approach. Keeping a closed sock file alive,
>> >> just so we can wake up from sleep, seems like wasted effort.
>> >>
>> >> __tcp_close sets sk->sk_shutdown = RCV_SHUTDOWN |
>> SEND_SHUTDOWN.
>> >> So we will return from sk_stream_wait_memory via the do_error path.
>> >>
>> >> SEND_SHUTDOWN might be set because socket got closed and orphaned
>> -
>> >> dead and detached from its file, like in this case.
>> >>
>> >> So, IMHO, we should check if SOCK_DEAD flag is set on wakeup due to
>> >> SEND_SHUTDOWN in sk_stream_wait_memory, before accessing the
>> wait
>> >> queue.
>> >>
>> >> [...]
>> > With Jakub's approach, this problem can be solved.
>> >
>> > diff --git a/include/net/sock.h b/include/net/sock.h
>> > index a7273b289188..a3dab7140f1e 100644
>> > --- a/include/net/sock.h
>> > +++ b/include/net/sock.h
>> > @@ -1998,6 +1998,8 @@ static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>> >  static inline wait_queue_head_t *sk_sleep(struct sock *sk)
>> >  {
>> >         BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
>> > +       if (sock_flag(sk, SOCK_DEAD))
>> > +               return NULL;
>> >         return &rcu_dereference_raw(sk->sk_wq)->wait;
>> >  }
>> >  /* Detach socket from process context.
>> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>> > index 9860bb9a847c..da1be17d0b19 100644
>> > --- a/kernel/sched/wait.c
>> > +++ b/kernel/sched/wait.c
>> > @@ -51,6 +51,8 @@ void remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry
>> >  {
>> >         unsigned long flags;
>> >
>> > +       if (wq_head == NULL)
>> > +               return;
>> >         spin_lock_irqsave(&wq_head->lock, flags);
>> >         __remove_wait_queue(wq_head, wq_entry);
>> >         spin_unlock_irqrestore(&wq_head->lock, flags);
>> 
>> I don't know if we want to change the contract for sk_sleep() and
>> remove_wait_queue() so that they accept dead sockets or nulls.
>>
>> How about just:
>>
>> [...]
>
> It is all OK to me, thank you. Could you provide a formal patch?
>
> Tested-by: Liu Jian <liujian56@huawei.com>

Feel free to pull it into your patch set. I'm a bit backlogged
ATM. Besides, we also want the selftest that you have added.

You can add Suggested-by if you want.

[...]
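
For comparison, the fix originally posted as patch 1/2 took the other
route and pinned the peer socket's file for the duration of the
blocking send. Its body is not quoted in this subthread, so the
following is only a rough sketch of that idea; the function name, call
site, and locking details are assumed for illustration, not copied
from the patch:

/* Sketch of the reference-holding approach (not the literal patch):
 * pin the redirect target's struct file so that a concurrent close()
 * cannot orphan the socket and clear sk->sk_wq while
 * sk_stream_wait_memory() sleeps on it.
 */
static int tcp_bpf_push_pinned(struct sock *sk, struct sk_msg *msg,
			       u32 apply_bytes, int flags, bool uncharge)
{
	struct file *file = sk->sk_socket ? sk->sk_socket->file : NULL;
	int ret;

	if (file)
		get_file(file);		/* bump f_count across the wait */
	ret = tcp_bpf_push_locked(sk, msg, apply_bytes, flags, uncharge);
	if (file)
		fput(file);		/* drop it; may free the socket */
	return ret;
}

The objection above is precisely that this keeps a dead socket's file
alive only so the sleeper can unwind, which the SOCK_DEAD check
avoids.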

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-19 10:34       ` Jakub Sitnicki
  2022-08-20  3:01         ` liujian (CE)
@ 2022-08-22 15:52         ` Eric Dumazet
  2022-08-22 16:12           ` Jakub Sitnicki
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2022-08-22 15:52 UTC (permalink / raw)
  To: Jakub Sitnicki
  Cc: liujian (CE),
	john.fastabend, davem, yoshfuji, dsahern, kuba, pabeni, andrii,
	mykolal, ast, daniel, martin.lau, song, yhs, kpsingh, sdf,
	haoluo, jolsa, shuah, bpf

On Fri, Aug 19, 2022 at 3:37 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> [...]
>
> I don't know if we want to change the contract for sk_sleep() and
> remove_wait_queue() so that they accept dead sockets or nulls.
>
> How about just:
>
> diff --git a/net/core/stream.c b/net/core/stream.c
> index ccc083cdef23..1105057ce00a 100644
> --- a/net/core/stream.c
> +++ b/net/core/stream.c
> @@ -159,7 +159,8 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
>                 *timeo_p = current_timeo;
>         }
>  out:
> -       remove_wait_queue(sk_sleep(sk), &wait);
> +       if (!sock_flag(sk, SOCK_DEAD))
> +               remove_wait_queue(sk_sleep(sk), &wait);
>         return err;
>
>  do_error:
>

OK, but what about tcp_msg_wait_data() and udp_msg_wait_data()?
It seems they could be vulnerable as well.
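
(For context, tcp_msg_wait_data() follows the same pattern; the shape
below is recalled rather than quoted verbatim, so treat the details as
approximate.)

/* Rough shape of tcp_msg_wait_data() in net/ipv4/tcp_bpf.c -- not a
 * verbatim quote.  It dereferences sk_sleep(sk) on both sides of the
 * sleep, so a concurrent close() clearing sk->sk_wq would hit the
 * same NULL pointer as in sk_stream_wait_memory().
 */
static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock,
			     long timeo)
{
	DEFINE_WAIT_FUNC(wait, woken_wake_function);
	int ret;

	add_wait_queue(sk_sleep(sk), &wait);
	sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	ret = sk_wait_event(sk, &timeo,
			    !list_empty(&psock->ingress_msg), &wait);
	sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	remove_wait_queue(sk_sleep(sk), &wait);	/* same sk_sleep() deref */
	return ret;
}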

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory
  2022-08-22 15:52         ` Eric Dumazet
@ 2022-08-22 16:12           ` Jakub Sitnicki
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2022-08-22 16:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: liujian (CE),
	john.fastabend, davem, yoshfuji, dsahern, kuba, pabeni, andrii,
	mykolal, ast, daniel, martin.lau, song, yhs, kpsingh, sdf,
	haoluo, jolsa, shuah, bpf


On Mon, Aug 22, 2022 at 08:52 AM -07, Eric Dumazet wrote:
> [...]
>
> OK, but what about tcp_msg_wait_data() and udp_msg_wait_data()?
> It seems they could be vulnerable as well.

TBH, I haven't thought about these until now.

They are reachable only via sk->sk_prot->recvmsg, so from recvmsg(2) or
recvmmsg(2). The syscall layer will keep the socket file and the wait
queue alive.
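
Concretely, the recvmsg(2) entry path pins the socket file before
sk->sk_prot->recvmsg can block and only drops it afterwards. Roughly
(a simplified sketch, not verbatim net/socket.c):

/* Simplified shape of the recvmsg(2) entry path (not verbatim): the
 * file reference taken here keeps the socket -- and sk->sk_wq --
 * alive for the whole call, including any sleep inside
 * tcp_msg_wait_data() or udp_msg_wait_data().
 */
static long recvmsg_sketch(int fd, struct msghdr *msg, unsigned int flags)
{
	int err, fput_needed;
	struct socket *sock = sockfd_lookup_light(fd, &err, &fput_needed);

	if (!sock)
		return err;
	err = sock_recvmsg(sock, msg, flags);	/* may sleep in *_msg_wait_data() */
	fput_light(sock->file, fput_needed);	/* ref held across the sleep */
	return err;
}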

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-08-22 16:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-15  2:33 [PATCH bpf-next 0/2] > Keep reference on socket file while wait send memory Liu Jian
2022-08-15  2:33 ` [PATCH bpf-next 1/2] sk_msg: Keep reference on socket file while wait_memory Liu Jian
2022-08-17  0:54   ` John Fastabend
2022-08-17  2:11     ` liujian (CE)
2022-08-17 18:33       ` John Fastabend
2022-08-19  8:39   ` Jakub Sitnicki
2022-08-19 10:01     ` liujian (CE)
2022-08-19 10:34       ` Jakub Sitnicki
2022-08-20  3:01         ` liujian (CE)
2022-08-22 14:32           ` Jakub Sitnicki
2022-08-22 15:52         ` Eric Dumazet
2022-08-22 16:12           ` Jakub Sitnicki
2022-08-15  2:33 ` [PATCH bpf-next 2/2] selftests/bpf: Add wait send memory test for sockmap redirect Liu Jian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).