Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives

From: Jakub Sitnicki <jakub@cloudflare.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	netdev@vger.kernel.org, Marek Majkowski <marek@cloudflare.com>
Subject: Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
Date: Wed, 22 May 2019 13:14:00 +0200	[thread overview]
Message-ID: <87v9y2zqpz.fsf@cloudflare.com> (raw)
In-Reply-To: <5ce45a6fcd82d_48b72ac3337c45b85f@john-XPS-13-9360.notmuch>

Hi John,

On Tue, May 21, 2019 at 10:07 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> Hi Daniel,
>>
>> On Tue, Feb 19, 2019 at 05:00 PM CET, Daniel Borkmann wrote:
>> > On 02/11/2019 10:09 AM, Jakub Sitnicki wrote:
>> >> Backlog work for psock (sk_psock_backlog) might sleep while waiting for
>> >> memory to free up when sending packets. While sleeping, socket can
>> >> disappear from under our feet together with its wait queue because the
>> >> userspace has closed it.
>> >>
>> >> This breaks an assumption in sk_stream_wait_memory, which expects the
>> >> wait queue to be still there when it wakes up resulting in a
>> >> use-after-free:
>> >>
>> >> ==================================================================
>> >> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
>> >> Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
>> >>
>> >> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>> >> Workqueue: events sk_psock_backlog
>> >> Call Trace:
>> >>  print_address_description+0x6e/0x2b0
>> >>  ? remove_wait_queue+0x31/0x70
>> >>  kasan_report+0xfd/0x177
>> >>  ? remove_wait_queue+0x31/0x70
>> >>  ? remove_wait_queue+0x31/0x70
>> >>  remove_wait_queue+0x31/0x70
>> >>  sk_stream_wait_memory+0x4dd/0x5f0
>> >>  ? sk_stream_wait_close+0x1b0/0x1b0
>> >>  ? wait_woken+0xc0/0xc0
>> >>  ? tcp_current_mss+0xc5/0x110
>> >>  tcp_sendmsg_locked+0x634/0x15d0
>
> [...]
>
>> >>
>> >> Avoid it by keeping a reference to the socket file until the psock gets
>> >> destroyed.
>> >>
>> >> While at it, rearrange the order of reference grabbing and
>> >> initialization to match the destructor in reverse.
>> >>
>> >> Reported-by: Marek Majkowski <marek@cloudflare.com>
>> >> Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>  net/core/skmsg.c | 8 +++++++-
>> >>  1 file changed, 7 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> >> index 8c826603bf36..a38442b8580b 100644
>> >> --- a/net/core/skmsg.c
>> >> +++ b/net/core/skmsg.c
>> >> @@ -493,8 +493,13 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
>> >>  	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
>> >>  	refcount_set(&psock->refcnt, 1);
>> >>
>> >> -	rcu_assign_sk_user_data(sk, psock);
>> >> +	/* Hold on to socket wait queue. Backlog work waits on it for
>> >> +	 * memory when sending. We must cancel work before socket wait
>> >> +	 * queue can go away.
>> >> +	 */
>> >> +	get_file(sk->sk_socket->file);
>> >>  	sock_hold(sk);
>> >> +	rcu_assign_sk_user_data(sk, psock);
>> >>
>> >>  	return psock;
>> >>  }
>> >> @@ -558,6 +563,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
>> >>  	if (psock->sk_redir)
>> >>  		sock_put(psock->sk_redir);
>> >>  	sock_put(psock->sk);
>> >> +	fput(psock->sk->sk_socket->file);
>> >
>> > Thanks for the report (and sorry for the late reply). I think holding ref on
>> > the struct file just so we keep it alive till deferred destruction might be
>> > papering over the actual underlying bug. Nothing obvious pops out from staring
>> > at the code right now; as a reproducer to run, did you use the prog in the link
>> > above and hit it after your strparser fix?
>>
>> I get you, I actually sat on this fix for a moment because I had a
>> similar concern, that holding a file ref is a heavy-handed fix and I'm
>> not seeing the real problem.
>>
>> For me it came down to this:
>
>> 1. tcp_sendmsg_locked that we call from psock backlog work can end up
>>    waiting for memory. We somehow need to ensure that the socket wait
>>    queue does not disappear until tcp_sendmsg_locked returns.
>>
>> 2. KCM, which I assume must have the same problem, holds a reference on
>>    the socket file.
>>
>> I'm curious if there is another angle to it.
>>
>> To answer your actual questions - your guesses are correct on both
>> accounts.
>>
>> For the reproducer, I've used the TCP echo program from Marek [1]. On
>> the client side, I had something like:
>>
>>   while :; do
>>     nc 10.0.0.1 12345 > /dev/null < /dev/zero &
>>     pid=$!
>>     sleep 0.1
>>     kill $pid
>>   done
>>
>> I can dig out the test scripts or help testing any patches.
>>
>> I was testing with the strparser fix applied, 1d79895aef18 ("sk_msg:
>> Always cancel strp work before freeing the psock"), which unfortunately
>> was not enough.
>>
>> The explanation there was that the socket descriptor can get closed, and
>> in consequence the socket file can get destroyed, before the deferred
>> destructor for psock runs. So psock backlog work can be still very much
>> alive and running while the socket file is gone.
>>
>> Thanks for looking into it,
>> -Jakub
>>
>> [1] https://gist.github.com/majek/a09bcbeb8ab548cde6c18c930895c3f2
>
> In the sendpage case we set the MSG_DONTWAIT flag, I think we should
> set it in the sendmsg case as well. This would result in
> tcp_sendmsg_locked() setting timeo via sock_sndtimeo to zero and should
> avoid any waiting. And then we avoid above use after free bug.
>
> Then handle the error in sk_psock_backlog() correctly it returns an
> EAGAIN (I think?) in this case so we will deduct any sent bytes and
> increment the offset as needed and try again here. Except we will
> have purged the ingress_skb list so we should abort.
>
> And the work queue should be flushed before destroying psock so we
> can be assured that the psock reference is not lost.
>
> Here is what I'm thinking still untested. Because skmsg is the only
> user of skb_send_sock_locked() this should be OK to do. If you have
> time running your tests again would be great. I can also try to
> repro.
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e89be6282693..eadfd16be7db 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
>                 kv.iov_base = skb->data + offset;
>                 kv.iov_len = slen;
>                 memset(&msg, 0, sizeof(msg));
> +               msg->flags = MSG_DONTWAIT;
>
>                 ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
>                 if (ret <= 0)

Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
me. No more crashes in sk_stream_wait_memory. I've tested it on top of
current bpf-next (f49aa1de9836). Here's my:

  Tested-by: Jakub Sitnicki <jakub@cloudflare.com>

The actual I've tested is below, for completeness.

BTW. I've ran into another crash which I haven't seen before while
testing sockmap-echo, but it looks unrelated:

  https://lore.kernel.org/netdev/20190522100142.28925-1-jakub@cloudflare.com/

-Jakub

--- 8< ---

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e89be6282693..4a7c656b195b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
                kv.iov_base = skb->data + offset;
                kv.iov_len = slen;
                memset(&msg, 0, sizeof(msg));
+               msg.msg_flags = MSG_DONTWAIT;

                ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
                if (ret <= 0)