Netdev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH net] sk_msg: Keep reference on socket file while psock lives
@ 2019-02-11  9:09 Jakub Sitnicki
  2019-02-19 16:00 ` Daniel Borkmann
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Sitnicki @ 2019-02-11  9:09 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, John Fastabend, Marek Majkowski

Backlog work for psock (sk_psock_backlog) might sleep while waiting for
memory to free up when sending packets. While sleeping, socket can
disappear from under our feet together with its wait queue because the
userspace has closed it.

This breaks an assumption in sk_stream_wait_memory, which expects the
wait queue to be still there when it wakes up resulting in a
use-after-free:

==================================================================
BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110

CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
Workqueue: events sk_psock_backlog
Call Trace:
 print_address_description+0x6e/0x2b0
 ? remove_wait_queue+0x31/0x70
 kasan_report+0xfd/0x177
 ? remove_wait_queue+0x31/0x70
 ? remove_wait_queue+0x31/0x70
 remove_wait_queue+0x31/0x70
 sk_stream_wait_memory+0x4dd/0x5f0
 ? sk_stream_wait_close+0x1b0/0x1b0
 ? wait_woken+0xc0/0xc0
 ? tcp_current_mss+0xc5/0x110
 tcp_sendmsg_locked+0x634/0x15d0
 ? tcp_set_state+0x2e0/0x2e0
 ? __kasan_slab_free+0x1d1/0x230
 ? kmem_cache_free+0x70/0x140
 ? sk_psock_backlog+0x40c/0x4b0
 ? process_one_work+0x40b/0x660
 ? worker_thread+0x82/0x680
 ? kthread+0x1b9/0x1e0
 ? ret_from_fork+0x1f/0x30
 ? check_preempt_curr+0xaf/0x130
 ? iov_iter_kvec+0x5f/0x70
 ? kernel_sendmsg_locked+0xa0/0xe0
 skb_send_sock_locked+0x273/0x3c0
 ? skb_splice_bits+0x180/0x180
 ? start_thread+0xe0/0xe0
 ? update_min_vruntime.constprop.27+0x88/0xc0
 sk_psock_backlog+0xb3/0x4b0
 ? strscpy+0xbf/0x1e0
 process_one_work+0x40b/0x660
 worker_thread+0x82/0x680
 ? process_one_work+0x660/0x660
 kthread+0x1b9/0x1e0
 ? __kthread_create_on_node+0x250/0x250
 ret_from_fork+0x1f/0x30

Allocated by task 109:
 sock_alloc_inode+0x54/0x120
 alloc_inode+0x28/0xb0
 new_inode_pseudo+0x7/0x60
 sock_alloc+0x21/0xe0
 __sys_accept4+0xc2/0x330
 __x64_sys_accept+0x3b/0x50
 do_syscall_64+0xb2/0x3e0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 7:
 kfree+0x7f/0x140
 rcu_process_callbacks+0xe0/0x100
 __do_softirq+0xe5/0x29a

The buggy address belongs to the object at ffff888069a0c4e0
 which belongs to the cache kmalloc-64 of size 64
The buggy address is located 8 bytes inside of
 64-byte region [ffff888069a0c4e0, ffff888069a0c520)
The buggy address belongs to the page:
page:ffffea0001a68300 count:1 mapcount:0 mapping:ffff88806d4018c0 index:0x0
flags: 0x4000000000000200(slab)
raw: 4000000000000200 dead000000000100 dead000000000200 ffff88806d4018c0
raw: 0000000000000000 00000000002a002a 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888069a0c380: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
 ffff888069a0c400: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
>ffff888069a0c480: 00 00 00 00 00 00 00 00 fc fc fc fc fb fb fb fb
                                                          ^
 ffff888069a0c500: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
 ffff888069a0c580: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
==================================================================

Avoid it by keeping a reference to the socket file until the psock gets
destroyed.

While at it, rearrange the order of reference grabbing and
initialization to match the destructor in reverse.

Reported-by: Marek Majkowski <marek@cloudflare.com>
Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 net/core/skmsg.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 8c826603bf36..a38442b8580b 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -493,8 +493,13 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
 	refcount_set(&psock->refcnt, 1);
 
-	rcu_assign_sk_user_data(sk, psock);
+	/* Hold on to socket wait queue. Backlog work waits on it for
+	 * memory when sending. We must cancel work before socket wait
+	 * queue can go away.
+	 */
+	get_file(sk->sk_socket->file);
 	sock_hold(sk);
+	rcu_assign_sk_user_data(sk, psock);
 
 	return psock;
 }
@@ -558,6 +563,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
 	if (psock->sk_redir)
 		sock_put(psock->sk_redir);
 	sock_put(psock->sk);
+	fput(psock->sk->sk_socket->file);
 	kfree(psock);
 }
 
-- 
2.17.2


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-02-11  9:09 [PATCH net] sk_msg: Keep reference on socket file while psock lives Jakub Sitnicki
@ 2019-02-19 16:00 ` Daniel Borkmann
  2019-02-20  9:47   ` Jakub Sitnicki
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Borkmann @ 2019-02-19 16:00 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: netdev, John Fastabend, Marek Majkowski

Hi Jakub,

On 02/11/2019 10:09 AM, Jakub Sitnicki wrote:
> Backlog work for psock (sk_psock_backlog) might sleep while waiting for
> memory to free up when sending packets. While sleeping, socket can
> disappear from under our feet together with its wait queue because the
> userspace has closed it.
> 
> This breaks an assumption in sk_stream_wait_memory, which expects the
> wait queue to be still there when it wakes up resulting in a
> use-after-free:
> 
> ==================================================================
> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
> Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
> 
> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> Workqueue: events sk_psock_backlog
> Call Trace:
>  print_address_description+0x6e/0x2b0
>  ? remove_wait_queue+0x31/0x70
>  kasan_report+0xfd/0x177
>  ? remove_wait_queue+0x31/0x70
>  ? remove_wait_queue+0x31/0x70
>  remove_wait_queue+0x31/0x70
>  sk_stream_wait_memory+0x4dd/0x5f0
>  ? sk_stream_wait_close+0x1b0/0x1b0
>  ? wait_woken+0xc0/0xc0
>  ? tcp_current_mss+0xc5/0x110
>  tcp_sendmsg_locked+0x634/0x15d0
>  ? tcp_set_state+0x2e0/0x2e0
>  ? __kasan_slab_free+0x1d1/0x230
>  ? kmem_cache_free+0x70/0x140
>  ? sk_psock_backlog+0x40c/0x4b0
>  ? process_one_work+0x40b/0x660
>  ? worker_thread+0x82/0x680
>  ? kthread+0x1b9/0x1e0
>  ? ret_from_fork+0x1f/0x30
>  ? check_preempt_curr+0xaf/0x130
>  ? iov_iter_kvec+0x5f/0x70
>  ? kernel_sendmsg_locked+0xa0/0xe0
>  skb_send_sock_locked+0x273/0x3c0
>  ? skb_splice_bits+0x180/0x180
>  ? start_thread+0xe0/0xe0
>  ? update_min_vruntime.constprop.27+0x88/0xc0
>  sk_psock_backlog+0xb3/0x4b0
>  ? strscpy+0xbf/0x1e0
>  process_one_work+0x40b/0x660
>  worker_thread+0x82/0x680
>  ? process_one_work+0x660/0x660
>  kthread+0x1b9/0x1e0
>  ? __kthread_create_on_node+0x250/0x250
>  ret_from_fork+0x1f/0x30
> 
> Allocated by task 109:
>  sock_alloc_inode+0x54/0x120
>  alloc_inode+0x28/0xb0
>  new_inode_pseudo+0x7/0x60
>  sock_alloc+0x21/0xe0
>  __sys_accept4+0xc2/0x330
>  __x64_sys_accept+0x3b/0x50
>  do_syscall_64+0xb2/0x3e0
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 7:
>  kfree+0x7f/0x140
>  rcu_process_callbacks+0xe0/0x100
>  __do_softirq+0xe5/0x29a
> 
> The buggy address belongs to the object at ffff888069a0c4e0
>  which belongs to the cache kmalloc-64 of size 64
> The buggy address is located 8 bytes inside of
>  64-byte region [ffff888069a0c4e0, ffff888069a0c520)
> The buggy address belongs to the page:
> page:ffffea0001a68300 count:1 mapcount:0 mapping:ffff88806d4018c0 index:0x0
> flags: 0x4000000000000200(slab)
> raw: 4000000000000200 dead000000000100 dead000000000200 ffff88806d4018c0
> raw: 0000000000000000 00000000002a002a 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff888069a0c380: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
>  ffff888069a0c400: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
>> ffff888069a0c480: 00 00 00 00 00 00 00 00 fc fc fc fc fb fb fb fb
>                                                           ^
>  ffff888069a0c500: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
>  ffff888069a0c580: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
> ==================================================================
> 
> Avoid it by keeping a reference to the socket file until the psock gets
> destroyed.
> 
> While at it, rearrange the order of reference grabbing and
> initialization to match the destructor in reverse.
> 
> Reported-by: Marek Majkowski <marek@cloudflare.com>
> Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/core/skmsg.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 8c826603bf36..a38442b8580b 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -493,8 +493,13 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
>  	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
>  	refcount_set(&psock->refcnt, 1);
>  
> -	rcu_assign_sk_user_data(sk, psock);
> +	/* Hold on to socket wait queue. Backlog work waits on it for
> +	 * memory when sending. We must cancel work before socket wait
> +	 * queue can go away.
> +	 */
> +	get_file(sk->sk_socket->file);
>  	sock_hold(sk);
> +	rcu_assign_sk_user_data(sk, psock);
>  
>  	return psock;
>  }
> @@ -558,6 +563,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
>  	if (psock->sk_redir)
>  		sock_put(psock->sk_redir);
>  	sock_put(psock->sk);
> +	fput(psock->sk->sk_socket->file);

Thanks for the report (and sorry for the late reply). I think holding ref on
the struct file just so we keep it alive till deferred destruction might be
papering over the actual underlying bug. Nothing obvious pops out from staring
at the code right now; as a reproducer to run, did you use the prog in the link
above and hit it after your strparser fix?

Thanks again,
Daniel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-02-19 16:00 ` Daniel Borkmann
@ 2019-02-20  9:47   ` Jakub Sitnicki
  2019-05-21 20:07     ` John Fastabend
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Sitnicki @ 2019-02-20  9:47 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, John Fastabend, Marek Majkowski

Hi Daniel,

On Tue, Feb 19, 2019 at 05:00 PM CET, Daniel Borkmann wrote:
> On 02/11/2019 10:09 AM, Jakub Sitnicki wrote:
>> Backlog work for psock (sk_psock_backlog) might sleep while waiting for
>> memory to free up when sending packets. While sleeping, socket can
>> disappear from under our feet together with its wait queue because the
>> userspace has closed it.
>>
>> This breaks an assumption in sk_stream_wait_memory, which expects the
>> wait queue to be still there when it wakes up resulting in a
>> use-after-free:
>>
>> ==================================================================
>> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
>> Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
>>
>> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>> Workqueue: events sk_psock_backlog
>> Call Trace:
>>  print_address_description+0x6e/0x2b0
>>  ? remove_wait_queue+0x31/0x70
>>  kasan_report+0xfd/0x177
>>  ? remove_wait_queue+0x31/0x70
>>  ? remove_wait_queue+0x31/0x70
>>  remove_wait_queue+0x31/0x70
>>  sk_stream_wait_memory+0x4dd/0x5f0
>>  ? sk_stream_wait_close+0x1b0/0x1b0
>>  ? wait_woken+0xc0/0xc0
>>  ? tcp_current_mss+0xc5/0x110
>>  tcp_sendmsg_locked+0x634/0x15d0
>>  ? tcp_set_state+0x2e0/0x2e0
>>  ? __kasan_slab_free+0x1d1/0x230
>>  ? kmem_cache_free+0x70/0x140
>>  ? sk_psock_backlog+0x40c/0x4b0
>>  ? process_one_work+0x40b/0x660
>>  ? worker_thread+0x82/0x680
>>  ? kthread+0x1b9/0x1e0
>>  ? ret_from_fork+0x1f/0x30
>>  ? check_preempt_curr+0xaf/0x130
>>  ? iov_iter_kvec+0x5f/0x70
>>  ? kernel_sendmsg_locked+0xa0/0xe0
>>  skb_send_sock_locked+0x273/0x3c0
>>  ? skb_splice_bits+0x180/0x180
>>  ? start_thread+0xe0/0xe0
>>  ? update_min_vruntime.constprop.27+0x88/0xc0
>>  sk_psock_backlog+0xb3/0x4b0
>>  ? strscpy+0xbf/0x1e0
>>  process_one_work+0x40b/0x660
>>  worker_thread+0x82/0x680
>>  ? process_one_work+0x660/0x660
>>  kthread+0x1b9/0x1e0
>>  ? __kthread_create_on_node+0x250/0x250
>>  ret_from_fork+0x1f/0x30
>>
>> Allocated by task 109:
>>  sock_alloc_inode+0x54/0x120
>>  alloc_inode+0x28/0xb0
>>  new_inode_pseudo+0x7/0x60
>>  sock_alloc+0x21/0xe0
>>  __sys_accept4+0xc2/0x330
>>  __x64_sys_accept+0x3b/0x50
>>  do_syscall_64+0xb2/0x3e0
>>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Freed by task 7:
>>  kfree+0x7f/0x140
>>  rcu_process_callbacks+0xe0/0x100
>>  __do_softirq+0xe5/0x29a
>>
>> The buggy address belongs to the object at ffff888069a0c4e0
>>  which belongs to the cache kmalloc-64 of size 64
>> The buggy address is located 8 bytes inside of
>>  64-byte region [ffff888069a0c4e0, ffff888069a0c520)
>> The buggy address belongs to the page:
>> page:ffffea0001a68300 count:1 mapcount:0 mapping:ffff88806d4018c0 index:0x0
>> flags: 0x4000000000000200(slab)
>> raw: 4000000000000200 dead000000000100 dead000000000200 ffff88806d4018c0
>> raw: 0000000000000000 00000000002a002a 00000001ffffffff 0000000000000000
>> page dumped because: kasan: bad access detected
>>
>> Memory state around the buggy address:
>>  ffff888069a0c380: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
>>  ffff888069a0c400: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
>>> ffff888069a0c480: 00 00 00 00 00 00 00 00 fc fc fc fc fb fb fb fb
>>                                                           ^
>>  ffff888069a0c500: fb fb fb fb fc fc fc fc fb fb fb fb fb fb fb fb
>>  ffff888069a0c580: fc fc fc fc fb fb fb fb fb fb fb fb fc fc fc fc
>> ==================================================================
>>
>> Avoid it by keeping a reference to the socket file until the psock gets
>> destroyed.
>>
>> While at it, rearrange the order of reference grabbing and
>> initialization to match the destructor in reverse.
>>
>> Reported-by: Marek Majkowski <marek@cloudflare.com>
>> Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>>  net/core/skmsg.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 8c826603bf36..a38442b8580b 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -493,8 +493,13 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
>>  	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
>>  	refcount_set(&psock->refcnt, 1);
>>
>> -	rcu_assign_sk_user_data(sk, psock);
>> +	/* Hold on to socket wait queue. Backlog work waits on it for
>> +	 * memory when sending. We must cancel work before socket wait
>> +	 * queue can go away.
>> +	 */
>> +	get_file(sk->sk_socket->file);
>>  	sock_hold(sk);
>> +	rcu_assign_sk_user_data(sk, psock);
>>
>>  	return psock;
>>  }
>> @@ -558,6 +563,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
>>  	if (psock->sk_redir)
>>  		sock_put(psock->sk_redir);
>>  	sock_put(psock->sk);
>> +	fput(psock->sk->sk_socket->file);
>
> Thanks for the report (and sorry for the late reply). I think holding ref on
> the struct file just so we keep it alive till deferred destruction might be
> papering over the actual underlying bug. Nothing obvious pops out from staring
> at the code right now; as a reproducer to run, did you use the prog in the link
> above and hit it after your strparser fix?

I get you, I actually sat on this fix for a moment because I had a
similar concern, that holding a file ref is a heavy-handed fix and I'm
not seeing the real problem.

For me it came down to this:

1. tcp_sendmsg_locked that we call from psock backlog work can end up
   waiting for memory. We somehow need to ensure that the socket wait
   queue does not disappear until tcp_sendmsg_locked returns.

2. KCM, which I assume must have the same problem, holds a reference on
   the socket file.

I'm curious if there is another angle to it.

To answer your actual questions - your guesses are correct on both
accounts.

For the reproducer, I've used the TCP echo program from Marek [1]. On
the client side, I had something like:

  while :; do
    nc 10.0.0.1 12345 > /dev/null < /dev/zero &
    pid=$!
    sleep 0.1
    kill $pid
  done

I can dig out the test scripts or help testing any patches.

I was testing with the strparser fix applied, 1d79895aef18 ("sk_msg:
Always cancel strp work before freeing the psock"), which unfortunately
was not enough.

The explanation there was that the socket descriptor can get closed, and
in consequence the socket file can get destroyed, before the deferred
destructor for psock runs. So psock backlog work can be still very much
alive and running while the socket file is gone.

Thanks for looking into it,
-Jakub

[1] https://gist.github.com/majek/a09bcbeb8ab548cde6c18c930895c3f2

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-02-20  9:47   ` Jakub Sitnicki
@ 2019-05-21 20:07     ` John Fastabend
  2019-05-22 11:14       ` Jakub Sitnicki
  0 siblings, 1 reply; 9+ messages in thread
From: John Fastabend @ 2019-05-21 20:07 UTC (permalink / raw)
  To: Jakub Sitnicki, Daniel Borkmann; +Cc: netdev, John Fastabend, Marek Majkowski

Jakub Sitnicki wrote:
> Hi Daniel,
> 
> On Tue, Feb 19, 2019 at 05:00 PM CET, Daniel Borkmann wrote:
> > On 02/11/2019 10:09 AM, Jakub Sitnicki wrote:
> >> Backlog work for psock (sk_psock_backlog) might sleep while waiting for
> >> memory to free up when sending packets. While sleeping, socket can
> >> disappear from under our feet together with its wait queue because the
> >> userspace has closed it.
> >>
> >> This breaks an assumption in sk_stream_wait_memory, which expects the
> >> wait queue to be still there when it wakes up resulting in a
> >> use-after-free:
> >>
> >> ==================================================================
> >> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
> >> Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
> >>
> >> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> >> Workqueue: events sk_psock_backlog
> >> Call Trace:
> >>  print_address_description+0x6e/0x2b0
> >>  ? remove_wait_queue+0x31/0x70
> >>  kasan_report+0xfd/0x177
> >>  ? remove_wait_queue+0x31/0x70
> >>  ? remove_wait_queue+0x31/0x70
> >>  remove_wait_queue+0x31/0x70
> >>  sk_stream_wait_memory+0x4dd/0x5f0
> >>  ? sk_stream_wait_close+0x1b0/0x1b0
> >>  ? wait_woken+0xc0/0xc0
> >>  ? tcp_current_mss+0xc5/0x110
> >>  tcp_sendmsg_locked+0x634/0x15d0

[...]

> >>
> >> Avoid it by keeping a reference to the socket file until the psock gets
> >> destroyed.
> >>
> >> While at it, rearrange the order of reference grabbing and
> >> initialization to match the destructor in reverse.
> >>
> >> Reported-by: Marek Majkowski <marek@cloudflare.com>
> >> Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >>  net/core/skmsg.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> >> index 8c826603bf36..a38442b8580b 100644
> >> --- a/net/core/skmsg.c
> >> +++ b/net/core/skmsg.c
> >> @@ -493,8 +493,13 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
> >>  	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
> >>  	refcount_set(&psock->refcnt, 1);
> >>
> >> -	rcu_assign_sk_user_data(sk, psock);
> >> +	/* Hold on to socket wait queue. Backlog work waits on it for
> >> +	 * memory when sending. We must cancel work before socket wait
> >> +	 * queue can go away.
> >> +	 */
> >> +	get_file(sk->sk_socket->file);
> >>  	sock_hold(sk);
> >> +	rcu_assign_sk_user_data(sk, psock);
> >>
> >>  	return psock;
> >>  }
> >> @@ -558,6 +563,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
> >>  	if (psock->sk_redir)
> >>  		sock_put(psock->sk_redir);
> >>  	sock_put(psock->sk);
> >> +	fput(psock->sk->sk_socket->file);
> >
> > Thanks for the report (and sorry for the late reply). I think holding ref on
> > the struct file just so we keep it alive till deferred destruction might be
> > papering over the actual underlying bug. Nothing obvious pops out from staring
> > at the code right now; as a reproducer to run, did you use the prog in the link
> > above and hit it after your strparser fix?
> 
> I get you, I actually sat on this fix for a moment because I had a
> similar concern, that holding a file ref is a heavy-handed fix and I'm
> not seeing the real problem.
> 
> For me it came down to this:

> 1. tcp_sendmsg_locked that we call from psock backlog work can end up
>    waiting for memory. We somehow need to ensure that the socket wait
>    queue does not disappear until tcp_sendmsg_locked returns.
> 
> 2. KCM, which I assume must have the same problem, holds a reference on
>    the socket file.
> 
> I'm curious if there is another angle to it.
> 
> To answer your actual questions - your guesses are correct on both
> accounts.
> 
> For the reproducer, I've used the TCP echo program from Marek [1]. On
> the client side, I had something like:
> 
>   while :; do
>     nc 10.0.0.1 12345 > /dev/null < /dev/zero &
>     pid=$!
>     sleep 0.1
>     kill $pid
>   done
> 
> I can dig out the test scripts or help testing any patches.
> 
> I was testing with the strparser fix applied, 1d79895aef18 ("sk_msg:
> Always cancel strp work before freeing the psock"), which unfortunately
> was not enough.
> 
> The explanation there was that the socket descriptor can get closed, and
> in consequence the socket file can get destroyed, before the deferred
> destructor for psock runs. So psock backlog work can be still very much
> alive and running while the socket file is gone.
> 
> Thanks for looking into it,
> -Jakub
> 
> [1] https://gist.github.com/majek/a09bcbeb8ab548cde6c18c930895c3f2

In the sendpage case we set the MSG_DONTWAIT flag, I think we should
set it in the sendmsg case as well. This would result in
tcp_sendmsg_locked() setting timeo via sock_sndtimeo to zero and should
avoid any waiting. And then we avoid above use after free bug.

Then handle the error in sk_psock_backlog() correctly it returns an
EAGAIN (I think?) in this case so we will deduct any sent bytes and
increment the offset as needed and try again here. Except we will
have purged the ingress_skb list so we should abort.

And the work queue should be flushed before destroying psock so we
can be assured that the psock reference is not lost.

Here is what I'm thinking still untested. Because skmsg is the only
user of skb_send_sock_locked() this should be OK to do. If you have
time running your tests again would be great. I can also try to
repro.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e89be6282693..eadfd16be7db 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
                kv.iov_base = skb->data + offset;
                kv.iov_len = slen;
                memset(&msg, 0, sizeof(msg));
+               msg->flags = MSG_DONTWAIT;
 
                ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
                if (ret <= 0)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-05-21 20:07     ` John Fastabend
@ 2019-05-22 11:14       ` Jakub Sitnicki
  2019-05-23 15:58         ` John Fastabend
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Sitnicki @ 2019-05-22 11:14 UTC (permalink / raw)
  To: John Fastabend; +Cc: Daniel Borkmann, netdev, Marek Majkowski

Hi John,

On Tue, May 21, 2019 at 10:07 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> Hi Daniel,
>>
>> On Tue, Feb 19, 2019 at 05:00 PM CET, Daniel Borkmann wrote:
>> > On 02/11/2019 10:09 AM, Jakub Sitnicki wrote:
>> >> Backlog work for psock (sk_psock_backlog) might sleep while waiting for
>> >> memory to free up when sending packets. While sleeping, socket can
>> >> disappear from under our feet together with its wait queue because the
>> >> userspace has closed it.
>> >>
>> >> This breaks an assumption in sk_stream_wait_memory, which expects the
>> >> wait queue to be still there when it wakes up resulting in a
>> >> use-after-free:
>> >>
>> >> ==================================================================
>> >> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
>> >> Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
>> >>
>> >> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
>> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>> >> Workqueue: events sk_psock_backlog
>> >> Call Trace:
>> >>  print_address_description+0x6e/0x2b0
>> >>  ? remove_wait_queue+0x31/0x70
>> >>  kasan_report+0xfd/0x177
>> >>  ? remove_wait_queue+0x31/0x70
>> >>  ? remove_wait_queue+0x31/0x70
>> >>  remove_wait_queue+0x31/0x70
>> >>  sk_stream_wait_memory+0x4dd/0x5f0
>> >>  ? sk_stream_wait_close+0x1b0/0x1b0
>> >>  ? wait_woken+0xc0/0xc0
>> >>  ? tcp_current_mss+0xc5/0x110
>> >>  tcp_sendmsg_locked+0x634/0x15d0
>
> [...]
>
>> >>
>> >> Avoid it by keeping a reference to the socket file until the psock gets
>> >> destroyed.
>> >>
>> >> While at it, rearrange the order of reference grabbing and
>> >> initialization to match the destructor in reverse.
>> >>
>> >> Reported-by: Marek Majkowski <marek@cloudflare.com>
>> >> Link: https://lore.kernel.org/netdev/CAJPywTLwgXNEZ2dZVoa=udiZmtrWJ0q5SuBW64aYs0Y1khXX3A@mail.gmail.com
>> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> ---
>> >>  net/core/skmsg.c | 8 +++++++-
>> >>  1 file changed, 7 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> >> index 8c826603bf36..a38442b8580b 100644
>> >> --- a/net/core/skmsg.c
>> >> +++ b/net/core/skmsg.c
>> >> @@ -493,8 +493,13 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
>> >>  	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
>> >>  	refcount_set(&psock->refcnt, 1);
>> >>
>> >> -	rcu_assign_sk_user_data(sk, psock);
>> >> +	/* Hold on to socket wait queue. Backlog work waits on it for
>> >> +	 * memory when sending. We must cancel work before socket wait
>> >> +	 * queue can go away.
>> >> +	 */
>> >> +	get_file(sk->sk_socket->file);
>> >>  	sock_hold(sk);
>> >> +	rcu_assign_sk_user_data(sk, psock);
>> >>
>> >>  	return psock;
>> >>  }
>> >> @@ -558,6 +563,7 @@ static void sk_psock_destroy_deferred(struct work_struct *gc)
>> >>  	if (psock->sk_redir)
>> >>  		sock_put(psock->sk_redir);
>> >>  	sock_put(psock->sk);
>> >> +	fput(psock->sk->sk_socket->file);
>> >
>> > Thanks for the report (and sorry for the late reply). I think holding ref on
>> > the struct file just so we keep it alive till deferred destruction might be
>> > papering over the actual underlying bug. Nothing obvious pops out from staring
>> > at the code right now; as a reproducer to run, did you use the prog in the link
>> > above and hit it after your strparser fix?
>>
>> I get you, I actually sat on this fix for a moment because I had a
>> similar concern, that holding a file ref is a heavy-handed fix and I'm
>> not seeing the real problem.
>>
>> For me it came down to this:
>
>> 1. tcp_sendmsg_locked that we call from psock backlog work can end up
>>    waiting for memory. We somehow need to ensure that the socket wait
>>    queue does not disappear until tcp_sendmsg_locked returns.
>>
>> 2. KCM, which I assume must have the same problem, holds a reference on
>>    the socket file.
>>
>> I'm curious if there is another angle to it.
>>
>> To answer your actual questions - your guesses are correct on both
>> accounts.
>>
>> For the reproducer, I've used the TCP echo program from Marek [1]. On
>> the client side, I had something like:
>>
>>   while :; do
>>     nc 10.0.0.1 12345 > /dev/null < /dev/zero &
>>     pid=$!
>>     sleep 0.1
>>     kill $pid
>>   done
>>
>> I can dig out the test scripts or help testing any patches.
>>
>> I was testing with the strparser fix applied, 1d79895aef18 ("sk_msg:
>> Always cancel strp work before freeing the psock"), which unfortunately
>> was not enough.
>>
>> The explanation there was that the socket descriptor can get closed, and
>> in consequence the socket file can get destroyed, before the deferred
>> destructor for psock runs. So psock backlog work can be still very much
>> alive and running while the socket file is gone.
>>
>> Thanks for looking into it,
>> -Jakub
>>
>> [1] https://gist.github.com/majek/a09bcbeb8ab548cde6c18c930895c3f2
>
> In the sendpage case we set the MSG_DONTWAIT flag, I think we should
> set it in the sendmsg case as well. This would result in
> tcp_sendmsg_locked() setting timeo via sock_sndtimeo to zero and should
> avoid any waiting. And then we avoid above use after free bug.
>
> Then handle the error in sk_psock_backlog() correctly it returns an
> EAGAIN (I think?) in this case so we will deduct any sent bytes and
> increment the offset as needed and try again here. Except we will
> have purged the ingress_skb list so we should abort.
>
> And the work queue should be flushed before destroying psock so we
> can be assured that the psock reference is not lost.
>
> Here is what I'm thinking still untested. Because skmsg is the only
> user of skb_send_sock_locked() this should be OK to do. If you have
> time running your tests again would be great. I can also try to
> repro.
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e89be6282693..eadfd16be7db 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
>                 kv.iov_base = skb->data + offset;
>                 kv.iov_len = slen;
>                 memset(&msg, 0, sizeof(msg));
> +               msg->flags = MSG_DONTWAIT;
>
>                 ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
>                 if (ret <= 0)

Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
me. No more crashes in sk_stream_wait_memory. I've tested it on top of
current bpf-next (f49aa1de9836). Here's my:

  Tested-by: Jakub Sitnicki <jakub@cloudflare.com>

The actual I've tested is below, for completeness.

BTW. I've ran into another crash which I haven't seen before while
testing sockmap-echo, but it looks unrelated:

  https://lore.kernel.org/netdev/20190522100142.28925-1-jakub@cloudflare.com/

-Jakub

--- 8< ---

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e89be6282693..4a7c656b195b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
                kv.iov_base = skb->data + offset;
                kv.iov_len = slen;
                memset(&msg, 0, sizeof(msg));
+               msg.msg_flags = MSG_DONTWAIT;

                ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
                if (ret <= 0)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-05-22 11:14       ` Jakub Sitnicki
@ 2019-05-23 15:58         ` John Fastabend
  2019-05-24  8:05           ` Jakub Sitnicki
  0 siblings, 1 reply; 9+ messages in thread
From: John Fastabend @ 2019-05-23 15:58 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend; +Cc: Daniel Borkmann, netdev, Marek Majkowski

[...]

> 
> Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
> me. No more crashes in sk_stream_wait_memory. I've tested it on top of
> current bpf-next (f49aa1de9836). Here's my:
> 
>   Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
> 
> The actual I've tested is below, for completeness.
> 
> BTW. I've ran into another crash which I haven't seen before while
> testing sockmap-echo, but it looks unrelated:
> 
>   https://lore.kernel.org/netdev/20190522100142.28925-1-jakub@cloudflare.com/
> 
> -Jakub
> 
> --- 8< ---
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e89be6282693..4a7c656b195b 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
>                 kv.iov_base = skb->data + offset;
>                 kv.iov_len = slen;
>                 memset(&msg, 0, sizeof(msg));
> +               msg.msg_flags = MSG_DONTWAIT;
> 
>                 ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
>                 if (ret <= 0)

I went ahead and submitted this feel free to add your signed-off-by.

Thanks,
John

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-05-23 15:58         ` John Fastabend
@ 2019-05-24  8:05           ` Jakub Sitnicki
  2019-05-24 15:51             ` John Fastabend
  0 siblings, 1 reply; 9+ messages in thread
From: Jakub Sitnicki @ 2019-05-24  8:05 UTC (permalink / raw)
  To: John Fastabend; +Cc: Daniel Borkmann, netdev, Marek Majkowski

On Thu, May 23, 2019 at 05:58 PM CEST, John Fastabend wrote:
> [...]
>
>>
>> Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
>> me. No more crashes in sk_stream_wait_memory. I've tested it on top of
>> current bpf-next (f49aa1de9836). Here's my:
>>
>>   Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
>>
>> The actual I've tested is below, for completeness.
>>
>> BTW. I've ran into another crash which I haven't seen before while
>> testing sockmap-echo, but it looks unrelated:
>>
>>   https://lore.kernel.org/netdev/20190522100142.28925-1-jakub@cloudflare.com/
>>
>> -Jakub
>>
>> --- 8< ---
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index e89be6282693..4a7c656b195b 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
>>                 kv.iov_base = skb->data + offset;
>>                 kv.iov_len = slen;
>>                 memset(&msg, 0, sizeof(msg));
>> +               msg.msg_flags = MSG_DONTWAIT;
>>
>>                 ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
>>                 if (ret <= 0)
>
> I went ahead and submitted this feel free to add your signed-off-by.

Thanks! The fix was all your idea :-)

Now that those pesky crashes are gone, we plan to look into drops when
doing echo with sockmap. Marek tried running echo-sockmap [1] with
latest bpf-next (plus mentioned crash fixes) and reports that not all
data bounces back:

$ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
971832
$ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
867352
$ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
952648

I'm tring to turn echo-sockmap into a selftest but as you can probably
guess over loopback all works fine.

-Jakub

[1] https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-05-24  8:05           ` Jakub Sitnicki
@ 2019-05-24 15:51             ` John Fastabend
  2019-05-27  9:27               ` Jakub Sitnicki
  0 siblings, 1 reply; 9+ messages in thread
From: John Fastabend @ 2019-05-24 15:51 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend; +Cc: Daniel Borkmann, netdev, Marek Majkowski

Jakub Sitnicki wrote:
> On Thu, May 23, 2019 at 05:58 PM CEST, John Fastabend wrote:
> > [...]
> >
> >>
> >> Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
> >> me. No more crashes in sk_stream_wait_memory. I've tested it on top of
> >> current bpf-next (f49aa1de9836). Here's my:
> >>
> >>   Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
> >>
> >> The actual I've tested is below, for completeness.
> >>
> >> BTW. I've ran into another crash which I haven't seen before while
> >> testing sockmap-echo, but it looks unrelated:
> >>
> >>   https://lore.kernel.org/netdev/20190522100142.28925-1-jakub@cloudflare.com/
> >>
> >> -Jakub
> >>
> >> --- 8< ---
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index e89be6282693..4a7c656b195b 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
> >>                 kv.iov_base = skb->data + offset;
> >>                 kv.iov_len = slen;
> >>                 memset(&msg, 0, sizeof(msg));
> >> +               msg.msg_flags = MSG_DONTWAIT;
> >>
> >>                 ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
> >>                 if (ret <= 0)
> >
> > I went ahead and submitted this feel free to add your signed-off-by.
> 
> Thanks! The fix was all your idea :-)

If I can push the correct patch to Daniel it should be in bpf tree
soon.

> 
> Now that those pesky crashes are gone, we plan to look into drops when
> doing echo with sockmap. Marek tried running echo-sockmap [1] with
> latest bpf-next (plus mentioned crash fixes) and reports that not all
> data bounces back:
> 
> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
> 971832
> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
> 867352
> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
> 952648
> 
> I'm tring to turn echo-sockmap into a selftest but as you can probably
> guess over loopback all works fine.
> 

Right, sockmap when used from recvmsg with redirect is lossy. This
was a design choice I made that apparently caught a few people
by surprise. The original rationale for this was when doing a
multiplex operation, e.g. single ingress socket to many egress
sockets blocking would cause head of line blocking on all
sockets. To resolve this I simply dropped the packet and then allow
the flow to continue. This pushes the logic up to the application
to do retries, etc. when this happens. FWIW userspace proxies I
tested also had similar points where they fell over and dropped
packets. In hind sight though it probably would have made more
sense to make this behavior opt-in vs the default. But, the
use case I was solving at the time I wrote this could handle
drops and was actually a NxM sockets with N ingress sockets and M
egress sockets so head of line blocking was a real problem.

Adding a flag to turn this into a blocking op has been on my
todo list for awhile. Especially when sockmap is being used as
a single ingress to single egress socket then blocking vs dropping
makes much more sense.

The place to look is in sk_psock_verdict_apply() in __SK_REDIRECT
case there is a few checks and notice we can fallthrough to a
kfree_skb(skb). This is most likely the drops you are hitting.
Maybe annotate it with a dbg statement to check.

To fix this we could have a flag to _not_ drop but enqueue the
packet regardless of the test or hold it until space is
available. I even think sk_psock_strp_read could push back
on the stream parser which would eventually push back via TCP
and get the behavior you want.

Also, I have a couple items on my TODO list that I'll eventually
get to. First we run without stream parsers in some Cilium
use cases. I'll push some patches to allow this in the next
months or so. This avoids the annoying stream parser prog that
simply returns skb->len. This is mostly an optimizations. A
larger change I want to make at some point is to remove the
backlog workqueue altogether. Originally it was added for
simplicity but actually causes some latency spikes when
looking at 99+ percentiles. It really doesn't need to be
there it was a hold over from some original architecture that
got pushed upstream. If you have time and want to let me know
if you would like to tackle removing it.

Thanks,
John

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net] sk_msg: Keep reference on socket file while psock lives
  2019-05-24 15:51             ` John Fastabend
@ 2019-05-27  9:27               ` Jakub Sitnicki
  0 siblings, 0 replies; 9+ messages in thread
From: Jakub Sitnicki @ 2019-05-27  9:27 UTC (permalink / raw)
  To: John Fastabend; +Cc: Daniel Borkmann, netdev, Marek Majkowski

On Fri, May 24, 2019 at 05:51 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>>
>> Now that those pesky crashes are gone, we plan to look into drops when
>> doing echo with sockmap. Marek tried running echo-sockmap [1] with
>> latest bpf-next (plus mentioned crash fixes) and reports that not all
>> data bounces back:
>>
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 971832
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 867352
>> $ yes| head -c $[1024*1024] | nc -q2 192.168.1.33 1234 |wc -c
>> 952648
>>
>> I'm tring to turn echo-sockmap into a selftest but as you can probably
>> guess over loopback all works fine.
>>
>
> Right, sockmap when used from recvmsg with redirect is lossy. This
> was a design choice I made that apparently caught a few people
> by surprise. The original rationale for this was when doing a
> multiplex operation, e.g. single ingress socket to many egress
> sockets blocking would cause head of line blocking on all
> sockets. To resolve this I simply dropped the packet and then allow
> the flow to continue. This pushes the logic up to the application
> to do retries, etc. when this happens. FWIW userspace proxies I
> tested also had similar points where they fell over and dropped
> packets. In hind sight though it probably would have made more
> sense to make this behavior opt-in vs the default. But, the
> use case I was solving at the time I wrote this could handle
> drops and was actually a NxM sockets with N ingress sockets and M
> egress sockets so head of line blocking was a real problem.
>
> Adding a flag to turn this into a blocking op has been on my
> todo list for awhile. Especially when sockmap is being used as
> a single ingress to single egress socket then blocking vs dropping
> makes much more sense.
>
> The place to look is in sk_psock_verdict_apply() in __SK_REDIRECT
> case there is a few checks and notice we can fallthrough to a
> kfree_skb(skb). This is most likely the drops you are hitting.
> Maybe annotate it with a dbg statement to check.
>
> To fix this we could have a flag to _not_ drop but enqueue the
> packet regardless of the test or hold it until space is
> available. I even think sk_psock_strp_read could push back
> on the stream parser which would eventually push back via TCP
> and get the behavior you want.
>
> Also, I have a couple items on my TODO list that I'll eventually
> get to. First we run without stream parsers in some Cilium
> use cases. I'll push some patches to allow this in the next
> months or so. This avoids the annoying stream parser prog that
> simply returns skb->len. This is mostly an optimizations. A
> larger change I want to make at some point is to remove the
> backlog workqueue altogether. Originally it was added for
> simplicity but actually causes some latency spikes when
> looking at 99+ percentiles. It really doesn't need to be
> there it was a hold over from some original architecture that
> got pushed upstream. If you have time and want to let me know
> if you would like to tackle removing it.

This is great stuff. Thanks for explaining the sockmap's design and
decisions behind it. The opt-in blocking mode idea is spot on.

I imagine we'll get back to sockmap once we have a replacement for
TPROXY figured out (unrelated to sockmap). Let's sync then.

-Jakub

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-11  9:09 [PATCH net] sk_msg: Keep reference on socket file while psock lives Jakub Sitnicki
2019-02-19 16:00 ` Daniel Borkmann
2019-02-20  9:47   ` Jakub Sitnicki
2019-05-21 20:07     ` John Fastabend
2019-05-22 11:14       ` Jakub Sitnicki
2019-05-23 15:58         ` John Fastabend
2019-05-24  8:05           ` Jakub Sitnicki
2019-05-24 15:51             ` John Fastabend
2019-05-27  9:27               ` Jakub Sitnicki

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org netdev@archiver.kernel.org
	public-inbox-index netdev


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox