* [bpf PATCH 0/2] bpf, sockmap fixes
@ 2021-03-24 20:59 John Fastabend
  2021-03-24 20:59 ` [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset John Fastabend
  2021-03-24 20:59 ` [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting John Fastabend
  0 siblings, 2 replies; 11+ messages in thread
From: John Fastabend @ 2021-03-24 20:59 UTC (permalink / raw)
  To: john.fastabend, andrii, daniel, ast; +Cc: xiyou.wangcong, bpf, netdev, lmb

This addresses an issue found while reviewing the latest round of sockmap
patches and an issue reported from CI via Andrii.

The CI-discovered issue was introduced by overcorrecting our
previously broken memory accounting. With the fix "bpf, sockmap:
Avoid returning unneeded EAGAIN when redirecting to self" we fixed
a dropped packet and a missing fwd_alloc calculation, but pushed
the accounting too far back into the packet pipeline, creating an issue
in the unlikely case that socket teardown happens with an enqueued skb.
See the patch for details.

Tested with usual suspects: test_sockmap, test_maps, test_progs
and test_progs-no_alu32.

---

John Fastabend (2):
      bpf, sockmap: fix sk->prot unhash op reset
      bpf, sockmap: fix incorrect fwd_alloc accounting


 include/linux/skmsg.h |    1 -
 net/core/skmsg.c      |   13 ++++++-------
 net/tls/tls_main.c    |    6 ++++++
 3 files changed, 12 insertions(+), 8 deletions(-)



* [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset
  2021-03-24 20:59 [bpf PATCH 0/2] bpf, sockmap fixes John Fastabend
@ 2021-03-24 20:59 ` John Fastabend
  2021-03-25  0:11   ` Cong Wang
  2021-03-24 20:59 ` [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting John Fastabend
  1 sibling, 1 reply; 11+ messages in thread
From: John Fastabend @ 2021-03-24 20:59 UTC (permalink / raw)
  To: john.fastabend, andrii, daniel, ast; +Cc: xiyou.wangcong, bpf, netdev, lmb

In '4da6a196f93b1' we fixed a potential unhash loop caused when
a TLS socket in a sockmap was removed from the sockmap. This
happened because the unhash operation on the TLS ctx continued
to point at the sockmap implementation of unhash even though the
psock had already been removed. When the psock has been removed, the
sockmap unhash handler does the following:

 void sock_map_unhash(struct sock *sk)
 {
	void (*saved_unhash)(struct sock *sk);
	struct sk_psock *psock;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (unlikely(!psock)) {
		rcu_read_unlock();
		if (sk->sk_prot->unhash)
			sk->sk_prot->unhash(sk);
		return;
	}
        [...]
 }

The unlikely() branch handles the case where the psock is detached
but the proto ops have not been updated yet. But in the above case,
with TLS and a removed psock, we never fixed sk_prot->unhash(), so unhash()
points back to sock_map_unhash(), resulting in a loop. To fix this we added
this bit of code:

 static inline void sk_psock_restore_proto(struct sock *sk,
                                          struct sk_psock *psock)
 {
       sk->sk_prot->unhash = psock->saved_unhash;

This will set the sk_prot->unhash back to its saved value. This is the
correct callback for a TLS socket that has been removed from the sock_map.
Unfortunately, this also overwrites the unhash pointer for all psocks.
We effectively break sockmap unhash handling for any future socks.
Omitting the unhash operation will leave stale entries in the map if
a socket transitions through unhash but does not do a close() op.

To fix this, handle unhash similarly to write_space and rewrite it in the
TLS update hook. This way the TLS-enabled socket will point to the saved
unhash() handler.

Fixes: 4da6a196f93b1 ("bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |    1 -
 net/tls/tls_main.c    |    6 ++++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 8edbbf5f2f93..f6009fe9c9ac 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -349,7 +349,6 @@ static inline void sk_psock_update_proto(struct sock *sk,
 static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
-	sk->sk_prot->unhash = psock->saved_unhash;
 	if (inet_csk_has_ulp(sk)) {
 		tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
 	} else {
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 47b7c5334c34..ecb5634b4c4a 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -754,6 +754,12 @@ static void tls_update(struct sock *sk, struct proto *p,
 
 	ctx = tls_get_ctx(sk);
 	if (likely(ctx)) {
+		/* TLS does not have an unhash proto in SW cases, but we need
+		 * to ensure we stop using the sock_map unhash routine because
+		 * the associated psock is being removed. So use the original
+		 * unhash handler.
+		 */
+		WRITE_ONCE(sk->sk_prot->unhash, p->unhash);
 		ctx->sk_write_space = write_space;
 		ctx->sk_proto = p;
 	} else {


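The unhash loop described in patch 1/2 can be reproduced in a small userspace model. This is only a sketch under stated assumptions: the structs below are toy stand-ins, not the kernel's struct sock or struct proto, and the names model_sock_map_unhash(), base_unhash(), model_unhash_loops() and MAX_DEPTH are hypothetical. The depth guard exists only so the model terminates; the kernel handler quoted above has no such guard.

```c
#include <assert.h>
#include <stddef.h>

struct sock;

/* Toy stand-in for struct proto: only the unhash callback is modelled. */
struct proto {
	void (*unhash)(struct sock *sk);
};

/* Toy stand-in for struct sock. */
struct sock {
	struct proto *sk_prot;
	void *psock;     /* NULL once the psock has been detached */
	int depth;       /* model-only: how often the sockmap handler ran */
	int base_calls;  /* model-only: how often the original unhash ran */
};

#define MAX_DEPTH 8  /* model-only guard so the demo terminates */

/* Stands in for the original (pre-sockmap) unhash handler. */
static void base_unhash(struct sock *sk)
{
	sk->base_calls++;
}

/* Mirrors the shape of the handler quoted in the commit message. */
static void model_sock_map_unhash(struct sock *sk)
{
	if (++sk->depth > MAX_DEPTH)
		return;
	if (!sk->psock) {
		/* psock is gone: defer to whatever sk_prot->unhash says */
		if (sk->sk_prot->unhash)
			sk->sk_prot->unhash(sk);
		return;
	}
	/* psock still attached: the real handler does more work here */
}

/*
 * Detach the psock, optionally restore the original unhash pointer,
 * then invoke unhash. Returns 1 if the handler kept calling itself.
 */
static int model_unhash_loops(int restore_unhash)
{
	static int dummy_psock;
	struct proto prot = { .unhash = model_sock_map_unhash };
	struct sock sk = { .sk_prot = &prot, .psock = &dummy_psock };

	sk.psock = NULL;                    /* psock removed from the map */
	if (restore_unhash)
		prot.unhash = base_unhash;  /* pointer put back, as the fix intends */

	sk.sk_prot->unhash(&sk);
	return sk.depth > MAX_DEPTH;
}
```

Calling model_unhash_loops(0) leaves the callback pointing at the sockmap handler and runs into the guard, while model_unhash_loops(1) reaches base_unhash() directly. The model says nothing about where the pointer should be restored, which is the point under discussion in the replies.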

* [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting
  2021-03-24 20:59 [bpf PATCH 0/2] bpf, sockmap fixes John Fastabend
  2021-03-24 20:59 ` [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset John Fastabend
@ 2021-03-24 20:59 ` John Fastabend
  2021-03-25  0:44   ` Cong Wang
  1 sibling, 1 reply; 11+ messages in thread
From: John Fastabend @ 2021-03-24 20:59 UTC (permalink / raw)
  To: john.fastabend, andrii, daniel, ast; +Cc: xiyou.wangcong, bpf, netdev, lmb

Incorrect accounting of fwd_alloc can result in a warning when the socket
is torn down,

 [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
 [...]
 [18455.319543] Call Trace:
 [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
 [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
 [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
 [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
 [18455.319631]  ? sk_reset_timer+0x15/0x70
 [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
 [18455.319663]  ? lock_release+0xb2/0x3f0
 [18455.319676]  ? __release_sock+0x8a/0x1b0
 [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
 [18455.319704]  ? lock_release+0x3f0/0x3f0
 [18455.319717]  ? __tcp_close+0x2c6/0x790
 [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
 [18455.319750]  tcp_v4_do_rcv+0x168/0x370
 [18455.319767]  __release_sock+0xbc/0x1b0
 [18455.319785]  __tcp_close+0x2ee/0x790
 [18455.319805]  tcp_close+0x20/0x80

This currently happens because in the redirect case we do skb_set_owner_r()
with the original sock. This increments the fwd_alloc memory accounting
on the original sock. Then on redirect we may push this skb into the queue
of the psock we are redirecting to. When the skb is flushed from the
queue we give the memory back to the original sock. The problem is that if
the original sock is destroyed/closed with skbs on another psock's queue,
then the original sock will not have a way to reclaim the memory before
being destroyed. Then the above warning will be thrown:

  sockA                          sockB

  sk_psock_strp_read()
   sk_psock_verdict_apply()
     -- SK_REDIRECT --
     sk_psock_skb_redirect()
                                skb_queue_tail(psock_other->ingress_skb..)

  sk_close()
   sock_map_unref()
     sk_psock_put()
       sk_psock_drop()
         sk_psock_zap_ingress()

At this point we have torn down our own psock, but have the outstanding
skb in psock_other. Note that SK_PASS doesn't have this problem because
the sk_psock_drop() logic releases the skb, since it is still associated
with our psock.

To resolve this, let's only account for sockets on the ingress queue that are
still associated with the current socket. In the redirect case we will
check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
until the skb is actually enqueued. When the skb is sent via skb_send_sock_locked
or received with sk_psock_skb_ingress, memory will be claimed on psock_other.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 1261512d6807..f150b5b63561 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
 	if (unlikely(!msg))
 		return -EAGAIN;
 	sk_msg_init(msg);
+	skb_set_owner_r(skb, sk);
 	return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
 }
 
@@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int
 {
 	switch (verdict) {
 	case __SK_REDIRECT:
-		skb_set_owner_r(skb, sk);
 		sk_psock_skb_redirect(skb);
 		break;
 	case __SK_PASS:
@@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
 	rcu_read_lock();
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
-		/* We skip full set_owner_r here because if we do a SK_PASS
-		 * or SK_DROP we can skip skb memory accounting and use the
-		 * TLS context.
-		 */
 		skb->sk = psock->sk;
 		tcp_skb_bpf_redirect_clear(skb);
 		ret = sk_psock_bpf_run(psock, prog, skb);
@@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 		kfree_skb(skb);
 		goto out;
 	}
-	skb_set_owner_r(skb, sk);
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
+		skb->sk = psock->sk;
 		tcp_skb_bpf_redirect_clear(skb);
 		ret = sk_psock_bpf_run(psock, prog, skb);
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
+		skb->sk = NULL;
 	}
 	sk_psock_verdict_apply(psock, skb, ret);
 out:
@@ -956,12 +953,14 @@ static int sk_psock_verdict_recv(read_descriptor_t *desc, struct sk_buff *skb,
 		kfree_skb(skb);
 		goto out;
 	}
-	skb_set_owner_r(skb, sk);
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
+		skb_orphan(skb);
+		skb->sk = sk;
 		tcp_skb_bpf_redirect_clear(skb);
 		ret = sk_psock_bpf_run(psock, prog, skb);
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
+		skb->sk = NULL;
 	}
 	sk_psock_verdict_apply(psock, skb, ret);
 out:


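The accounting problem in patch 2/2 can be illustrated with a toy model of two sockets and one redirected skb. This is a sketch, not kernel code: struct sock and struct skb here are simplified stand-ins, the "charged" counter is a simplified substitute for the kernel's forward-alloc bookkeeping (the real sign conventions differ), and model_set_owner_r(), model_free() and model_redirect_then_close() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Toy socket: tracks bytes currently charged to it. */
struct sock {
	int charged;
};

/* Toy skb: remembers which socket, if any, was charged for it. */
struct skb {
	struct sock *owner;
	int truesize;
};

/* Model of charging an skb to a socket. */
static void model_set_owner_r(struct skb *skb, struct sock *sk)
{
	skb->owner = sk;
	sk->charged += skb->truesize;
}

/* Model of freeing an skb: the charge goes back to its owner. */
static void model_free(struct skb *skb)
{
	if (skb->owner) {
		skb->owner->charged -= skb->truesize;
		skb->owner = NULL;
	}
}

/*
 * sockA redirects one skb to sockB and is then closed while the skb is
 * still queued on sockB. Returns the charge still outstanding on sockA
 * at close time; nonzero stands in for the condition behind the warning.
 */
static int model_redirect_then_close(int charge_at_enqueue)
{
	struct sock a = { 0 }, b = { 0 };
	struct skb skb = { .owner = NULL, .truesize = 256 };
	int left_on_a;

	if (!charge_at_enqueue)
		model_set_owner_r(&skb, &a);  /* old: charge source sock early */

	/* redirect: the skb now sits on sockB's ingress queue */

	/* sockA is closed here; its own queues are empty */
	left_on_a = a.charged;

	/* later, sockB consumes the skb */
	if (charge_at_enqueue)
		model_set_owner_r(&skb, &b);  /* new: charge the receiving sock */
	model_free(&skb);
	assert(b.charged == 0);

	return left_on_a;
}
```

With early charging, model_redirect_then_close(0) reports 256 bytes still charged to sockA when it closes; with charging deferred to the receiving side, model_redirect_then_close(1) reports 0.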

* Re: [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset
  2021-03-24 20:59 ` [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset John Fastabend
@ 2021-03-25  0:11   ` Cong Wang
  2021-03-25  2:28     ` John Fastabend
  0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2021-03-25  0:11 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

On Wed, Mar 24, 2021 at 1:59 PM John Fastabend <john.fastabend@gmail.com> wrote:
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index 47b7c5334c34..ecb5634b4c4a 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -754,6 +754,12 @@ static void tls_update(struct sock *sk, struct proto *p,
>
>         ctx = tls_get_ctx(sk);
>         if (likely(ctx)) {
> +               /* TLS does not have an unhash proto in SW cases, but we need
> +                * to ensure we stop using the sock_map unhash routine because
> +                * the associated psock is being removed. So use the original
> +                * unhash handler.
> +                */
> +               WRITE_ONCE(sk->sk_prot->unhash, p->unhash);
>                 ctx->sk_write_space = write_space;
>                 ctx->sk_proto = p;

It looks awkward to update sk->sk_proto inside tls_update(),
at least when ctx!=NULL.

What is wrong with updating it in sk_psock_restore_proto()
when inet_csk_has_ulp() is true? It looks better to me.

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 6c09d94be2e9..da5dc3ef0ee3 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -360,8 +360,8 @@ static inline void sk_psock_update_proto(struct sock *sk,
 static inline void sk_psock_restore_proto(struct sock *sk,
                                          struct sk_psock *psock)
 {
-       sk->sk_prot->unhash = psock->saved_unhash;
        if (inet_csk_has_ulp(sk)) {
+               sk->sk_prot->unhash = psock->sk_proto->unhash;
                tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
        } else {
                sk->sk_write_space = psock->saved_write_space;


sk_psock_restore_proto() is the only caller of tcp_update_ulp()
so should be equivalent.

Thanks.


* Re: [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting
  2021-03-24 20:59 ` [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting John Fastabend
@ 2021-03-25  0:44   ` Cong Wang
  2021-03-25  2:46     ` John Fastabend
  0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2021-03-25  0:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Incorrect accounting fwd_alloc can result in a warning when the socket
> is torn down,
>
>  [18455.319240] WARNING: CPU: 0 PID: 24075 at net/core/stream.c:208 sk_stream_kill_queues+0x21f/0x230
>  [...]
>  [18455.319543] Call Trace:
>  [18455.319556]  inet_csk_destroy_sock+0xba/0x1f0
>  [18455.319577]  tcp_rcv_state_process+0x1b4e/0x2380
>  [18455.319593]  ? lock_downgrade+0x3a0/0x3a0
>  [18455.319617]  ? tcp_finish_connect+0x1e0/0x1e0
>  [18455.319631]  ? sk_reset_timer+0x15/0x70
>  [18455.319646]  ? tcp_schedule_loss_probe+0x1b2/0x240
>  [18455.319663]  ? lock_release+0xb2/0x3f0
>  [18455.319676]  ? __release_sock+0x8a/0x1b0
>  [18455.319690]  ? lock_downgrade+0x3a0/0x3a0
>  [18455.319704]  ? lock_release+0x3f0/0x3f0
>  [18455.319717]  ? __tcp_close+0x2c6/0x790
>  [18455.319736]  ? tcp_v4_do_rcv+0x168/0x370
>  [18455.319750]  tcp_v4_do_rcv+0x168/0x370
>  [18455.319767]  __release_sock+0xbc/0x1b0
>  [18455.319785]  __tcp_close+0x2ee/0x790
>  [18455.319805]  tcp_close+0x20/0x80
>
> This currently happens because on redirect case we do skb_set_owner_r()
> with the original sock. This increments the fwd_alloc memory accounting
> on the original sock. Then on redirect we may push this into the queue
> of the psock we are redirecting to. When the skb is flushed from the
> queue we give the memory back to the original sock. The problem is if
> the original sock is destroyed/closed with skbs on another psocks queue
> then the original sock will not have a way to reclaim the memory before
> being destroyed. Then above warning will be thrown
>
>   sockA                          sockB
>
>   sk_psock_strp_read()
>    sk_psock_verdict_apply()
>      -- SK_REDIRECT --
>      sk_psock_skb_redirect()
>                                 skb_queue_tail(psock_other->ingress_skb..)
>
>   sk_close()
>    sock_map_unref()
>      sk_psock_put()
>        sk_psock_drop()
>          sk_psock_zap_ingress()
>
> At this point we have torn down our own psock, but have the outstanding
> skb in psock_other. Note that SK_PASS doesn't have this problem because
> the sk_psock_drop() logic releases the skb, its still associated with
> our psock.
>
> To resolve lets only account for sockets on the ingress queue that are
> still associated with the current socket. On the redirect case we will
> check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
> until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
> or received with sk_psock_skb_ingress memory will be claimed on psock_other.

You mean sk_psock_skb_ingress(), right?

>
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Fixes: 6fa9201a89898 ("bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/core/skmsg.c |   13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 1261512d6807..f150b5b63561 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -488,6 +488,7 @@ static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb
>         if (unlikely(!msg))
>                 return -EAGAIN;
>         sk_msg_init(msg);
> +       skb_set_owner_r(skb, sk);
>         return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
>  }
>
> @@ -790,7 +791,6 @@ static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int
>  {
>         switch (verdict) {
>         case __SK_REDIRECT:
> -               skb_set_owner_r(skb, sk);
>                 sk_psock_skb_redirect(skb);
>                 break;
>         case __SK_PASS:
> @@ -808,10 +808,6 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
>         rcu_read_lock();
>         prog = READ_ONCE(psock->progs.skb_verdict);
>         if (likely(prog)) {
> -               /* We skip full set_owner_r here because if we do a SK_PASS
> -                * or SK_DROP we can skip skb memory accounting and use the
> -                * TLS context.
> -                */
>                 skb->sk = psock->sk;
>                 tcp_skb_bpf_redirect_clear(skb);
>                 ret = sk_psock_bpf_run(psock, prog, skb);
> @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
>                 kfree_skb(skb);
>                 goto out;
>         }
> -       skb_set_owner_r(skb, sk);
>         prog = READ_ONCE(psock->progs.skb_verdict);
>         if (likely(prog)) {
> +               skb->sk = psock->sk;

Why is skb_orphan() not needed here?

Nit: You can just use 'sk' here, so "skb->sk = sk".


>                 tcp_skb_bpf_redirect_clear(skb);
>                 ret = sk_psock_bpf_run(psock, prog, skb);
>                 ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
> +               skb->sk = NULL;

Why do you want to set it to NULL here?

Thanks.


* Re: [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset
  2021-03-25  0:11   ` Cong Wang
@ 2021-03-25  2:28     ` John Fastabend
  2021-03-25 18:57       ` Cong Wang
  0 siblings, 1 reply; 11+ messages in thread
From: John Fastabend @ 2021-03-25  2:28 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

Cong Wang wrote:
> On Wed, Mar 24, 2021 at 1:59 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > index 47b7c5334c34..ecb5634b4c4a 100644
> > --- a/net/tls/tls_main.c
> > +++ b/net/tls/tls_main.c
> > @@ -754,6 +754,12 @@ static void tls_update(struct sock *sk, struct proto *p,
> >
> >         ctx = tls_get_ctx(sk);
> >         if (likely(ctx)) {
> > +               /* TLS does not have an unhash proto in SW cases, but we need
> > +                * to ensure we stop using the sock_map unhash routine because
> > +                * the associated psock is being removed. So use the original
> > +                * unhash handler.
> > +                */
> > +               WRITE_ONCE(sk->sk_prot->unhash, p->unhash);
> >                 ctx->sk_write_space = write_space;
> >                 ctx->sk_proto = p;
> 
> It looks awkward to update sk->sk_proto inside tls_update(),
> at least when ctx!=NULL.

hmm. It doesn't strike me as particularly awkward but OK.

> 
> What is wrong with updating it in sk_psock_restore_proto()
> when inet_csk_has_ulp() is true? It looks better to me.

It could be wrong if the ULP already has an unhash callback
assigned. But because we know inet_csk_has_ulp() really means
is_tls_attached(), it would be fine.

> 
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 6c09d94be2e9..da5dc3ef0ee3 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -360,8 +360,8 @@ static inline void sk_psock_update_proto(struct sock *sk,
>  static inline void sk_psock_restore_proto(struct sock *sk,
>                                           struct sk_psock *psock)
>  {
> -       sk->sk_prot->unhash = psock->saved_unhash;
>         if (inet_csk_has_ulp(sk)) {
> +               sk->sk_prot->unhash = psock->sk_proto->unhash;
>                 tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
>         } else {
>                 sk->sk_write_space = psock->saved_write_space;
> 
> 
> sk_psock_restore_proto() is the only caller of tcp_update_ulp()
> so should be equivalent.

Agree it is equivalent. I don't mind moving the assignment around
if folks think it's nicer.

> 
> Thanks.


* Re: [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting
  2021-03-25  0:44   ` Cong Wang
@ 2021-03-25  2:46     ` John Fastabend
  2021-03-25 19:27       ` Cong Wang
  0 siblings, 1 reply; 11+ messages in thread
From: John Fastabend @ 2021-03-25  2:46 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

Cong Wang wrote:
> On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Incorrect accounting fwd_alloc can result in a warning when the socket
> > is torn down,
> >

[...]

> > To resolve lets only account for sockets on the ingress queue that are
> > still associated with the current socket. On the redirect case we will
> > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
> > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
> > or received with sk_psock_skb_ingress memory will be claimed on psock_other.
                     ^^^^^^^^^^^^^^^^^^^^
> 
> You mean sk_psock_skb_ingress(), right?

Yes.

[...]

> > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
> >                 kfree_skb(skb);
> >                 goto out;
> >         }
> > -       skb_set_owner_r(skb, sk);
> >         prog = READ_ONCE(psock->progs.skb_verdict);
> >         if (likely(prog)) {
> > +               skb->sk = psock->sk;
> 
> Why is skb_orphan() not needed here?

These come from strparser and do not have skb->sk set.

> 
> Nit: You can just use 'sk' here, so "skb->sk = sk".

Sure that is a bit nicer, will respin with this.

> 
> 
> >                 tcp_skb_bpf_redirect_clear(skb);
> >                 ret = sk_psock_bpf_run(psock, prog, skb);
> >                 ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
> > +               skb->sk = NULL;
> 
> Why do you want to set it to NULL here?

So we don't cause the stack to throw other errors later if we
were to call skb_orphan for example. Various places in the skb
helpers expect both skb->sk and skb->destructor to be set together
and here we are just using it as a mechanism to feed the sk into
the BPF program side. The above skb_set_owner_r for example
would likely BUG().

> 
> Thanks.



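The reason for resetting skb->sk to NULL after the verdict program runs can be sketched with a userspace model of the skb->sk / skb->destructor pairing. The branch structure of model_skb_orphan() follows the kernel's skb_orphan(), which calls the destructor when one is set and otherwise hits BUG_ON(skb->sk); everything else (the toy structs, the -1 return in place of BUG(), and the name model_borrow_sk()) is a hypothetical simplification.

```c
#include <assert.h>
#include <stddef.h>

struct sock {
	int id;
};

/* Toy skb: only the owner pointer and destructor are modelled. */
struct skb {
	struct sock *sk;
	void (*destructor)(struct skb *skb);
};

/*
 * Model of skb_orphan(): sk and destructor are expected to be set
 * together. Returns 0 on success and -1 where the kernel helper
 * would hit BUG_ON(skb->sk).
 */
static int model_skb_orphan(struct skb *skb)
{
	if (skb->destructor) {
		skb->destructor(skb);
		skb->destructor = NULL;
		skb->sk = NULL;
		return 0;
	}
	return skb->sk ? -1 : 0;
}

/*
 * Borrow skb->sk only to hand the socket to the verdict program,
 * without installing a destructor, then run a later helper that
 * orphans the skb. clear_after selects whether sk is reset first.
 */
static int model_borrow_sk(int clear_after)
{
	struct sock sk = { .id = 1 };
	struct skb skb = { .sk = NULL, .destructor = NULL };

	skb.sk = &sk;
	/* ... the verdict program would run here and may read skb.sk ... */
	if (clear_after)
		skb.sk = NULL;

	return model_skb_orphan(&skb);
}
```

In this model, model_borrow_sk(1) returns 0, while model_borrow_sk(0) returns -1 because the later orphan step sees an owner pointer with no destructor.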

* Re: [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset
  2021-03-25  2:28     ` John Fastabend
@ 2021-03-25 18:57       ` Cong Wang
  2021-03-26  0:57         ` John Fastabend
  0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2021-03-25 18:57 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

On Wed, Mar 24, 2021 at 7:28 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > On Wed, Mar 24, 2021 at 1:59 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > > index 47b7c5334c34..ecb5634b4c4a 100644
> > > --- a/net/tls/tls_main.c
> > > +++ b/net/tls/tls_main.c
> > > @@ -754,6 +754,12 @@ static void tls_update(struct sock *sk, struct proto *p,
> > >
> > >         ctx = tls_get_ctx(sk);
> > >         if (likely(ctx)) {
> > > +               /* TLS does not have an unhash proto in SW cases, but we need
> > > +                * to ensure we stop using the sock_map unhash routine because
> > > +                * the associated psock is being removed. So use the original
> > > +                * unhash handler.
> > > +                */
> > > +               WRITE_ONCE(sk->sk_prot->unhash, p->unhash);
> > >                 ctx->sk_write_space = write_space;
> > >                 ctx->sk_proto = p;
> >
> > It looks awkward to update sk->sk_proto inside tls_update(),
> > at least when ctx!=NULL.
>
> hmm. It doesn't strike me as particularly awkward but OK.

I read tls_update() as "updating ctx when it is initialized", with your
patch, we are updating sk->sk_prot->unhash too when updating ctx,
pretty much like a piggyback, hence it reads odd to me.

Thanks.


* Re: [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting
  2021-03-25  2:46     ` John Fastabend
@ 2021-03-25 19:27       ` Cong Wang
  2021-03-26  0:58         ` John Fastabend
  0 siblings, 1 reply; 11+ messages in thread
From: Cong Wang @ 2021-03-25 19:27 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

On Wed, Mar 24, 2021 at 7:46 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Incorrect accounting fwd_alloc can result in a warning when the socket
> > > is torn down,
> > >
>
> [...]
>
> > > To resolve lets only account for sockets on the ingress queue that are
> > > still associated with the current socket. On the redirect case we will
> > > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
> > > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
> > > or received with sk_psock_skb_ingress memory will be claimed on psock_other.
>                      ^^^^^^^^^^^^^^^^^^^^
> >
> > You mean sk_psock_skb_ingress(), right?
>
> Yes.

skb_send_sock_locked() actually allocates its own skb when sending, hence
it uses a different skb for memory accounting.

>
> [...]
>
> > > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
> > >                 kfree_skb(skb);
> > >                 goto out;
> > >         }
> > > -       skb_set_owner_r(skb, sk);
> > >         prog = READ_ONCE(psock->progs.skb_verdict);
> > >         if (likely(prog)) {
> > > +               skb->sk = psock->sk;
> >
> > Why is skb_orphan() not needed here?
>
> These come from strparser which do not have skb->sk set.

Hmm, but sk_psock_verdict_recv() passes a clone too, like
strparser, so either we need it for both, or not at all. Clones
do not have skb->sk, so I think you can remove the one in
sk_psock_verdict_recv() too.


>
> >
> > Nit: You can just use 'sk' here, so "skb->sk = sk".
>
> Sure that is a bit nicer, will respin with this.
>
> >
> >
> > >                 tcp_skb_bpf_redirect_clear(skb);
> > >                 ret = sk_psock_bpf_run(psock, prog, skb);
> > >                 ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
> > > +               skb->sk = NULL;
> >
> > Why do you want to set it to NULL here?
>
> So we don't cause the stack to throw other errors later if we
> were to call skb_orphan for example. Various places in the skb
> helpers expect both skb->sk and skb->destructor to be set together
> and here we are just using it as a mechanism to feed the sk into
> the BPF program side. The above skb_set_owner_r for example
> would likely BUG().

Sounds reasonable.

Thanks.


* Re: [bpf PATCH 1/2] bpf, sockmap: fix sk->prot unhash op reset
  2021-03-25 18:57       ` Cong Wang
@ 2021-03-26  0:57         ` John Fastabend
  0 siblings, 0 replies; 11+ messages in thread
From: John Fastabend @ 2021-03-26  0:57 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

Cong Wang wrote:
> On Wed, Mar 24, 2021 at 7:28 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > On Wed, Mar 24, 2021 at 1:59 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > > > index 47b7c5334c34..ecb5634b4c4a 100644
> > > > --- a/net/tls/tls_main.c
> > > > +++ b/net/tls/tls_main.c
> > > > @@ -754,6 +754,12 @@ static void tls_update(struct sock *sk, struct proto *p,
> > > >
> > > >         ctx = tls_get_ctx(sk);
> > > >         if (likely(ctx)) {
> > > > +               /* TLS does not have an unhash proto in SW cases, but we need
> > > > +                * to ensure we stop using the sock_map unhash routine because
> > > > +                * the associated psock is being removed. So use the original
> > > > +                * unhash handler.
> > > > +                */
> > > > +               WRITE_ONCE(sk->sk_prot->unhash, p->unhash);
> > > >                 ctx->sk_write_space = write_space;
> > > >                 ctx->sk_proto = p;
> > >
> > > It looks awkward to update sk->sk_proto inside tls_update(),
> > > at least when ctx!=NULL.
> >
> > hmm. It doesn't strike me as particularly awkward but OK.
> 
> I read tls_update() as "updating ctx when it is initialized", with your
> patch, we are updating sk->sk_prot->unhash too when updating ctx,
> pretty much like a piggyback, hence it reads odd to me.
> 
> Thanks.


OK convinced.


* Re: [bpf PATCH 2/2] bpf, sockmap: fix incorrect fwd_alloc accounting
  2021-03-25 19:27       ` Cong Wang
@ 2021-03-26  0:58         ` John Fastabend
  0 siblings, 0 replies; 11+ messages in thread
From: John Fastabend @ 2021-03-26  0:58 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Andrii Nakryiko, Daniel Borkmann, Alexei Starovoitov, bpf,
	Linux Kernel Network Developers, Lorenz Bauer

Cong Wang wrote:
> On Wed, Mar 24, 2021 at 7:46 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > On Wed, Mar 24, 2021 at 2:00 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >
> > > > Incorrect accounting fwd_alloc can result in a warning when the socket
> > > > is torn down,
> > > >
> >
> > [...]
> >
> > > > To resolve lets only account for sockets on the ingress queue that are
> > > > still associated with the current socket. On the redirect case we will
> > > > check memory limits per 6fa9201a89898, but will omit fwd_alloc accounting
> > > > until skb is actually enqueued. When the skb is sent via skb_send_sock_locked
> > > > or received with sk_psock_skb_ingress memory will be claimed on psock_other.
> >                      ^^^^^^^^^^^^^^^^^^^^
> > >
> > > You mean sk_psock_skb_ingress(), right?
> >
> > Yes.
> 
> skb_send_sock_locked() actually allocates its own skb when sending, hence
> it uses a different skb for memory accounting.
> 
> >
> > [...]
> >
> > > > @@ -880,12 +876,13 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
> > > >                 kfree_skb(skb);
> > > >                 goto out;
> > > >         }
> > > > -       skb_set_owner_r(skb, sk);
> > > >         prog = READ_ONCE(psock->progs.skb_verdict);
> > > >         if (likely(prog)) {
> > > > +               skb->sk = psock->sk;
> > >
> > > Why is skb_orphan() not needed here?
> >
> > These come from strparser which do not have skb->sk set.
> 
> Hmm, but sk_psock_verdict_recv() passes a clone too, like
> strparser, so either we need it for both, or not at all. Clones
> do not have skb->sk, so I think you can remove the one in
> sk_psock_verdict_recv() too.

Agree skb_orphan can just be removed, I was being overly
paranoid.

Thanks.

