BPF Archive on lore.kernel.org
* [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes
@ 2020-10-09 18:35 John Fastabend
  2020-10-09 18:36 ` [bpf-next PATCH v3 1/6] bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits John Fastabend
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:35 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

Users of sockmap and skmsg trying to build proxies and other tools
have pointed out to me that the error handling can be problematic. If
the proxy is under-provisioned and/or the BPF admin does not have
the ability to update/modify memory provisions on the sockets,
it's possible data may be dropped. For some things we have retries
so everything works out OK, but for most things this is likely
not great. And things go bad.

The original design dropped memory accounting on the receive
socket as early as possible. We did this early in sk_skb
handling and then charged it to the redirect socket immediately
after running the BPF program.

But, this design caused a fundamental problem. Namely, what should we do
if we redirect to a socket that has already reached its socket memory
limits? For proxy use cases the network admin can tune memory limits.
But, in general we punted on this problem and told folks to simply make
their memory limits high enough to handle their workload. This is not a
very good answer. When deploying into environments where we expect this
to be transparent, it no longer is, because we need to tune params.
In fact it's really only viable in cases where we have fine-grained
control over the application, for example a proxy redirecting from an
ingress socket to an egress socket. The result is I get bug
reports, because it's surprising for one, but more importantly it also
breaks some use cases. So let's fix it.

This series cleans up the different cases so that in many common
modes, such as passing packet up to receive socket, we can simply
use the underlying assumption that the TCP stack already has done
memory accounting.

Next, instead of trying to do memory accounting against the socket
we plan to redirect into, we keep memory accounting on the receive
socket until the skb can be put on the redirect socket. This means
if we do an egress redirect to a socket and sock_writeable() returns
EAGAIN, we can requeue the skb on the workqueue and try again. The
same scenario plays out for ingress. If the skb can not be put on
the receive queue of the redirect socket, then we simply requeue and
retry. In both cases memory is still accounted for against the
receiving socket.

This also handles head-of-line blocking. With the above scheme the
skb is on a queue associated with the socket it will be sent/received
on, but the memory accounting is against the receive socket. This
means the receive socket can advance to the next skb and avoid
head-of-line blocking, at least until the socket's receive memory
runs out. This puts a maximum size on the amount of data any
socket can enqueue, giving us bounds on the skb lists so they can't grow
indefinitely.

Overall I think this is a win. Tested with test_sockmap.

These are fixes, but I tagged it for bpf-next considering we are
at -rc8.

v1->v2: Fix uninitialized/unused variables (kernel test robot)
v2->v3: fix typo in patch2 err=0 needs to be <0 so use err=-EIO

---

John Fastabend (6):
      bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits
      bpf, sockmap: On receive programs try to fast track SK_PASS ingress
      bpf, sockmap: remove skb_set_owner_w wmem will be taken later from sendpage
      bpf, sockmap: remove dropped data on errors in redirect case
      bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup
      bpf, sockmap: Add memory accounting so skbs on ingress lists are visible


 net/core/skmsg.c |   83 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 45 insertions(+), 38 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [bpf-next PATCH v3 1/6] bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
@ 2020-10-09 18:36 ` John Fastabend
  2020-10-09 18:36 ` [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress John Fastabend
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:36 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

For the sk_skb case where the skb_verdict program returns SK_PASS to
continue to pass the packet up the stack, the memory limits were already
checked before enqueuing in skb_queue_tail from the TCP side. So, let's
remove the extra checks here. The theory is that if the TCP stack believes
we have memory to receive the packet, then let's trust the stack and not
double-check the limits.

In fact the accounting here can cause a drop if sk_rmem_alloc has increased
after the stack accepted this packet, but before the duplicate check here.
And worse, if this happens there is no retransmit, because the TCP stack
already believes the data has been received.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 4b5f7c8fecd1..040ae1d75b65 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -771,6 +771,7 @@ EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);
 static void sk_psock_verdict_apply(struct sk_psock *psock,
 				   struct sk_buff *skb, int verdict)
 {
+	struct tcp_skb_cb *tcp;
 	struct sock *sk_other;
 
 	switch (verdict) {
@@ -780,16 +781,12 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 		    !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
 			goto out_free;
 		}
-		if (atomic_read(&sk_other->sk_rmem_alloc) <=
-		    sk_other->sk_rcvbuf) {
-			struct tcp_skb_cb *tcp = TCP_SKB_CB(skb);
 
-			tcp->bpf.flags |= BPF_F_INGRESS;
-			skb_queue_tail(&psock->ingress_skb, skb);
-			schedule_work(&psock->work);
-			break;
-		}
-		goto out_free;
+		tcp = TCP_SKB_CB(skb);
+		tcp->bpf.flags |= BPF_F_INGRESS;
+		skb_queue_tail(&psock->ingress_skb, skb);
+		schedule_work(&psock->work);
+		break;
 	case __SK_REDIRECT:
 		sk_psock_skb_redirect(skb);
 		break;



* [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
  2020-10-09 18:36 ` [bpf-next PATCH v3 1/6] bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits John Fastabend
@ 2020-10-09 18:36 ` John Fastabend
  2020-10-12  9:03   ` Jakub Sitnicki
  2020-10-09 18:36 ` [bpf-next PATCH v3 3/6] bpf, sockmap: remove skb_set_owner_w wmem will be taken later from sendpage John Fastabend
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:36 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

When we receive an skb and the ingress skb verdict program returns
SK_PASS, we currently set the ingress flag and put it on the workqueue
so it can be turned into a sk_msg and put on the sk_msg ingress queue.
Then we finally tell userspace with the data_ready hook.

Here we observe that if the workqueue is empty then we can try to
convert into a sk_msg type and call data_ready directly, without
bouncing through a workqueue. It's a common pattern to have a recv
verdict program for visibility that always returns SK_PASS. In this
case, unless there is an ENOMEM error or we overrun the socket, we
can avoid the workqueue completely, only using it when we fall back
to error cases caused by memory pressure.

By doing this we eliminate another case where data may be dropped
if errors occur on memory limits in the workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 040ae1d75b65..4b160d97b7f9 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 {
 	struct tcp_skb_cb *tcp;
 	struct sock *sk_other;
+	int err = -EIO;
 
 	switch (verdict) {
 	case __SK_PASS:
@@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 
 		tcp = TCP_SKB_CB(skb);
 		tcp->bpf.flags |= BPF_F_INGRESS;
-		skb_queue_tail(&psock->ingress_skb, skb);
-		schedule_work(&psock->work);
+
+		/* If the queue is empty then we can submit directly
+		 * into the msg queue. If its not empty we have to
+		 * queue work otherwise we may get OOO data. Otherwise,
+		 * if sk_psock_skb_ingress errors will be handled by
+		 * retrying later from workqueue.
+		 */
+		if (skb_queue_empty(&psock->ingress_skb)) {
+			err = sk_psock_skb_ingress(psock, skb);
+		}
+		if (err < 0) {
+			skb_queue_tail(&psock->ingress_skb, skb);
+			schedule_work(&psock->work);
+		}
 		break;
 	case __SK_REDIRECT:
 		sk_psock_skb_redirect(skb);



* [bpf-next PATCH v3 3/6] bpf, sockmap: remove skb_set_owner_w wmem will be taken later from sendpage
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
  2020-10-09 18:36 ` [bpf-next PATCH v3 1/6] bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits John Fastabend
  2020-10-09 18:36 ` [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress John Fastabend
@ 2020-10-09 18:36 ` John Fastabend
  2020-10-09 18:37 ` [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case John Fastabend
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:36 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

The skb_set_owner_w is unnecessary here. The sendpage call will create a
fresh skb and set the owner correctly from the workqueue. It's also not
entirely harmless: it consumes cycles and also impacts resource accounting
by increasing sk_wmem_alloc. This charges the socket we are going to
send to for the skb, but we will put the skb on the workqueue for some time
before this happens, so we are artificially inflating sk_wmem_alloc for
this period. Further, we don't know how many skbs will be used to send the
packet or how it will be broken up when sent over the new socket, so
charging it with one big sum is also not correct when the workqueue may
break it up if facing memory pressure. Seeing we don't know how/when
this is going to be sent, drop the early accounting.

A later patch will do proper accounting charged on receive socket for
the case where skbs get enqueued on the workqueue.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |    2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 4b160d97b7f9..7389d5d7e7f8 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -728,8 +728,6 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
 	    (ingress &&
 	     atomic_read(&sk_other->sk_rmem_alloc) <=
 	     sk_other->sk_rcvbuf)) {
-		if (!ingress)
-			skb_set_owner_w(skb, sk_other);
 		skb_queue_tail(&psock_other->ingress_skb, skb);
 		schedule_work(&psock_other->work);
 	} else {



* [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
                   ` (2 preceding siblings ...)
  2020-10-09 18:36 ` [bpf-next PATCH v3 3/6] bpf, sockmap: remove skb_set_owner_w wmem will be taken later from sendpage John Fastabend
@ 2020-10-09 18:37 ` John Fastabend
  2020-10-12 12:04   ` Jakub Sitnicki
  2020-10-09 18:37 ` [bpf-next PATCH v3 5/6] bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup John Fastabend
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:37 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

In the sk_skb redirect case we didn't handle the case where we overrun
sk_rmem_alloc on an ingress redirect or sk_wmem_alloc on egress.
Because we didn't have anything implemented, we simply dropped the skb.
This meant data could be dropped if socket memory accounting was in
place.

This fixes the above dropped-data case by moving the memory checks
later in the code, where we actually do the send or recv. This pushes
those checks into the workqueue and allows us to return an EAGAIN error,
which in turn allows us to try again later from the workqueue.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 7389d5d7e7f8..880b84baab5e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -433,10 +433,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
 static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 			       u32 off, u32 len, bool ingress)
 {
-	if (ingress)
-		return sk_psock_skb_ingress(psock, skb);
-	else
+	if (!ingress) {
+		if (!sock_writeable(psock->sk))
+			return -EAGAIN;
 		return skb_send_sock_locked(psock->sk, skb, off, len);
+	}
+	return sk_psock_skb_ingress(psock, skb);
 }
 
 static void sk_psock_backlog(struct work_struct *work)
@@ -709,30 +711,28 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
 {
 	struct sk_psock *psock_other;
 	struct sock *sk_other;
-	bool ingress;
 
 	sk_other = tcp_skb_bpf_redirect_fetch(skb);
+	/* This error is a buggy BPF program, it returned a redirect
+	 * return code, but then didn't set a redirect interface.
+	 */
 	if (unlikely(!sk_other)) {
 		kfree_skb(skb);
 		return;
 	}
 	psock_other = sk_psock(sk_other);
+	/* This error indicates the socket is being torn down or had another
+	 * error that caused the pipe to break. We can't send a packet on
+	 * a socket that is in this state so we drop the skb.
+	 */
 	if (!psock_other || sock_flag(sk_other, SOCK_DEAD) ||
 	    !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
 		kfree_skb(skb);
 		return;
 	}
 
-	ingress = tcp_skb_bpf_ingress(skb);
-	if ((!ingress && sock_writeable(sk_other)) ||
-	    (ingress &&
-	     atomic_read(&sk_other->sk_rmem_alloc) <=
-	     sk_other->sk_rcvbuf)) {
-		skb_queue_tail(&psock_other->ingress_skb, skb);
-		schedule_work(&psock_other->work);
-	} else {
-		kfree_skb(skb);
-	}
+	skb_queue_tail(&psock_other->ingress_skb, skb);
+	schedule_work(&psock_other->work);
 }
 
 static void sk_psock_tls_verdict_apply(struct sk_buff *skb, int verdict)



* [bpf-next PATCH v3 5/6] bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
                   ` (3 preceding siblings ...)
  2020-10-09 18:37 ` [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case John Fastabend
@ 2020-10-09 18:37 ` John Fastabend
  2020-10-09 18:37 ` [bpf-next PATCH v3 6/6] bpf, sockmap: Add memory accounting so skbs on ingress lists are visible John Fastabend
  2020-10-12  1:10 ` [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes patchwork-bot+netdevbpf
  6 siblings, 0 replies; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:37 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

Calling skb_orphan() is unnecessary in the strp rcv handler because the skb
comes from a skb_clone() in __strp_recv(), so it never has a destructor or
an sk assigned. Plus it's confusing to read because it might hint to the
reader that the skb could have an sk assigned, which is not true. Even if
we did have an sk assigned, it would be cleaner to simply wait for the
upcoming kfree_skb().

Additionally, move the comment about the strparser clone up so it's closer
to the logic it is describing, and add to it so that it is more complete.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 880b84baab5e..3e78f2a80747 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -686,15 +686,16 @@ static int sk_psock_bpf_run(struct sk_psock *psock, struct bpf_prog *prog,
 {
 	int ret;
 
+	/* strparser clones the skb before handing it to a upper layer,
+	 * meaning we have the same data, but sk is NULL. We do want an
+	 * sk pointer though when we run the BPF program. So we set it
+	 * here and then NULL it to ensure we don't trigger a BUG_ON()
+	 * in skb/sk operations later if kfree_skb is called with a
+	 * valid skb->sk pointer and no destructor assigned.
+	 */
 	skb->sk = psock->sk;
 	bpf_compute_data_end_sk_skb(skb);
 	ret = bpf_prog_run_pin_on_cpu(prog, skb);
-	/* strparser clones the skb before handing it to a upper layer,
-	 * meaning skb_orphan has been called. We NULL sk on the way out
-	 * to ensure we don't trigger a BUG_ON() in skb/sk operations
-	 * later and because we are not charging the memory of this skb
-	 * to any socket yet.
-	 */
 	skb->sk = NULL;
 	return ret;
 }
@@ -824,7 +825,6 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 	}
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
-		skb_orphan(skb);
 		tcp_skb_bpf_redirect_clear(skb);
 		ret = sk_psock_bpf_run(psock, prog, skb);
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));



* [bpf-next PATCH v3 6/6] bpf, sockmap: Add memory accounting so skbs on ingress lists are visible
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
                   ` (4 preceding siblings ...)
  2020-10-09 18:37 ` [bpf-next PATCH v3 5/6] bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup John Fastabend
@ 2020-10-09 18:37 ` John Fastabend
  2020-10-12  1:10 ` [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes patchwork-bot+netdevbpf
  6 siblings, 0 replies; 14+ messages in thread
From: John Fastabend @ 2020-10-09 18:37 UTC (permalink / raw)
  To: john.fastabend, alexei.starovoitov, daniel; +Cc: netdev, bpf, jakub, lmb

Move skb->sk assignment out of sk_psock_bpf_run() and into individual
callers. Then we can use a proper skb_set_owner_r() call to assign an
sk to an skb. This improves things by also charging the truesize against
the socket's sk_rmem_alloc counter. With this done we get some accounting
in place to ensure the memory associated with skbs on the workqueue is
still being accounted for somewhere. Finally, by using skb_set_owner_r
the destructor is set up so we can just let the normal kfree_skb logic
recover the memory. Combined with the previous patch dropping skb_orphan()
we can now recover from memory pressure and maintain accounting.

Note, we will charge the skbs against their originating socket even
if they are being redirected into another socket. Once the skb completes
the redirect op, kfree_skb will give the memory back. This is important
because if we charged the socket we are redirecting to (as was
done before this series), the sock_writeable() test could fail because
the skb trying to be sent is already charged against the socket.

Also, the TLS case is special. Here we wait until we have decided not to
simply PASS the packet up the stack. In the case where we PASS the
packet up the stack, we already have an skb which is accounted for on
the TLS socket context.

For the parser case we continue to just set/clear skb->sk. This is
because the skb being used here may be combined with other skbs or
turned into multiple skbs, depending on the parser logic. For example
the parser could request a payload length greater than skb->len so
that the strparser needs to collect multiple skbs. At any rate,
the final result will be handled in the strparser recv callback.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 3e78f2a80747..881a5b290946 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -684,20 +684,8 @@ EXPORT_SYMBOL_GPL(sk_psock_msg_verdict);
 static int sk_psock_bpf_run(struct sk_psock *psock, struct bpf_prog *prog,
 			    struct sk_buff *skb)
 {
-	int ret;
-
-	/* strparser clones the skb before handing it to a upper layer,
-	 * meaning we have the same data, but sk is NULL. We do want an
-	 * sk pointer though when we run the BPF program. So we set it
-	 * here and then NULL it to ensure we don't trigger a BUG_ON()
-	 * in skb/sk operations later if kfree_skb is called with a
-	 * valid skb->sk pointer and no destructor assigned.
-	 */
-	skb->sk = psock->sk;
 	bpf_compute_data_end_sk_skb(skb);
-	ret = bpf_prog_run_pin_on_cpu(prog, skb);
-	skb->sk = NULL;
-	return ret;
+	return bpf_prog_run_pin_on_cpu(prog, skb);
 }
 
 static struct sk_psock *sk_psock_from_strp(struct strparser *strp)
@@ -736,10 +724,11 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
 	schedule_work(&psock_other->work);
 }
 
-static void sk_psock_tls_verdict_apply(struct sk_buff *skb, int verdict)
+static void sk_psock_tls_verdict_apply(struct sk_buff *skb, struct sock *sk, int verdict)
 {
 	switch (verdict) {
 	case __SK_REDIRECT:
+		skb_set_owner_r(skb, sk);
 		sk_psock_skb_redirect(skb);
 		break;
 	case __SK_PASS:
@@ -757,11 +746,17 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
 	rcu_read_lock();
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
+		/* We skip full set_owner_r here because if we do a SK_PASS
+		 * or SK_DROP we can skip skb memory accounting and use the
+		 * TLS context.
+		 */
+		skb->sk = psock->sk;
 		tcp_skb_bpf_redirect_clear(skb);
 		ret = sk_psock_bpf_run(psock, prog, skb);
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
+		skb->sk = NULL;
 	}
-	sk_psock_tls_verdict_apply(skb, ret);
+	sk_psock_tls_verdict_apply(skb, psock->sk, ret);
 	rcu_read_unlock();
 	return ret;
 }
@@ -823,6 +818,7 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 		kfree_skb(skb);
 		goto out;
 	}
+	skb_set_owner_r(skb, sk);
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
 		tcp_skb_bpf_redirect_clear(skb);
@@ -847,8 +843,11 @@ static int sk_psock_strp_parse(struct strparser *strp, struct sk_buff *skb)
 
 	rcu_read_lock();
 	prog = READ_ONCE(psock->progs.skb_parser);
-	if (likely(prog))
+	if (likely(prog)) {
+		skb->sk = psock->sk;
 		ret = sk_psock_bpf_run(psock, prog, skb);
+		skb->sk = NULL;
+	}
 	rcu_read_unlock();
 	return ret;
 }



* Re: [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes
  2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
                   ` (5 preceding siblings ...)
  2020-10-09 18:37 ` [bpf-next PATCH v3 6/6] bpf, sockmap: Add memory accounting so skbs on ingress lists are visible John Fastabend
@ 2020-10-12  1:10 ` patchwork-bot+netdevbpf
  6 siblings, 0 replies; 14+ messages in thread
From: patchwork-bot+netdevbpf @ 2020-10-12  1:10 UTC (permalink / raw)
  To: John Fastabend; +Cc: alexei.starovoitov, daniel, netdev, bpf, jakub, lmb

Hello:

This series was applied to bpf/bpf-next.git (refs/heads/master):

On Fri, 09 Oct 2020 11:35:53 -0700 you wrote:
> Users of sockmap and skmsg trying to build proxys and other tools
> have pointed out to me the error handling can be problematic. If
> the proxy is under-provisioned and/or the BPF admin does not have
> the ability to update/modify memory provisions on the sockets
> its possible data may be dropped. For some things we have retries
> so everything works out OK, but for most things this is likely
> not great. And things go bad.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v3,1/6] bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits
    https://git.kernel.org/bpf/bpf-next/c/cfea28f890cf
  - [bpf-next,v3,2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress
    https://git.kernel.org/bpf/bpf-next/c/9ecbfb06a078
  - [bpf-next,v3,3/6] bpf, sockmap: remove skb_set_owner_w wmem will be taken later from sendpage
    https://git.kernel.org/bpf/bpf-next/c/29545f4977cf
  - [bpf-next,v3,4/6] bpf, sockmap: remove dropped data on errors in redirect case
    https://git.kernel.org/bpf/bpf-next/c/9047f19e7ccb
  - [bpf-next,v3,5/6] bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup
    https://git.kernel.org/bpf/bpf-next/c/10d58d006356
  - [bpf-next,v3,6/6] bpf, sockmap: Add memory accounting so skbs on ingress lists are visible
    https://git.kernel.org/bpf/bpf-next/c/0b17ad25d8d1

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress
  2020-10-09 18:36 ` [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress John Fastabend
@ 2020-10-12  9:03   ` Jakub Sitnicki
  2020-10-12 15:33     ` John Fastabend
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Sitnicki @ 2020-10-12  9:03 UTC (permalink / raw)
  To: John Fastabend; +Cc: alexei.starovoitov, daniel, netdev, bpf, lmb

Hey John,

Exciting to see this work :-)

On Fri, Oct 09, 2020 at 08:36 PM CEST, John Fastabend wrote:
> When we receive an skb and the ingress skb verdict program returns
> SK_PASS we currently set the ingress flag and put it on the workqueue
> so it can be turned into a sk_msg and put on the sk_msg ingress queue.
> Then finally telling userspace with data_ready hook.
>
> Here we observe that if the workqueue is empty then we can try to
> convert into a sk_msg type and call data_ready directly without
> bouncing through a workqueue. Its a common pattern to have a recv
> verdict program for visibility that always returns SK_PASS. In this
> case unless there is an ENOMEM error or we overrun the socket we
> can avoid the workqueue completely only using it when we fall back
> to error cases caused by memory pressure.
>
> By doing this we eliminate another case where data may be dropped
> if errors occur on memory limits in workqueue.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/core/skmsg.c |   17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 040ae1d75b65..4b160d97b7f9 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>  {
>  	struct tcp_skb_cb *tcp;
>  	struct sock *sk_other;
> +	int err = -EIO;
>
>  	switch (verdict) {
>  	case __SK_PASS:
> @@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>
>  		tcp = TCP_SKB_CB(skb);
>  		tcp->bpf.flags |= BPF_F_INGRESS;
> -		skb_queue_tail(&psock->ingress_skb, skb);
> -		schedule_work(&psock->work);
> +
> +		/* If the queue is empty then we can submit directly
> +		 * into the msg queue. If its not empty we have to
> +		 * queue work otherwise we may get OOO data. Otherwise,
> +		 * if sk_psock_skb_ingress errors will be handled by
> +		 * retrying later from workqueue.
> +		 */
> +		if (skb_queue_empty(&psock->ingress_skb)) {
> +			err = sk_psock_skb_ingress(psock, skb);

When going through the workqueue (sk_psock_backlog), we will also check
if socket didn't get detached from the process, that is if
psock->sk->sk_socket != NULL, before queueing into msg queue.

Do we need a similar check here?

> +		}
> +		if (err < 0) {
> +			skb_queue_tail(&psock->ingress_skb, skb);
> +			schedule_work(&psock->work);
> +		}
>  		break;
>  	case __SK_REDIRECT:
>  		sk_psock_skb_redirect(skb);


* Re: [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case
  2020-10-09 18:37 ` [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case John Fastabend
@ 2020-10-12 12:04   ` Jakub Sitnicki
  2020-10-12 17:19     ` John Fastabend
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Sitnicki @ 2020-10-12 12:04 UTC (permalink / raw)
  To: John Fastabend; +Cc: alexei.starovoitov, daniel, netdev, bpf, lmb

On Fri, Oct 09, 2020 at 08:37 PM CEST, John Fastabend wrote:
> In the sk_skb redirect case we didn't handle the case where we overrun
> the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
> Because we didn't have anything implemented we simply dropped the skb.
> This meant data could be dropped if socket memory accounting was in
> place.
>
> This fixes the above dropped data case by moving the memory checks
> later in the code where we actually do the send or recv. This pushes
> those checks into the workqueue and allows us to return an EAGAIN error
> which in turn allows us to try again later from the workqueue.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/core/skmsg.c |   28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
>

[...]

> @@ -709,30 +711,28 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
>  {
>  	struct sk_psock *psock_other;
>  	struct sock *sk_other;
> -	bool ingress;
>
>  	sk_other = tcp_skb_bpf_redirect_fetch(skb);
> +	/* This error is a buggy BPF program, it returned a redirect
> +	 * return code, but then didn't set a redirect interface.
> +	 */
>  	if (unlikely(!sk_other)) {
>  		kfree_skb(skb);
>  		return;
>  	}
>  	psock_other = sk_psock(sk_other);
> +	/* This error indicates the socket is being torn down or had another
> +	 * error that caused the pipe to break. We can't send a packet on
> +	 * a socket that is in this state so we drop the skb.
> +	 */
>  	if (!psock_other || sock_flag(sk_other, SOCK_DEAD) ||
>  	    !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
>  		kfree_skb(skb);
>  		return;
>  	}
>
> -	ingress = tcp_skb_bpf_ingress(skb);
> -	if ((!ingress && sock_writeable(sk_other)) ||
> -	    (ingress &&
> -	     atomic_read(&sk_other->sk_rmem_alloc) <=
> -	     sk_other->sk_rcvbuf)) {

I'm wondering why the check for going over socket's rcvbuf was removed?

I see that we now rely exclusively on
sk_psock_skb_ingress→sk_rmem_schedule for sk_rmem_alloc checks, which I
don't think applies the rcvbuf limit.

> -		skb_queue_tail(&psock_other->ingress_skb, skb);
> -		schedule_work(&psock_other->work);
> -	} else {
> -		kfree_skb(skb);
> -	}
> +	skb_queue_tail(&psock_other->ingress_skb, skb);
> +	schedule_work(&psock_other->work);
>  }
>
>  static void sk_psock_tls_verdict_apply(struct sk_buff *skb, int verdict)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress
  2020-10-12  9:03   ` Jakub Sitnicki
@ 2020-10-12 15:33     ` John Fastabend
  2020-10-13 19:43       ` Jakub Sitnicki
  0 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2020-10-12 15:33 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: alexei.starovoitov, daniel, netdev, bpf, lmb

Jakub Sitnicki wrote:
> Hey John,
> 
> Exciting to see this work :-)
> 
> On Fri, Oct 09, 2020 at 08:36 PM CEST, John Fastabend wrote:
> > When we receive an skb and the ingress skb verdict program returns
> > SK_PASS we currently set the ingress flag and put it on the workqueue
> > so it can be turned into a sk_msg and put on the sk_msg ingress queue.
> > Then finally telling userspace with data_ready hook.
> >
> > Here we observe that if the workqueue is empty then we can try to
> > convert into a sk_msg type and call data_ready directly without
> > bouncing through a workqueue. It's a common pattern to have a recv
> > verdict program for visibility that always returns SK_PASS. In this
> > case unless there is an ENOMEM error or we overrun the socket we
> > can avoid the workqueue completely only using it when we fall back
> > to error cases caused by memory pressure.
> >
> > By doing this we eliminate another case where data may be dropped
> > if errors occur on memory limits in workqueue.
> >
> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  net/core/skmsg.c |   17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 040ae1d75b65..4b160d97b7f9 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
> >  {
> >  	struct tcp_skb_cb *tcp;
> >  	struct sock *sk_other;
> > +	int err = -EIO;
> >
> >  	switch (verdict) {
> >  	case __SK_PASS:
> > @@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
> >
> >  		tcp = TCP_SKB_CB(skb);
> >  		tcp->bpf.flags |= BPF_F_INGRESS;
> > -		skb_queue_tail(&psock->ingress_skb, skb);
> > -		schedule_work(&psock->work);
> > +
> > +		/* If the queue is empty then we can submit directly
> > +		 * into the msg queue. If it's not empty we have to
> > +		 * queue work, otherwise we may get OOO data. Any
> > +		 * sk_psock_skb_ingress errors will be handled by
> > +		 * retrying later from the workqueue.
> > +		 */
> > +		if (skb_queue_empty(&psock->ingress_skb)) {
> > +			err = sk_psock_skb_ingress(psock, skb);
> 
> When going through the workqueue (sk_psock_backlog), we will also check
> if socket didn't get detached from the process, that is if
> psock->sk->sk_socket != NULL, before queueing into msg queue.

The sk_socket check is only for the egress path,

  sk_psock_handle_skb -> skb_send_sock_locked -> kernel_sendmsg_locked

Then do_tcp_sendpages() uses sk_socket and I don't see any checks for
sk_socket being set. Although I think it's worth looking through to see
whether the psock/sk state is always such that we have sk_socket there;
I don't recall off-hand where that is null'd.

But, to answer your question, this path is ingress only and we don't
use sk_socket for anything here, so I don't see any reason the check is
needed. All that is done here is converting to an skmsg and posting it
onto the ingress queue.

> 
> Do we need a similar check here?
> 

I don't think so, for the above reason. Thanks for asking though, and
let me know if you see something.

To make the workqueue path symmetric, I think I'll move the check there
into the egress branch.
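The decision the fast-track patch makes can be sketched as a small
userspace model. The names below (choose_ingress_path etc.) are made up
for illustration, not kernel API; the point is just the ordering rule:
direct ingress is only safe while the backlog queue is empty, anything
else goes through the workqueue.

```c
#include <assert.h>

/* Model of the fast-track decision in sk_psock_verdict_apply():
 * submit directly only when the backlog queue is empty, otherwise
 * (or when the direct attempt fails) queue work for a later retry. */
enum path { PATH_DIRECT, PATH_WORKQUEUE };

/* queue_len:  skbs already sitting in psock->ingress_skb
 * direct_err: result a direct sk_psock_skb_ingress() attempt would
 *             return (0 on success, negative errno on memory pressure) */
static enum path choose_ingress_path(int queue_len, int direct_err)
{
    if (queue_len == 0 && direct_err == 0)
        return PATH_DIRECT;    /* skb becomes an sk_msg inline, data_ready now */
    return PATH_WORKQUEUE;     /* preserve ordering / retry from workqueue */
}
```

With a non-empty queue, taking the direct path would reorder data behind
the skbs already queued, which is why the empty-queue check comes first.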

> > +		}
> > +		if (err < 0) {
> > +			skb_queue_tail(&psock->ingress_skb, skb);
> > +			schedule_work(&psock->work);
> > +		}
> >  		break;
> >  	case __SK_REDIRECT:
> >  		sk_psock_skb_redirect(skb);




* Re: [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case
  2020-10-12 12:04   ` Jakub Sitnicki
@ 2020-10-12 17:19     ` John Fastabend
  2020-10-13 10:27       ` Jakub Sitnicki
  0 siblings, 1 reply; 14+ messages in thread
From: John Fastabend @ 2020-10-12 17:19 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: alexei.starovoitov, daniel, netdev, bpf, lmb

Jakub Sitnicki wrote:
> On Fri, Oct 09, 2020 at 08:37 PM CEST, John Fastabend wrote:
> > In the sk_skb redirect case we didn't handle the case where we overrun
> > the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
> > Because we didn't have anything implemented we simply dropped the skb.
> > This meant data could be dropped if socket memory accounting was in
> > place.
> >
> > This fixes the above dropped data case by moving the memory checks
> > later in the code where we actually do the send or recv. This pushes
> > those checks into the workqueue and allows us to return an EAGAIN error
> > which in turn allows us to try again later from the workqueue.
> >
> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  net/core/skmsg.c |   28 ++++++++++++++--------------
> >  1 file changed, 14 insertions(+), 14 deletions(-)
> >
> 
> [...]
> 
> > @@ -709,30 +711,28 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
> >  {
> >  	struct sk_psock *psock_other;
> >  	struct sock *sk_other;
> > -	bool ingress;
> >
> >  	sk_other = tcp_skb_bpf_redirect_fetch(skb);
> > +	/* This error is a buggy BPF program, it returned a redirect
> > +	 * return code, but then didn't set a redirect interface.
> > +	 */
> >  	if (unlikely(!sk_other)) {
> >  		kfree_skb(skb);
> >  		return;
> >  	}
> >  	psock_other = sk_psock(sk_other);
> > +	/* This error indicates the socket is being torn down or had another
> > +	 * error that caused the pipe to break. We can't send a packet on
> > +	 * a socket that is in this state so we drop the skb.
> > +	 */
> >  	if (!psock_other || sock_flag(sk_other, SOCK_DEAD) ||
> >  	    !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> >  		kfree_skb(skb);
> >  		return;
> >  	}
> >
> > -	ingress = tcp_skb_bpf_ingress(skb);
> > -	if ((!ingress && sock_writeable(sk_other)) ||
> > -	    (ingress &&
> > -	     atomic_read(&sk_other->sk_rmem_alloc) <=
> > -	     sk_other->sk_rcvbuf)) {
> 
> I'm wondering why the check for going over socket's rcvbuf was removed?

A couple of reasons. I never checked it from the skmsg side, so after
this patch accounting for both skmsg and sk_skb types is the same. I
think that should stay the case going forward with anything we do
around memory accounting. The other, and more immediate, reason is we
don't want the error case here with the kfree_skb().

> 
> I see that we now rely exclusively on
> sk_psock_skb_ingress→sk_rmem_schedule for sk_rmem_alloc checks, which I
> don't think applies the rcvbuf limit.

Right. Also notice that even though we checked it here, we never
charged sk_rmem_alloc for skmsg on the ingress queue. So we were
effectively getting that memory for free. I'm still doing some review
and drafting a patch to see if this works, but my proposal is:

  For the ingress sk_skb case we check sk_rmem_alloc before enqueuing
  the new sk_msg into the ingress queue and then also charge the memory,
  the same as skb_set_owner_r() does, except via a new helper
  skmsg_set_owner_r(skmsg, sk) that only does the atomic add against
  sk_rmem_alloc and the sk_mem_charge() part.

  Then for skmsg programs convert the sk_mem_charge() calls to
  use the new skmsg_set_owner_r() to get the same memory accounting.
  Finally, on copy to user buffer we unwind this. Then we will have
  the memory in the queue accounted for against the socket.

I'll give it a try. Thanks for asking. wdyt? Any other ideas?
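To make the proposal concrete, here is a minimal userspace model of
that accounting. skmsg_set_owner_r() is the hypothetical helper named
above; everything else here is a stand-in for illustration, not kernel
code (the kernel would use atomic_t, a plain int is enough for a
single-threaded sketch).

```c
#include <assert.h>

/* Toy socket carrying just the fields the proposal touches. */
struct sock_model {
    int sk_rmem_alloc;  /* bytes charged against the receive queue */
    int sk_rcvbuf;      /* receive buffer limit (SO_RCVBUF) */
};

/* Hypothetical skmsg_set_owner_r(): like skb_set_owner_r(), but only
 * the sk_rmem_alloc add / sk_mem_charge() part. */
static void skmsg_set_owner_r(struct sock_model *sk, int truesize)
{
    sk->sk_rmem_alloc += truesize;
}

/* Unwind the charge when the data is copied to the user buffer. */
static void skmsg_put_owner_r(struct sock_model *sk, int truesize)
{
    sk->sk_rmem_alloc -= truesize;
}

/* Enqueue path: check sk_rmem_alloc before queuing the new sk_msg and
 * refuse (so the caller retries, e.g. with -EAGAIN) when the charge
 * would push us over sk_rcvbuf. */
static int skmsg_try_charge(struct sock_model *sk, int truesize)
{
    if (sk->sk_rmem_alloc + truesize > sk->sk_rcvbuf)
        return -1;
    skmsg_set_owner_r(sk, truesize);
    return 0;
}
```

The key property is that the charge is balanced: whatever is added on
enqueue is subtracted on copy-out, so queued data is visible in the
socket's memory accounting instead of being "free".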

> 
> > -		skb_queue_tail(&psock_other->ingress_skb, skb);
> > -		schedule_work(&psock_other->work);
> > -	} else {
> > -		kfree_skb(skb);
> > -	}
> > +	skb_queue_tail(&psock_other->ingress_skb, skb);
> > +	schedule_work(&psock_other->work);
> >  }
> >
> >  static void sk_psock_tls_verdict_apply(struct sk_buff *skb, int verdict)


* Re: [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case
  2020-10-12 17:19     ` John Fastabend
@ 2020-10-13 10:27       ` Jakub Sitnicki
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Sitnicki @ 2020-10-13 10:27 UTC (permalink / raw)
  To: John Fastabend; +Cc: alexei.starovoitov, daniel, netdev, bpf, lmb

On Mon, Oct 12, 2020 at 07:19 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Fri, Oct 09, 2020 at 08:37 PM CEST, John Fastabend wrote:
>> > In the sk_skb redirect case we didn't handle the case where we overrun
>> > the sk_rmem_alloc entry on ingress redirect or sk_wmem_alloc on egress.
>> > Because we didn't have anything implemented we simply dropped the skb.
>> > This meant data could be dropped if socket memory accounting was in
>> > place.
>> >
>> > This fixes the above dropped data case by moving the memory checks
>> > later in the code where we actually do the send or recv. This pushes
>> > those checks into the workqueue and allows us to return an EAGAIN error
>> > which in turn allows us to try again later from the workqueue.
>> >
>> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>> >  net/core/skmsg.c |   28 ++++++++++++++--------------
>> >  1 file changed, 14 insertions(+), 14 deletions(-)
>> >
>>
>> [...]
>>
>> > @@ -709,30 +711,28 @@ static void sk_psock_skb_redirect(struct sk_buff *skb)
>> >  {
>> >  	struct sk_psock *psock_other;
>> >  	struct sock *sk_other;
>> > -	bool ingress;
>> >
>> >  	sk_other = tcp_skb_bpf_redirect_fetch(skb);
>> > +	/* This error is a buggy BPF program, it returned a redirect
>> > +	 * return code, but then didn't set a redirect interface.
>> > +	 */
>> >  	if (unlikely(!sk_other)) {
>> >  		kfree_skb(skb);
>> >  		return;
>> >  	}
>> >  	psock_other = sk_psock(sk_other);
>> > +	/* This error indicates the socket is being torn down or had another
>> > +	 * error that caused the pipe to break. We can't send a packet on
>> > +	 * a socket that is in this state so we drop the skb.
>> > +	 */
>> >  	if (!psock_other || sock_flag(sk_other, SOCK_DEAD) ||
>> >  	    !sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
>> >  		kfree_skb(skb);
>> >  		return;
>> >  	}
>> >
>> > -	ingress = tcp_skb_bpf_ingress(skb);
>> > -	if ((!ingress && sock_writeable(sk_other)) ||
>> > -	    (ingress &&
>> > -	     atomic_read(&sk_other->sk_rmem_alloc) <=
>> > -	     sk_other->sk_rcvbuf)) {
>>
>> I'm wondering why the check for going over socket's rcvbuf was removed?
>
> A couple of reasons. I never checked it from the skmsg side, so after
> this patch accounting for both skmsg and sk_skb types is the same. I
> think that should stay the case going forward with anything we do
> around memory accounting. The other, and more immediate, reason is we
> don't want the error case here with the kfree_skb().

Right, we definitely don't want to drop the skb.

What crossed my mind is that sk_psock_handle_skb() could check for
sk_rmem_alloc <= sk_rcvbuf and error out with -EAGAIN, similarly to how
we check sock_writeable() with this change.

This would let the process owning the destination socket we are
redirecting to push back by tuning down SO_RCVBUF.

>
>>
>> I see that we now rely exclusively on
>> sk_psock_skb_ingress→sk_rmem_schedule for sk_rmem_alloc checks, which I
>> don't think applies the rcvbuf limit.
>
> Right. Also notice that even though we checked it here, we never
> charged sk_rmem_alloc for skmsg on the ingress queue. So we were
> effectively getting that memory for free. I'm still doing some review
> and drafting a patch to see if this works, but my proposal is:
>
>   For the ingress sk_skb case we check sk_rmem_alloc before enqueuing
>   the new sk_msg into the ingress queue and then also charge the memory,
>   the same as skb_set_owner_r() does, except via a new helper
>   skmsg_set_owner_r(skmsg, sk) that only does the atomic add against
>   sk_rmem_alloc and the sk_mem_charge() part.
>
>   Then for skmsg programs convert the sk_mem_charge() calls to
>   use the new skmsg_set_owner_r() to get the same memory accounting.
>   Finally, on copy to user buffer we unwind this. Then we will have
>   the memory in the queue accounted for against the socket.
>
> I'll give it a try. Thanks for asking. wdyt? Any other ideas?

SGTM, nothing to add except for honoring SO_RCVBUF I mentioned above.

[...]


* Re: [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress
  2020-10-12 15:33     ` John Fastabend
@ 2020-10-13 19:43       ` Jakub Sitnicki
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Sitnicki @ 2020-10-13 19:43 UTC (permalink / raw)
  To: John Fastabend; +Cc: alexei.starovoitov, daniel, netdev, bpf, lmb

On Mon, Oct 12, 2020 at 05:33 PM CEST, John Fastabend wrote:
> Jakub Sitnicki wrote:

[...]

>> On Fri, Oct 09, 2020 at 08:36 PM CEST, John Fastabend wrote:
>> > When we receive an skb and the ingress skb verdict program returns
>> > SK_PASS we currently set the ingress flag and put it on the workqueue
>> > so it can be turned into a sk_msg and put on the sk_msg ingress queue.
>> > Then finally telling userspace with data_ready hook.
>> >
>> > Here we observe that if the workqueue is empty then we can try to
>> > convert into a sk_msg type and call data_ready directly without
>> > bouncing through a workqueue. It's a common pattern to have a recv
>> > verdict program for visibility that always returns SK_PASS. In this
>> > case unless there is an ENOMEM error or we overrun the socket we
>> > can avoid the workqueue completely only using it when we fall back
>> > to error cases caused by memory pressure.
>> >
>> > By doing this we eliminate another case where data may be dropped
>> > if errors occur on memory limits in workqueue.
>> >
>> > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>> >  net/core/skmsg.c |   17 +++++++++++++++--
>> >  1 file changed, 15 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> > index 040ae1d75b65..4b160d97b7f9 100644
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -773,6 +773,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>> >  {
>> >  	struct tcp_skb_cb *tcp;
>> >  	struct sock *sk_other;
>> > +	int err = -EIO;
>> >
>> >  	switch (verdict) {
>> >  	case __SK_PASS:
>> > @@ -784,8 +785,20 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
>> >
>> >  		tcp = TCP_SKB_CB(skb);
>> >  		tcp->bpf.flags |= BPF_F_INGRESS;
>> > -		skb_queue_tail(&psock->ingress_skb, skb);
>> > -		schedule_work(&psock->work);
>> > +
>> > +		/* If the queue is empty then we can submit directly
>> > +		 * into the msg queue. If it's not empty we have to
>> > +		 * queue work, otherwise we may get OOO data. Any
>> > +		 * sk_psock_skb_ingress errors will be handled by
>> > +		 * retrying later from the workqueue.
>> > +		 */
>> > +		if (skb_queue_empty(&psock->ingress_skb)) {
>> > +			err = sk_psock_skb_ingress(psock, skb);
>>
>> When going through the workqueue (sk_psock_backlog), we will also check
>> if socket didn't get detached from the process, that is if
>> psock->sk->sk_socket != NULL, before queueing into msg queue.
>
> The sk_socket check is only for the egress path,
>
>   sk_psock_handle_skb -> skb_send_sock_locked -> kernel_sendmsg_locked

Oh, okay. I thought it was because we want to stop forwarding into the
socket as soon as there is no process to read from the queue.

> Then do_tcp_sendpages() uses sk_socket and I don't see any checks for
> sk_socket being set. Although I think it's worth looking through to see
> whether the psock/sk state is always such that we have sk_socket there;
> I don't recall off-hand where that is null'd.

It's in sock_orphan().

> But, to answer your question, this path is ingress only and we don't
> use sk_socket for anything here, so I don't see any reason the check is
> needed. All that is done here is converting to an skmsg and posting it
> onto the ingress queue.

Queued skb won't be read out, but I don't see a problem with it.
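For reference, the guard being discussed can be modeled in a few lines
(names and types here are made up for illustration): the backlog only
needs the sk_socket check on the egress side, since sock_orphan() nulls
it before a send would use it, while the ingress enqueue never touches
sk_socket.

```c
#include <assert.h>
#include <stddef.h>

struct socket_model { int unused; };

struct sock_state {
    struct socket_model *sk_socket;  /* NULL after sock_orphan() */
};

enum verdict_dir { DIR_INGRESS, DIR_EGRESS };

/* Backlog decision: egress must bail out once the socket is orphaned,
 * since the send path dereferences sk_socket; ingress just converts to
 * an skmsg and enqueues, so it can proceed either way. */
static int backlog_can_proceed(const struct sock_state *sk,
                               enum verdict_dir dir)
{
    if (dir == DIR_EGRESS && sk->sk_socket == NULL)
        return 0;  /* no process attached: sending would be unsafe */
    return 1;
}
```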

[...]



Thread overview: 14+ messages
2020-10-09 18:35 [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes John Fastabend
2020-10-09 18:36 ` [bpf-next PATCH v3 1/6] bpf, sockmap: skb verdict SK_PASS to self already checked rmem limits John Fastabend
2020-10-09 18:36 ` [bpf-next PATCH v3 2/6] bpf, sockmap: On receive programs try to fast track SK_PASS ingress John Fastabend
2020-10-12  9:03   ` Jakub Sitnicki
2020-10-12 15:33     ` John Fastabend
2020-10-13 19:43       ` Jakub Sitnicki
2020-10-09 18:36 ` [bpf-next PATCH v3 3/6] bpf, sockmap: remove skb_set_owner_w wmem will be taken later from sendpage John Fastabend
2020-10-09 18:37 ` [bpf-next PATCH v3 4/6] bpf, sockmap: remove dropped data on errors in redirect case John Fastabend
2020-10-12 12:04   ` Jakub Sitnicki
2020-10-12 17:19     ` John Fastabend
2020-10-13 10:27       ` Jakub Sitnicki
2020-10-09 18:37 ` [bpf-next PATCH v3 5/6] bpf, sockmap: Remove skb_orphan and let normal skb_kfree do cleanup John Fastabend
2020-10-09 18:37 ` [bpf-next PATCH v3 6/6] bpf, sockmap: Add memory accounting so skbs on ingress lists are visible John Fastabend
2020-10-12  1:10 ` [bpf-next PATCH v2 0/6] sockmap/sk_skb program memory acct fixes patchwork-bot+netdevbpf
