bpf.vger.kernel.org archive mirror
* [bpf PATCH v2 0/6] sockmap fixes
@ 2020-11-12 23:26 John Fastabend
  2020-11-12 23:26 ` [bpf PATCH v2 1/6] bpf, sockmap: fix partial copy_page_to_iter so progress can still be made John Fastabend
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:26 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

This series includes fixes for sockmap, found after I started running skmsg and
verdict programs on systems that I use daily. To date, with the attached
series applied, I've been running for multiple weeks without seeing any
issues on systems doing calls, mail, movies, etc.

I also started running packetdrill, and after this series the last remaining
fix needed is to handle MSG_EOR correctly. That will come as a follow-up,
but because we use sendpage to pass pages into the TCP stack it needs some
enabling on the TCP side as well.

v2: 
 - Added patch3 to use truesize in sk_rmem_schedule (Daniel)
 - cleaned up some small nits... goto and extra set of brackets (Daniel)

---

John Fastabend (6):
      bpf, sockmap: fix partial copy_page_to_iter so progress can still be made
      bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
      bpf, sockmap: Use truesize with sk_rmem_schedule()
      bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
      bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
      bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list


 net/core/skmsg.c   | 84 +++++++++++++++++++++++++++++++++++++++-------
 net/ipv4/tcp_bpf.c |  3 +-
 2 files changed, 73 insertions(+), 14 deletions(-)

--
Signature


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [bpf PATCH v2 1/6] bpf, sockmap: fix partial copy_page_to_iter so progress can still be made
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
@ 2020-11-12 23:26 ` John Fastabend
  2020-11-12 23:27 ` [bpf PATCH v2 2/6] bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect John Fastabend
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:26 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

If copy_page_to_iter() fails, or completes only partially with fewer bytes
copied than expected, we currently reset sg.start and return EFAULT. This
proves problematic if we already copied data into the user buffer before
hitting the error: we leave the copied data in the user buffer but fail to
unwind the scatterlist, so the kernel side believes the data has been copied
while the user side believes it has _not_ been received.

The expected behavior is to return the number of bytes copied, and then on
the next read to return the error, assuming it's still there. This case can
happen when a copy length spans multiple scatterlist elements and one or
more of them complete before the error is hit.

The error is rare enough that in my normal testing with server-side
programs, such as nginx, httpd, envoy, etc., I have never seen it. The
only reliable way I've found to reproduce it is to stream movies over
my browser for a day or so and wait for it to hang. Not very scientific,
but with a few extra WARN_ON()s in the code the bug was obvious.

When we review the errors from copy_page_to_iter(), it seems we are hitting
a page fault from copy_page_to_iter_iovec(), where the code checks
fault_in_pages_writeable(buf, copy) with buf being the user buffer. It
also seems typical server applications don't hit this case.

The other way to try to reproduce this is to run the sockmap selftest tool
test_sockmap with data verification enabled, but it doesn't reproduce the
fault. Perhaps we can trigger this case artificially from the test tools;
I haven't sorted out a way to do that yet.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 37f4cb2bba5c..8e950b0bfabc 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -15,8 +15,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 {
 	struct iov_iter *iter = &msg->msg_iter;
 	int peek = flags & MSG_PEEK;
-	int i, ret, copied = 0;
 	struct sk_msg *msg_rx;
+	int i, copied = 0;
 
 	msg_rx = list_first_entry_or_null(&psock->ingress_msg,
 					  struct sk_msg, list);
@@ -37,11 +37,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 			page = sg_page(sge);
 			if (copied + copy > len)
 				copy = len - copied;
-			ret = copy_page_to_iter(page, sge->offset, copy, iter);
-			if (ret != copy) {
-				msg_rx->sg.start = i;
-				return -EFAULT;
-			}
+			copy = copy_page_to_iter(page, sge->offset, copy, iter);
+			if (!copy)
+				return copied ? copied : -EFAULT;
 
 			copied += copy;
 			if (likely(!peek)) {
@@ -56,6 +54,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 						put_page(page);
 				}
 			} else {
+				/* Lets not optimize peek case if copy_page_to_iter
+				 * didn't copy the entire length lets just break.
+				 */
+				if (copy != sge->length)
+					return copied;
 				sk_msg_iter_var_next(i);
 			}
 




* [bpf PATCH v2 2/6] bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
  2020-11-12 23:26 ` [bpf PATCH v2 1/6] bpf, sockmap: fix partial copy_page_to_iter so progress can still be made John Fastabend
@ 2020-11-12 23:27 ` John Fastabend
  2020-11-12 23:27 ` [bpf PATCH v2 3/6] bpf, sockmap: Use truesize with sk_rmem_schedule() John Fastabend
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:27 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This
allows users to tune SO_RCVBUF, and sockmap will honor it.

We can refactor the if (charge) case out in later patches, but keep this
fix to the point.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c   |   20 ++++++++++++++++----
 net/ipv4/tcp_bpf.c |    3 ++-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 654182ecf87b..fe44280c033e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -170,10 +170,12 @@ static int sk_msg_free_elem(struct sock *sk, struct sk_msg *msg, u32 i,
 	struct scatterlist *sge = sk_msg_elem(msg, i);
 	u32 len = sge->length;
 
-	if (charge)
-		sk_mem_uncharge(sk, len);
-	if (!msg->skb)
+	/* When the skb owns the memory we free it from consume_skb path. */
+	if (!msg->skb) {
+		if (charge)
+			sk_mem_uncharge(sk, len);
 		put_page(sg_page(sge));
+	}
 	memset(sge, 0, sizeof(*sge));
 	return len;
 }
@@ -403,6 +405,9 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
 	int copied = 0, num_sge;
 	struct sk_msg *msg;
 
+	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+		return -EAGAIN;
+
 	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
 	if (unlikely(!msg))
 		return -EAGAIN;
@@ -418,7 +423,14 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
 		return num_sge;
 	}
 
-	sk_mem_charge(sk, skb->len);
+	/* This will transition ownership of the data from the socket where
+	 * the BPF program was run initiating the redirect to the socket
+	 * we will eventually receive this data on. The data will be released
+	 * from skb_consume found in __tcp_bpf_recvmsg() after its been copied
+	 * into user buffers.
+	 */
+	skb_set_owner_r(skb, sk);
+
 	copied = skb->len;
 	msg->sg.start = 0;
 	msg->sg.size = copied;
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 8e950b0bfabc..bc7d2a586e18 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -45,7 +45,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 			if (likely(!peek)) {
 				sge->offset += copy;
 				sge->length -= copy;
-				sk_mem_uncharge(sk, copy);
+				if (!msg_rx->skb)
+					sk_mem_uncharge(sk, copy);
 				msg_rx->sg.size -= copy;
 
 				if (!sge->length) {




* [bpf PATCH v2 3/6] bpf, sockmap: Use truesize with sk_rmem_schedule()
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
  2020-11-12 23:26 ` [bpf PATCH v2 1/6] bpf, sockmap: fix partial copy_page_to_iter so progress can still be made John Fastabend
  2020-11-12 23:27 ` [bpf PATCH v2 2/6] bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect John Fastabend
@ 2020-11-12 23:27 ` John Fastabend
  2020-11-12 23:27 ` [bpf PATCH v2 4/6] bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self John Fastabend
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:27 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

We currently use skb->len with sk_rmem_schedule(), which is not correct.
Instead use truesize to align with the socket and TCP stack usage of
sk_rmem_schedule().

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index fe44280c033e..d09426ce4af3 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -411,7 +411,7 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
 	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
 	if (unlikely(!msg))
 		return -EAGAIN;
-	if (!sk_rmem_schedule(sk, skb, skb->len)) {
+	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
 		kfree(msg);
 		return -EAGAIN;
 	}




* [bpf PATCH v2 4/6] bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
                   ` (2 preceding siblings ...)
  2020-11-12 23:27 ` [bpf PATCH v2 3/6] bpf, sockmap: Use truesize with sk_rmem_schedule() John Fastabend
@ 2020-11-12 23:27 ` John Fastabend
  2020-11-12 23:27 ` [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects " John Fastabend
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:27 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

If a socket redirects to itself and is under memory pressure, it is
possible for the socket to get stuck so that recv() returns EAGAIN and the
socket cannot advance for some time. This happens because, when
redirecting an skb to the same socket we received it on, we first
check whether it is OK to enqueue the skb on the receiving socket by
checking memory limits. But if the skb is itself the object holding the
memory needed to enqueue it, we will keep retrying from the kernel side
and always fail with EAGAIN. Userspace will then get a recv() EAGAIN
error if there are no skbs in the psock ingress queue. This continues
until either some skbs get kfree'd, reducing the memory pressure
far enough that we can enqueue the pending packet, or the
socket is destroyed. In some cases it is possible to get a socket
stuck for a noticeable amount of time if the socket is only receiving
skbs from sk_skb verdict programs. To reproduce this I make the socket
memory limits ridiculously low so sockets are always under memory
pressure. More often, though, under memory pressure it looks like
a spurious EAGAIN error on the userspace side, causing userspace to retry,
and typically enough has moved on the memory side that the retry works.

To fix this, skip the memory checks and skb_orphan() if we are receiving
on the same sock the skb is already assigned to.

For SK_PASS cases this is easy; it is always the same socket, so we
can just omit the orphan/set_owner pair.

For backlog cases we need to check skb->sk and decide if the orphan
and set_owner pair are needed.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   72 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index d09426ce4af3..9aed5a2c7c5b 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -399,38 +399,38 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
 }
 EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);
 
-static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
+static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
+						  struct sk_buff *skb)
 {
-	struct sock *sk = psock->sk;
-	int copied = 0, num_sge;
 	struct sk_msg *msg;
 
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
-		return -EAGAIN;
+		return NULL;
+
+	if (!sk_rmem_schedule(sk, skb, skb->truesize))
+		return NULL;
 
 	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
 	if (unlikely(!msg))
-		return -EAGAIN;
-	if (!sk_rmem_schedule(sk, skb, skb->truesize)) {
-		kfree(msg);
-		return -EAGAIN;
-	}
+		return NULL;
 
 	sk_msg_init(msg);
-	num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len);
+	return msg;
+}
+
+static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
+					struct sk_psock *psock,
+					struct sock *sk,
+					struct sk_msg *msg)
+{
+	int num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len);
+	int copied;
+
 	if (unlikely(num_sge < 0)) {
 		kfree(msg);
 		return num_sge;
 	}
 
-	/* This will transition ownership of the data from the socket where
-	 * the BPF program was run initiating the redirect to the socket
-	 * we will eventually receive this data on. The data will be released
-	 * from skb_consume found in __tcp_bpf_recvmsg() after its been copied
-	 * into user buffers.
-	 */
-	skb_set_owner_r(skb, sk);
-
 	copied = skb->len;
 	msg->sg.start = 0;
 	msg->sg.size = copied;
@@ -442,6 +442,40 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
 	return copied;
 }
 
+static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
+{
+	struct sock *sk = psock->sk;
+	struct sk_msg *msg;
+
+	msg = sk_psock_create_ingress_msg(sk, skb);
+	if (!msg)
+		return -EAGAIN;
+
+	/* This will transition ownership of the data from the socket where
+	 * the BPF program was run initiating the redirect to the socket
+	 * we will eventually receive this data on. The data will be released
+	 * from skb_consume found in __tcp_bpf_recvmsg() after its been copied
+	 * into user buffers.
+	 */
+	skb_set_owner_r(skb, sk);
+	return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
+}
+
+/* Puts an skb on the ingress queue of the socket already assigned to the
+ * skb. In this case we do not need to check memory limits or skb_set_owner_r
+ * because the skb is already accounted for here.
+ */
+static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb)
+{
+	struct sk_msg *msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
+	struct sock *sk = psock->sk;
+
+	if (unlikely(!msg))
+		return -EAGAIN;
+	sk_msg_init(msg);
+	return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
+}
+
 static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 			       u32 off, u32 len, bool ingress)
 {
@@ -801,7 +835,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 		 * retrying later from workqueue.
 		 */
 		if (skb_queue_empty(&psock->ingress_skb)) {
-			err = sk_psock_skb_ingress(psock, skb);
+			err = sk_psock_skb_ingress_self(psock, skb);
 		}
 		if (err < 0) {
 			skb_queue_tail(&psock->ingress_skb, skb);




* [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
                   ` (3 preceding siblings ...)
  2020-11-12 23:27 ` [bpf PATCH v2 4/6] bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self John Fastabend
@ 2020-11-12 23:27 ` John Fastabend
  2020-11-16 14:31   ` Jakub Sitnicki
  2020-11-12 23:28 ` [bpf PATCH v2 6/6] bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list John Fastabend
  2020-11-16 14:49 ` [bpf PATCH v2 0/6] sockmap fixes Jakub Sitnicki
  6 siblings, 1 reply; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:27 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

If the skb_verdict prog knowingly redirects an skb to itself, fix your
BPF program: this is not optimal and an abuse of the API; please use
SK_PASS instead. That said, there may be cases, such as socket load
balancing, where picking the socket is hash based or otherwise picks the
same socket the skb was received on in some rare cases. If this happens we
don't want to confuse userspace by giving it an EAGAIN error if we can
avoid it.

At the moment, even if the skb has already been charged against the
socket's rcvbuf and forward alloc, we check it again and do
skb_set_owner_r(), causing it to be orphaned and recharged, double
accounting the memory. For one, this is useless work; more importantly, we
can hit a case where the skb could be put on the ingress queue, but
because we are under memory pressure we return EAGAIN. The trouble
here is that the skb has already been accounted for, so any rcvbuf check
already includes the memory associated with the packet. This rolls
up and can result in unnecessary EAGAIN errors in userspace read()
calls.

Fix by adding an unlikely() check and skipping the checks when skb->sk == sk.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9aed5a2c7c5b..f747ee341fe8 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -404,11 +404,13 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
 {
 	struct sk_msg *msg;
 
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
-		return NULL;
+	if (likely(skb->sk != sk)) {
+		if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
+			return NULL;
 
-	if (!sk_rmem_schedule(sk, skb, skb->truesize))
-		return NULL;
+		if (!sk_rmem_schedule(sk, skb, skb->truesize))
+			return NULL;
+	}
 
 	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
 	if (unlikely(!msg))
@@ -455,9 +457,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
 	 * the BPF program was run initiating the redirect to the socket
 	 * we will eventually receive this data on. The data will be released
 	 * from skb_consume found in __tcp_bpf_recvmsg() after its been copied
-	 * into user buffers.
+	 * into user buffers. If we are receiving on the same sock skb->sk is
+	 * already assigned, skip memory accounting and owner transition seeing
+	 * it already set correctly.
 	 */
-	skb_set_owner_r(skb, sk);
+	if (likely(skb->sk != sk))
+		skb_set_owner_r(skb, sk);
 	return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
 }
 




* [bpf PATCH v2 6/6] bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
                   ` (4 preceding siblings ...)
  2020-11-12 23:27 ` [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects " John Fastabend
@ 2020-11-12 23:28 ` John Fastabend
  2020-11-16 14:49 ` [bpf PATCH v2 0/6] sockmap fixes Jakub Sitnicki
  6 siblings, 0 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-12 23:28 UTC (permalink / raw)
  To: ast, daniel, jakub; +Cc: john.fastabend, bpf, netdev

When an skb has a frag_list it is possible for skb_to_sgvec() to fail. This
happens when the scatterlist has fewer elements available to store pages
than would be needed for the initial skb plus any of its frags.

This case appears rare, but is possible when running RX parser/verdict
programs exposed to the internet. Currently, when this happens we throw
an error, break the pipe, and kfree the msg. This effectively breaks the
application or forces it to retry.

Let's catch this case and handle it by doing an skb_linearize() on any
skb we receive with frags. At this point skb_to_sgvec() should not fail,
because the failing conditions would require frags to be in place.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index f747ee341fe8..7ec1fdc083e4 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -425,9 +425,16 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
 					struct sock *sk,
 					struct sk_msg *msg)
 {
-	int num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len);
-	int copied;
+	int num_sge, copied;
 
+	/* skb linearize may fail with ENOMEM, but lets simply try again
+	 * later if this happens. Under memory pressure we don't want to
+	 * drop the skb. We need to linearize the skb so that the mapping
+	 * in skb_to_sgvec can not error.
+	 */
+	if (skb_linearize(skb))
+		return -EAGAIN;
+	num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len);
 	if (unlikely(num_sge < 0)) {
 		kfree(msg);
 		return num_sge;




* Re: [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
  2020-11-12 23:27 ` [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects " John Fastabend
@ 2020-11-16 14:31   ` Jakub Sitnicki
  2020-11-16 22:28     ` John Fastabend
  0 siblings, 1 reply; 10+ messages in thread
From: Jakub Sitnicki @ 2020-11-16 14:31 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, bpf, netdev

On Fri, Nov 13, 2020 at 12:27 AM CET, John Fastabend wrote:
> If the skb_verdict_prog redirects an skb knowingly to itself, fix your
> BPF program this is not optimal and an abuse of the API please use
> SK_PASS. That said there may be cases, such as socket load balancing,
> where picking the socket is hashed based or otherwise picks the same
> socket it was received on in some rare cases. If this happens we don't
> want to confuse userspace giving them an EAGAIN error if we can avoid
> it.
>
> To avoid double accounting in these cases. At the moment even if the
> skb has already been charged against the sockets rcvbuf and forward
> alloc we check it again and do set_owner_r() causing it to be orphaned
> and recharged. For one this is useless work, but more importantly we
> can have a case where the skb could be put on the ingress queue, but
> because we are under memory pressure we return EAGAIN. The trouble
> here is the skb has already been accounted for so any rcvbuf checks
> include the memory associated with the packet already. This rolls
> up and can result in unecessary EAGAIN errors in userspace read()
> calls.
>
> Fix by doing an unlikely check and skipping checks if skb->sk == sk.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/core/skmsg.c |   17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9aed5a2c7c5b..f747ee341fe8 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -404,11 +404,13 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
>  {
>  	struct sk_msg *msg;
>  
> -	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> -		return NULL;
> +	if (likely(skb->sk != sk)) {
> +		if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
> +			return NULL;
>  
> -	if (!sk_rmem_schedule(sk, skb, skb->truesize))
> -		return NULL;
> +		if (!sk_rmem_schedule(sk, skb, skb->truesize))
> +			return NULL;
> +	}
>  
>  	msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC);
>  	if (unlikely(!msg))
> @@ -455,9 +457,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb)
>  	 * the BPF program was run initiating the redirect to the socket
>  	 * we will eventually receive this data on. The data will be released
>  	 * from skb_consume found in __tcp_bpf_recvmsg() after its been copied
> -	 * into user buffers.
> +	 * into user buffers. If we are receiving on the same sock skb->sk is
> +	 * already assigned, skip memory accounting and owner transition seeing
> +	 * it already set correctly.
>  	 */
> -	skb_set_owner_r(skb, sk);
> +	if (likely(skb->sk != sk))
> +		skb_set_owner_r(skb, sk);
>  	return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg);
>  }
>  

I think all the added checks boil down to having:

	struct sock *sk = psock->sk;

        if (unlikely(skb->sk == sk))
                return sk_psock_skb_ingress_self(psock, skb);

... on entry to sk_psock_skb_ingress().


* Re: [bpf PATCH v2 0/6] sockmap fixes
  2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
                   ` (5 preceding siblings ...)
  2020-11-12 23:28 ` [bpf PATCH v2 6/6] bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list John Fastabend
@ 2020-11-16 14:49 ` Jakub Sitnicki
  6 siblings, 0 replies; 10+ messages in thread
From: Jakub Sitnicki @ 2020-11-16 14:49 UTC (permalink / raw)
  To: John Fastabend; +Cc: ast, daniel, bpf, netdev

On Fri, Nov 13, 2020 at 12:26 AM CET, John Fastabend wrote:
> This includes fixes for sockmap found after I started running skmsg and
> verdict programs on systems that I use daily. To date with attached
> series I've been running for multiple weeks without seeing any issues
> on systems doing calls, mail, movies, etc.
>
> Also I started running packetdrill and after this series last remaining
> fix needed is to handle MSG_EOR correctly. This will come as a follow
> up to this, but because we use sendpage to pass pages into TCP stack
> we need to enable TCP side some.
>
> v2:
>  - Added patch3 to use truesize in sk_rmem_schedule (Daniel)
>  - cleaned up some small nits... goto and extra set of brackets (Daniel)
>
> ---
>
> John Fastabend (6):
>       bpf, sockmap: fix partial copy_page_to_iter so progress can still be made
>       bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
>       bpf, sockmap: Use truesize with sk_rmem_schedule()
>       bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
>       bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
>       bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list


Patch 5 potentially can be simplified. Otherwise LGTM. For the series:

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
  2020-11-16 14:31   ` Jakub Sitnicki
@ 2020-11-16 22:28     ` John Fastabend
  0 siblings, 0 replies; 10+ messages in thread
From: John Fastabend @ 2020-11-16 22:28 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend; +Cc: ast, daniel, bpf, netdev

Jakub Sitnicki wrote:
> On Fri, Nov 13, 2020 at 12:27 AM CET, John Fastabend wrote:
> > If the skb_verdict_prog redirects an skb knowingly to itself, fix your
> > BPF program this is not optimal and an abuse of the API please use
> > SK_PASS. That said there may be cases, such as socket load balancing,
> > where picking the socket is hashed based or otherwise picks the same
> > socket it was received on in some rare cases. If this happens we don't
> > want to confuse userspace giving them an EAGAIN error if we can avoid
> > it.

[...]
 
> 
> I think all the added checks boil down to having:
> 
> 	struct sock *sk = psock->sk;
> 
>         if (unlikely(skb->sk == sk))
>                 return sk_psock_skb_ingress_self(psock, skb);
> 
> ... on entry to sk_psock_skb_ingress().

Agree made the change and sent out v3 thanks. I also carried your
Reviewed-by through on patches 1-4 and 6.

Thanks for reviewing!


end of thread, other threads:[~2020-11-16 22:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-12 23:26 [bpf PATCH v2 0/6] sockmap fixes John Fastabend
2020-11-12 23:26 ` [bpf PATCH v2 1/6] bpf, sockmap: fix partial copy_page_to_iter so progress can still be made John Fastabend
2020-11-12 23:27 ` [bpf PATCH v2 2/6] bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect John Fastabend
2020-11-12 23:27 ` [bpf PATCH v2 3/6] bpf, sockmap: Use truesize with sk_rmem_schedule() John Fastabend
2020-11-12 23:27 ` [bpf PATCH v2 4/6] bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self John Fastabend
2020-11-12 23:27 ` [bpf PATCH v2 5/6] bpf, sockmap: Handle memory acct if skb_verdict prog redirects " John Fastabend
2020-11-16 14:31   ` Jakub Sitnicki
2020-11-16 22:28     ` John Fastabend
2020-11-12 23:28 ` [bpf PATCH v2 6/6] bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list John Fastabend
2020-11-16 14:49 ` [bpf PATCH v2 0/6] sockmap fixes Jakub Sitnicki
