All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Fastabend <john.fastabend@gmail.com>
To: jakub@cloudflare.com, daniel@iogearbox.net, lmb@isovalent.com,
	edumazet@google.com
Cc: john.fastabend@gmail.com, bpf@vger.kernel.org,
	netdev@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
	will@isovalent.com
Subject: [PATCH bpf v6 03/12] bpf: sockmap, improved check for empty queue
Date: Fri,  7 Apr 2023 10:16:45 -0700	[thread overview]
Message-ID: <20230407171654.107311-4-john.fastabend@gmail.com> (raw)
In-Reply-To: <20230407171654.107311-1-john.fastabend@gmail.com>

We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.

But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.

Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.

To fix also check the 'state' variable where we would cache partially
processed sk_buff. This catches the majority of cases. But, we also
need to use the mutex lock around this check because we can't have both
codes running and check sensibly. We could perhaps do this with atomic
bit checks, but we are already here due to memory pressure so slowing
things down a bit seems OK and simpler to just grab a lock.

To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Tested-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 198bed303c51..f8731818b5c3 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -987,6 +987,7 @@ EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);
 static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 				  int verdict)
 {
+	struct sk_psock_work_state *state;
 	struct sock *sk_other;
 	int err = 0;
 	u32 len, off;
@@ -1003,13 +1004,28 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 
 		skb_bpf_set_ingress(skb);
 
+		/* We need to grab mutex here because in-flight skb is in one of
+		 * the following states: either on ingress_skb, in psock->state
+		 * or being processed by backlog and neither in state->skb and
+		 * ingress_skb may be also empty. The troublesome case is when
+		 * the skb has been dequeued from ingress_skb list or taken from
+		 * state->skb because we can not easily test this case. Maybe we
+		 * could be clever with flags and resolve this but being clever
+		 * got us here in the first place and we note this is done under
+		 * sock lock and backlog conditions mean we are already running
+		 * into ENOMEM or other performance hindering cases so lets do
+		 * the obvious thing and grab the mutex.
+		 */
+		mutex_lock(&psock->work_mutex);
+		state = &psock->work_state;
+
 		/* If the queue is empty then we can submit directly
 		 * into the msg queue. If its not empty we have to
 		 * queue work otherwise we may get OOO data. Otherwise,
 		 * if sk_psock_skb_ingress errors will be handled by
 		 * retrying later from workqueue.
 		 */
-		if (skb_queue_empty(&psock->ingress_skb)) {
+		if (skb_queue_empty(&psock->ingress_skb) && likely(!state->skb)) {
 			len = skb->len;
 			off = 0;
 			if (skb_bpf_strparser(skb)) {
@@ -1030,9 +1046,11 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 			spin_unlock_bh(&psock->ingress_lock);
 			if (err < 0) {
 				skb_bpf_redirect_clear(skb);
+				mutex_unlock(&psock->work_mutex);
 				goto out_free;
 			}
 		}
+		mutex_unlock(&psock->work_mutex);
 		break;
 	case __SK_REDIRECT:
 		err = sk_psock_skb_redirect(psock, skb);
-- 
2.33.0


  parent reply	other threads:[~2023-04-07 17:17 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-04-07 17:16 [PATCH bpf v6 00/12] bpf sockmap fixes John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 01/12] bpf: sockmap, pass skb ownership through read_skb John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 02/12] bpf: sockmap, convert schedule_work into delayed_work John Fastabend
2023-04-07 17:16 ` John Fastabend [this message]
2023-04-11 17:53   ` [PATCH bpf v6 03/12] bpf: sockmap, improved check for empty queue Jakub Sitnicki
2023-04-13 17:51   ` Alexei Starovoitov
2023-04-13 20:06     ` John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 04/12] bpf: sockmap, handle fin correctly John Fastabend
2023-04-11 18:02   ` Jakub Sitnicki
2023-04-07 17:16 ` [PATCH bpf v6 05/12] bpf: sockmap, TCP data stall on recv before accept John Fastabend
2023-04-12 10:26   ` Jakub Sitnicki
2023-04-07 17:16 ` [PATCH bpf v6 06/12] bpf: sockmap, wake up polling after data copy John Fastabend
2023-04-13 13:00   ` Jakub Sitnicki
2023-04-07 17:16 ` [PATCH bpf v6 07/12] bpf: sockmap, incorrectly handling copied_seq John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 08/12] bpf: sockmap, pull socket helpers out of listen test for general use John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 09/12] bpf: sockmap, build helper to create connected socket pair John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 10/12] bpf: sockmap, test shutdown() correctly exits epoll and recv()=0 John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 11/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer John Fastabend
2023-04-07 17:16 ` [PATCH bpf v6 12/12] bpf: sockmap, test FIONREAD returns correct bytes in rx buffer with drops John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230407171654.107311-4-john.fastabend@gmail.com \
    --to=john.fastabend@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=edumazet@google.com \
    --cc=jakub@cloudflare.com \
    --cc=lmb@isovalent.com \
    --cc=netdev@vger.kernel.org \
    --cc=will@isovalent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.