patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev,
	John Fastabend <john.fastabend@gmail.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	William Findlay <will@isovalent.com>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.1 15/42] bpf, sockmap: Convert schedule_work into delayed_work
Date: Thu,  1 Jun 2023 14:21:24 +0100	[thread overview]
Message-ID: <20230601131939.731926934@linuxfoundation.org> (raw)
In-Reply-To: <20230601131939.051934720@linuxfoundation.org>

From: John Fastabend <john.fastabend@gmail.com>

[ Upstream commit 29173d07f79883ac94f5570294f98af3d4287382 ]

Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
     // if system is under memory pressure or app is slow we may
     // need to queue skb. Do this queuing through ingress_skb and
     // then kick timer to wake up handler
     skb_queue_tail(ingress_skb, skb)
     schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.

Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.

To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.

To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.

>From on list discussion. This commit

 bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")

was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/skmsg.h |  2 +-
 net/core/skmsg.c      | 21 ++++++++++++++-------
 net/core/sock_map.c   |  3 ++-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 84f787416a54d..904ff9a32ad61 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -105,7 +105,7 @@ struct sk_psock {
 	struct proto			*sk_proto;
 	struct mutex			work_mutex;
 	struct sk_psock_work_state	work_state;
-	struct work_struct		work;
+	struct delayed_work		work;
 	struct rcu_work			rwork;
 };
 
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 2b6d9519ff29c..6a9b794861f3f 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -481,7 +481,7 @@ int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
 	}
 out:
 	if (psock->work_state.skb && copied > 0)
-		schedule_work(&psock->work);
+		schedule_delayed_work(&psock->work, 0);
 	return copied;
 }
 EXPORT_SYMBOL_GPL(sk_msg_recvmsg);
@@ -639,7 +639,8 @@ static void sk_psock_skb_state(struct sk_psock *psock,
 
 static void sk_psock_backlog(struct work_struct *work)
 {
-	struct sk_psock *psock = container_of(work, struct sk_psock, work);
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
 	struct sk_psock_work_state *state = &psock->work_state;
 	struct sk_buff *skb = NULL;
 	bool ingress;
@@ -679,6 +680,12 @@ static void sk_psock_backlog(struct work_struct *work)
 				if (ret == -EAGAIN) {
 					sk_psock_skb_state(psock, state, skb,
 							   len, off);
+
+					/* Delay slightly to prioritize any
+					 * other work that might be here.
+					 */
+					if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+						schedule_delayed_work(&psock->work, 1);
 					goto end;
 				}
 				/* Hard errors break pipe and stop xmit. */
@@ -733,7 +740,7 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node)
 	INIT_LIST_HEAD(&psock->link);
 	spin_lock_init(&psock->link_lock);
 
-	INIT_WORK(&psock->work, sk_psock_backlog);
+	INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);
 	mutex_init(&psock->work_mutex);
 	INIT_LIST_HEAD(&psock->ingress_msg);
 	spin_lock_init(&psock->ingress_lock);
@@ -822,7 +829,7 @@ static void sk_psock_destroy(struct work_struct *work)
 
 	sk_psock_done_strp(psock);
 
-	cancel_work_sync(&psock->work);
+	cancel_delayed_work_sync(&psock->work);
 	mutex_destroy(&psock->work_mutex);
 
 	psock_progs_drop(&psock->progs);
@@ -937,7 +944,7 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
 	}
 
 	skb_queue_tail(&psock_other->ingress_skb, skb);
-	schedule_work(&psock_other->work);
+	schedule_delayed_work(&psock_other->work, 0);
 	spin_unlock_bh(&psock_other->ingress_lock);
 	return 0;
 }
@@ -1017,7 +1024,7 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 			spin_lock_bh(&psock->ingress_lock);
 			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
 				skb_queue_tail(&psock->ingress_skb, skb);
-				schedule_work(&psock->work);
+				schedule_delayed_work(&psock->work, 0);
 				err = 0;
 			}
 			spin_unlock_bh(&psock->ingress_lock);
@@ -1048,7 +1055,7 @@ static void sk_psock_write_space(struct sock *sk)
 	psock = sk_psock(sk);
 	if (likely(psock)) {
 		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
-			schedule_work(&psock->work);
+			schedule_delayed_work(&psock->work, 0);
 		write_space = psock->saved_write_space;
 	}
 	rcu_read_unlock();
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index a68a7290a3b2b..d382672018928 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -1624,9 +1624,10 @@ void sock_map_close(struct sock *sk, long timeout)
 		rcu_read_unlock();
 		sk_psock_stop(psock);
 		release_sock(sk);
-		cancel_work_sync(&psock->work);
+		cancel_delayed_work_sync(&psock->work);
 		sk_psock_put(sk, psock);
 	}
+
 	/* Make sure we do not recurse. This is a bug.
 	 * Leak the socket instead of crashing on a stack overflow.
 	 */
-- 
2.39.2




  parent reply	other threads:[~2023-06-01 13:28 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-01 13:21 [PATCH 6.1 00/42] 6.1.32-rc1 review Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 01/42] inet: Add IP_LOCAL_PORT_RANGE socket option Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 02/42] ipv{4,6}/raw: fix output xfrm lookup wrt protocol Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 03/42] firmware: arm_ffa: Fix usage of partition info get count flag Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 04/42] selftests/bpf: Fix pkg-config call building sign-file Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 05/42] platform/x86/amd/pmf: Fix CnQF and auto-mode after resume Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 06/42] tls: rx: device: fix checking decryption status Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 07/42] tls: rx: strp: set the skb->len of detached / CoWed skbs Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 08/42] tls: rx: strp: fix determining record length in copy mode Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 09/42] tls: rx: strp: force mixed decrypted records into " Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 10/42] tls: rx: strp: factor out copying skb data Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 11/42] tls: rx: strp: preserve decryption status of skbs when needed Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 12/42] net/mlx5: E-switch, Devcom, sync devcom events and devcom comp register Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 13/42] gpio-f7188x: fix chip name and pin count on Nuvoton chip Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 14/42] bpf, sockmap: Pass skb ownership through read_skb Greg Kroah-Hartman
2023-06-01 13:21 ` Greg Kroah-Hartman [this message]
2023-06-01 13:21 ` [PATCH 6.1 16/42] bpf, sockmap: Reschedule is now done through backlog Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 17/42] bpf, sockmap: Improved check for empty queue Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 18/42] bpf, sockmap: Handle fin correctly Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 19/42] bpf, sockmap: TCP data stall on recv before accept Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 20/42] bpf, sockmap: Wake up polling after data copy Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 21/42] bpf, sockmap: Incorrectly handling copied_seq Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 22/42] blk-mq: fix race condition in active queue accounting Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 23/42] vfio/type1: check pfn valid before converting to struct page Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 24/42] net: page_pool: use in_softirq() instead Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 25/42] page_pool: fix inconsistency for page_pool_ring_[un]lock() Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 26/42] net: phy: mscc: enable VSC8501/2 RGMII RX clock Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 27/42] wifi: rtw89: correct 5 MHz mask setting Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 28/42] wifi: iwlwifi: mvm: support wowlan info notification version 2 Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 29/42] wifi: iwlwifi: mvm: fix potential memory leak Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 30/42] RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task" Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 31/42] dmaengine: at_xdmac: disable/enable clock directly on suspend/resume Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 32/42] dmaengine: at_xdmac: do not resume channels paused by consumers Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 33/42] dmaengine: at_xdmac: restore the content of grws register Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 34/42] octeontx2-af: Add validation for lmac type Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 35/42] drm/amd: Dont allow s0ix on APUs older than Raven Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 36/42] bluetooth: Add cmd validity checks at the start of hci_sock_ioctl() Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 37/42] Revert "thermal/drivers/mellanox: Use generic thermal_zone_get_trip() function" Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 38/42] block: fix bio-cache for passthru IO Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 39/42] cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf() Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 40/42] cpufreq: amd-pstate: Add ->fast_switch() callback Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 41/42] netfilter: ctnetlink: Support offloaded conntrack entry deletion Greg Kroah-Hartman
2023-06-01 13:21 ` [PATCH 6.1 42/42] tools headers UAPI: Sync the linux/in.h with the kernel sources Greg Kroah-Hartman
2023-06-01 14:11 ` [PATCH 6.1 00/42] 6.1.32-rc1 review Naresh Kamboju
2023-06-01 14:26   ` Greg Kroah-Hartman
2023-06-01 14:33     ` Greg Kroah-Hartman
2023-06-01 14:39     ` Guenter Roeck
2023-06-01 17:41       ` Greg Kroah-Hartman
2023-06-01 20:33 ` Shuah Khan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230601131939.731926934@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=daniel@iogearbox.net \
    --cc=jakub@cloudflare.com \
    --cc=john.fastabend@gmail.com \
    --cc=patches@lists.linux.dev \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=will@isovalent.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).