From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	patches@lists.linux.dev,
	John Fastabend <john.fastabend@gmail.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 6.3 36/45] bpf, sockmap: Incorrectly handling copied_seq
Date: Thu,  1 Jun 2023 14:21:32 +0100
Message-ID: <20230601131940.300804076@linuxfoundation.org>
In-Reply-To: <20230601131938.702671708@linuxfoundation.org>

From: John Fastabend <john.fastabend@gmail.com>

[ Upstream commit e5c6de5fa025882babf89cecbed80acf49b987fa ]

The read_skb() logic increments tcp->copied_seq, which is used, among
other things, to calculate how many outstanding bytes the application can
read. This causes application errors: if the application does an
ioctl(FIONREAD), we return zero because the result is calculated from
the copied_seq value.
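
To illustrate why the early update breaks FIONREAD, here is a toy
user-space model, not kernel code. It assumes the readable byte count is
roughly rcv_nxt - copied_seq (ignoring urgent data and FIN handling);
struct model_tcp and the model_*() functions are hypothetical names.

```c
#include <stdint.h>

/* Toy stand-in for the two tcp_sock counters involved; not the kernel struct. */
struct model_tcp {
	uint32_t rcv_nxt;    /* next sequence number expected from the peer */
	uint32_t copied_seq; /* first sequence number not yet read by the user */
};

/* Assumed approximation of what FIONREAD reports for a TCP socket. */
static uint32_t model_inq(const struct model_tcp *tp)
{
	/* Unsigned 32-bit subtraction stays correct across wraparound. */
	return tp->rcv_nxt - tp->copied_seq;
}

/* A segment of len bytes arrives from the network. */
static void model_rcv(struct model_tcp *tp, uint32_t len)
{
	tp->rcv_nxt += len;
}

/* Old behaviour: read_skb() advances copied_seq as soon as it dequeues. */
static void model_read_skb_old(struct model_tcp *tp, uint32_t len)
{
	tp->copied_seq += len;
}
```

In this model, once model_read_skb_old() has run for a segment, model_inq()
reports zero even though the application has not read any of the data yet.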

To fix this, we move the tcp->copied_seq accounting into the recv handler
so that it is updated when the recvmsg() hook is called and data is in
fact copied into user buffers. This gives an accurate FIONREAD value,
as expected, and improves ACK handling. Previously we called
tcp_rcv_space_adjust(), which updates the 'number of bytes copied to
user in last RTT'; that is wrong for programs returning SK_PASS, because
the bytes are only copied to the user when recvmsg() is handled.
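
The sequence bookkeeping that the recv handler now performs can be
sketched as a small user-space function. This is a simplified model of
the tcp_bpf_recvmsg_parser() hunk in this patch, not the kernel code
itself; model_recvmsg_seq() is a hypothetical name, and the blocking
and retry paths are left out.

```c
#include <stdint.h>

/*
 * Return the copied_seq value to store after one recvmsg() call.
 * copied is the number of bytes handed to user space (or a negative
 * error), is_fin is non-zero when the call observed the peer's FIN.
 */
static uint32_t model_recvmsg_seq(uint32_t copied_seq, int copied, int is_fin)
{
	uint32_t seq = copied_seq;

	if (is_fin)
		return seq + 1; /* FIN takes one sequence number, no data is copied */

	if (copied > 0)
		seq += (uint32_t)copied; /* advance only by bytes actually copied */

	return seq;
}
```

The point of the sketch is that copied_seq moves only when bytes reach the
user buffer (or a FIN is consumed), never merely because an skb was dequeued.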

Doing the fix for recvmsg is straightforward, but fixing redirect and
SK_DROP pkts is a bit trickier. Build a tcp_eat_skb() helper and then
call it from the skmsg handlers. This fixes another issue where a broken
socket with a BPF program doing a resubmit could hang the receiver. This
happened because, although read_skb() consumed the skb through
sock_drop(), it did not update the copied_seq. In that case, if a single
recv socket is redirecting to many sockets (for example for load
balancing), the receiver sk will hang even though we might expect it to
continue. The hang comes from not updating the copied_seq numbers and
the memory pressure resulting from that.

We have a slight layering problem in calling tcp_eat_skb() even if it's
not a TCP socket. To fix this we could refactor and create per-type
receiver handlers. I decided this is more work than we want in the fix,
and we already have some small tweaks depending on the caller that use
the helper skb_bpf_strparser(). So we extend that a bit and always set
the strparser bit when it is in use, and then we can gate the
copied_seq updates on this.
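
The gating can be sketched in user space as follows. This is a toy model
of the tcp_eat_skb() checks in the diff below, not the kernel helper:
struct model_skb, struct model_sk and model_eat_skb() are hypothetical
names, and the follow-up calls to tcp_rcv_space_adjust() and
__tcp_cleanup_rbuf() are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy skb: only the length and the strparser marker matter here. */
struct model_skb {
	uint32_t len;
	bool strparser; /* set when the skb came through the strparser path */
};

/* Toy socket: protocol type plus the counter being accounted. */
struct model_sk {
	bool is_tcp;
	uint32_t copied_seq;
};

/* Account a consumed skb; returns the number of bytes accounted. */
static uint32_t model_eat_skb(struct model_sk *sk, const struct model_skb *skb)
{
	/* Nothing to account for empty skbs or non-TCP sockets. */
	if (skb == NULL || skb->len == 0 || !sk->is_tcp)
		return 0;

	/* Skip skbs marked by the strparser path: updates are gated on this bit. */
	if (skb->strparser)
		return 0;

	sk->copied_seq += skb->len;
	return skb->len;
}
```

A single flag on the skb keeps the call sites in skmsg.c unconditional
while the helper itself decides whether any accounting is needed.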

Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/net/tcp.h  | 10 ++++++++++
 net/core/skmsg.c   | 15 +++++++--------
 net/ipv4/tcp.c     | 10 +---------
 net/ipv4/tcp_bpf.c | 28 +++++++++++++++++++++++++++-
 4 files changed, 45 insertions(+), 18 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1ee..76bf0a11bdc77 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1467,6 +1467,8 @@ static inline void tcp_adjust_rcv_ssthresh(struct sock *sk)
 }
 
 void tcp_cleanup_rbuf(struct sock *sk, int copied);
+void __tcp_cleanup_rbuf(struct sock *sk, int copied);
+
 
 /* We provision sk_rcvbuf around 200% of sk_rcvlowat.
  * If 87.5 % (7/8) of the space has been consumed, we want to override
@@ -2323,6 +2325,14 @@ int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
 void tcp_bpf_clone(const struct sock *sk, struct sock *newsk);
 #endif /* CONFIG_BPF_SYSCALL */
 
+#ifdef CONFIG_INET
+void tcp_eat_skb(struct sock *sk, struct sk_buff *skb);
+#else
+static inline void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
+{
+}
+#endif
+
 int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress,
 			  struct sk_msg *msg, u32 bytes, int flags);
 #endif /* CONFIG_NET_SOCK_MSG */
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 08be5f409fb89..a9060e1f0e437 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -979,10 +979,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 		err = -EIO;
 		sk_other = psock->sk;
 		if (sock_flag(sk_other, SOCK_DEAD) ||
-		    !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
-			skb_bpf_redirect_clear(skb);
+		    !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
 			goto out_free;
-		}
 
 		skb_bpf_set_ingress(skb);
 
@@ -1011,18 +1009,19 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 				err = 0;
 			}
 			spin_unlock_bh(&psock->ingress_lock);
-			if (err < 0) {
-				skb_bpf_redirect_clear(skb);
+			if (err < 0)
 				goto out_free;
-			}
 		}
 		break;
 	case __SK_REDIRECT:
+		tcp_eat_skb(psock->sk, skb);
 		err = sk_psock_skb_redirect(psock, skb);
 		break;
 	case __SK_DROP:
 	default:
 out_free:
+		skb_bpf_redirect_clear(skb);
+		tcp_eat_skb(psock->sk, skb);
 		sock_drop(psock->sk, skb);
 	}
 
@@ -1067,8 +1066,7 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 		skb_dst_drop(skb);
 		skb_bpf_redirect_clear(skb);
 		ret = bpf_prog_run_pin_on_cpu(prog, skb);
-		if (ret == SK_PASS)
-			skb_bpf_set_strparser(skb);
+		skb_bpf_set_strparser(skb);
 		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
 		skb->sk = NULL;
 	}
@@ -1176,6 +1174,7 @@ static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
 	psock = sk_psock(sk);
 	if (unlikely(!psock)) {
 		len = 0;
+		tcp_eat_skb(sk, skb);
 		sock_drop(sk, skb);
 		goto out;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 01ea2705deea9..ed63ee8f0d7e3 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1570,7 +1570,7 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
  * calculation of whether or not we must ACK for the sake of
  * a window update.
  */
-static void __tcp_cleanup_rbuf(struct sock *sk, int copied)
+void __tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
@@ -1785,14 +1785,6 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
 			break;
 		}
 	}
-	WRITE_ONCE(tp->copied_seq, seq);
-
-	tcp_rcv_space_adjust(sk);
-
-	/* Clean up data we have read: This will do ACK frames. */
-	if (copied > 0)
-		__tcp_cleanup_rbuf(sk, copied);
-
 	return copied;
 }
 EXPORT_SYMBOL(tcp_read_skb);
diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 01dd76be1a584..5f93918c063c7 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -11,6 +11,24 @@
 #include <net/inet_common.h>
 #include <net/tls.h>
 
+void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tcp;
+	int copied;
+
+	if (!skb || !skb->len || !sk_is_tcp(sk))
+		return;
+
+	if (skb_bpf_strparser(skb))
+		return;
+
+	tcp = tcp_sk(sk);
+	copied = tcp->copied_seq + skb->len;
+	WRITE_ONCE(tcp->copied_seq, copied);
+	tcp_rcv_space_adjust(sk);
+	__tcp_cleanup_rbuf(sk, skb->len);
+}
+
 static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
 			   struct sk_msg *msg, u32 apply_bytes, int flags)
 {
@@ -198,8 +216,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 				  int flags,
 				  int *addr_len)
 {
+	struct tcp_sock *tcp = tcp_sk(sk);
+	u32 seq = tcp->copied_seq;
 	struct sk_psock *psock;
-	int copied;
+	int copied = 0;
 
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
@@ -244,9 +264,11 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 
 		if (is_fin) {
 			copied = 0;
+			seq++;
 			goto out;
 		}
 	}
+	seq += copied;
 	if (!copied) {
 		long timeo;
 		int data;
@@ -284,6 +306,10 @@ static int tcp_bpf_recvmsg_parser(struct sock *sk,
 		copied = -EAGAIN;
 	}
 out:
+	WRITE_ONCE(tcp->copied_seq, seq);
+	tcp_rcv_space_adjust(sk);
+	if (copied > 0)
+		__tcp_cleanup_rbuf(sk, copied);
 	release_sock(sk);
 	sk_psock_put(sk, psock);
 	return copied;
-- 
2.39.2



