All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf 0/3] sockmap fixes picked up by stress tests
@ 2021-07-19 21:48 John Fastabend
  2021-07-19 21:48 ` [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser John Fastabend
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: John Fastabend @ 2021-07-19 21:48 UTC (permalink / raw)
  To: jakub, daniel, xiyou.wangcong, alexei.starovoitov
  Cc: john.fastabend, bpf, netdev

Running stress tests with recent patch to remove an extra lock in sockmap
resulted in a couple new issues popping up. It seems only one of them
is actually related to the patch:

799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")

The other two issues had existed long before, but I guess the timing
with the serialization we had before was too tight to get any of
our tests or deployments to hit it.

With attached series stress testing sockmap+TCP with workloads that
create lots of short-lived connections no more splats like below were
seen on upstream bpf branch.

[224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
[224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
[224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
[224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
[224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
[224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
[224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
[224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
[224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
[224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
[224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
[224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
[224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
[224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[224913.936007] Call Trace:
[224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
[224913.936033]  __tcp_close+0x620/0x790
[224913.936047]  tcp_close+0x20/0x80
[224913.936056]  inet_release+0x8f/0xf0
[224913.936070]  __sock_release+0x72/0x120

John Fastabend (3):
  bpf, sockmap: zap ingress queues after stopping strparser
  bpf, sockmap: on cleanup we additionally need to remove cached skb
  bpf, sockmap: fix memleak on ingress msg enqueue

 include/linux/skmsg.h | 54 ++++++++++++++++++++++++++++---------------
 net/core/skmsg.c      | 37 +++++++++++++++++++++--------
 2 files changed, 62 insertions(+), 29 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser
  2021-07-19 21:48 [PATCH bpf 0/3] sockmap fixes picked up by stress tests John Fastabend
@ 2021-07-19 21:48 ` John Fastabend
  2021-07-20 10:27   ` Jakub Sitnicki
  2021-07-19 21:48 ` [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb John Fastabend
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: John Fastabend @ 2021-07-19 21:48 UTC (permalink / raw)
  To: jakub, daniel, xiyou.wangcong, alexei.starovoitov
  Cc: john.fastabend, bpf, netdev

We don't want strparser to run and pass skbs into skmsg handlers when
the psock is null. We just sk_drop them in this case. When removing
a live socket from map it means extra drops that we do not need to
incur. Move the zap below strparser close to avoid this condition.

This way we stop the stream parser first stopping it from processing
packets and then delete the psock.

Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 15d71288e741..28115ef742e8 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -773,8 +773,6 @@ static void sk_psock_destroy(struct work_struct *work)
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
-	sk_psock_stop(psock, false);
-
 	write_lock_bh(&sk->sk_callback_lock);
 	sk_psock_restore_proto(sk, psock);
 	rcu_assign_sk_user_data(sk, NULL);
@@ -784,6 +782,8 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 		sk_psock_stop_verdict(sk, psock);
 	write_unlock_bh(&sk->sk_callback_lock);
 
+	sk_psock_stop(psock, false);
+
 	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
 	queue_rcu_work(system_wq, &psock->rwork);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb
  2021-07-19 21:48 [PATCH bpf 0/3] sockmap fixes picked up by stress tests John Fastabend
  2021-07-19 21:48 ` [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser John Fastabend
@ 2021-07-19 21:48 ` John Fastabend
  2021-07-21 10:13   ` Jakub Sitnicki
  2021-07-19 21:48 ` [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue John Fastabend
  2021-07-21 16:38 ` [PATCH bpf 0/3] sockmap fixes picked up by stress tests Jakub Sitnicki
  3 siblings, 1 reply; 13+ messages in thread
From: John Fastabend @ 2021-07-19 21:48 UTC (permalink / raw)
  To: jakub, daniel, xiyou.wangcong, alexei.starovoitov
  Cc: john.fastabend, bpf, netdev

Its possible if a socket is closed and the receive thread is under memory
pressure it may have cached a skb. We need to ensure these skbs are
free'd along with the normal ingress_skb queue.

Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
down and backlog processing both had sock_lock for the common case of
socket close or unhash. So it was not possible to have both running in
parrallel so all we would need is the kfree in those kernels.

But, latest kernels include the commit 799aa7f98d5e and this requires a
bit more work. Without the ingress_lock guarding reading/writing the
state->skb case its possible the tear down could run before the state
update causing it to leak memory or worse when the backlog reads the state
it could potentially run interleaved with the tear down and we might end up
free'ing the state->skb from tear down side but already have the reference
from backlog side. To resolve such races we wrap accesses in ingress_lock
on both sides serializing tear down and backlog case. In both cases this
only happens after an EAGAIN error case so having an extra lock in place
is likely fine. The normal path will skip the locks.

Note, we check state->skb before grabbing lock. This works because
we can only enqueue with the mutex we hold already. Avoiding a race
on adding state->skb after the check. And if tear down path is running
that is also fine if the tear down path then removes state->skb we
will simply set skb=NULL and the subsequent goto is skipped. This
slight complication avoids locking in normal case.

With this fix we no longer see this warning splat from tcp side on
socket close when we hit the above case with redirect to ingress self.

[224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
[224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
[224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
[224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
[224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
[224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
[224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
[224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
[224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
[224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
[224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
[224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
[224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
[224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[224913.936007] Call Trace:
[224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
[224913.936033]  __tcp_close+0x620/0x790
[224913.936047]  tcp_close+0x20/0x80
[224913.936056]  inet_release+0x8f/0xf0
[224913.936070]  __sock_release+0x72/0x120
[224913.936083]  sock_close+0x14/0x20

Reported-by: Jussi Maki <joamaki@gmail.com>
Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 28115ef742e8..5d956e91d05a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -590,6 +590,22 @@ static void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static void sk_psock_skb_state(struct sk_psock *psock,
+			       struct sk_psock_work_state *state,
+			       struct sk_buff *skb,
+			       int len, int off)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+		state->skb = skb;
+		state->len = len;
+		state->off = off;
+	} else {
+		sock_drop(psock->sk, skb);
+	}
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
 static void sk_psock_backlog(struct work_struct *work)
 {
 	struct sk_psock *psock = container_of(work, struct sk_psock, work);
@@ -600,13 +616,16 @@ static void sk_psock_backlog(struct work_struct *work)
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
-	if (state->skb) {
+	if (unlikely(state->skb)) {
+		spin_lock_bh(&psock->ingress_lock);
 		skb = state->skb;
 		len = state->len;
 		off = state->off;
 		state->skb = NULL;
-		goto start;
+		spin_unlock_bh(&psock->ingress_lock);
 	}
+	if (skb)
+		goto start;
 
 	while ((skb = skb_dequeue(&psock->ingress_skb))) {
 		len = skb->len;
@@ -621,9 +640,8 @@ static void sk_psock_backlog(struct work_struct *work)
 							  len, ingress);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
-					state->skb = skb;
-					state->len = len;
-					state->off = off;
+					sk_psock_skb_state(psock, state, skb,
+							   len, off);
 					goto end;
 				}
 				/* Hard errors break pipe and stop xmit. */
@@ -722,6 +740,11 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
 		skb_bpf_redirect_clear(skb);
 		sock_drop(psock->sk, skb);
 	}
+	kfree_skb(psock->work_state.skb);
+	/* We null the skb here to ensure that calls to sk_psock_backlog
+	 * do not pick up the free'd skb.
+	 */
+	psock->work_state.skb = NULL;
 	__sk_psock_purge_ingress_msg(psock);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue
  2021-07-19 21:48 [PATCH bpf 0/3] sockmap fixes picked up by stress tests John Fastabend
  2021-07-19 21:48 ` [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser John Fastabend
  2021-07-19 21:48 ` [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb John Fastabend
@ 2021-07-19 21:48 ` John Fastabend
  2021-07-21 16:36   ` Jakub Sitnicki
  2021-07-21 18:02   ` Martin KaFai Lau
  2021-07-21 16:38 ` [PATCH bpf 0/3] sockmap fixes picked up by stress tests Jakub Sitnicki
  3 siblings, 2 replies; 13+ messages in thread
From: John Fastabend @ 2021-07-19 21:48 UTC (permalink / raw)
  To: jakub, daniel, xiyou.wangcong, alexei.starovoitov
  Cc: john.fastabend, bpf, netdev

If backlog handler is running during a tear down operation we may enqueue
data on the ingress msg queue while tear down is trying to free it.

 sk_psock_backlog()
   sk_psock_handle_skb()
     skb_psock_skb_ingress()
       sk_psock_skb_ingress_enqueue()
         sk_psock_queue_msg(psock,msg)
                                           spin_lock(ingress_lock)
                                            sk_psock_zap_ingress()
                                             _sk_psock_purge_ingerss_msg()
                                              _sk_psock_purge_ingress_msg()
                                            -- free ingress_msg list --
                                           spin_unlock(ingress_lock)
           spin_lock(ingress_lock)
           list_add_tail(msg,ingress_msg) <- entry on list with no on
                                             left to free it.
           spin_unlock(ingress_lock)

To fix we only enqueue from backlog if the ENABLED bit is set. The tear
down logic clears the bit with ingress_lock set so we wont enqueue the
msg in the last step.

Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h | 54 ++++++++++++++++++++++++++++---------------
 net/core/skmsg.c      |  6 -----
 2 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 96f319099744..883638888f93 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -285,11 +285,45 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
 	return rcu_dereference_sk_user_data(sk);
 }
 
+static inline void sk_psock_set_state(struct sk_psock *psock,
+				      enum sk_psock_state_bits bit)
+{
+	set_bit(bit, &psock->state);
+}
+
+static inline void sk_psock_clear_state(struct sk_psock *psock,
+					enum sk_psock_state_bits bit)
+{
+	clear_bit(bit, &psock->state);
+}
+
+static inline bool sk_psock_test_state(const struct sk_psock *psock,
+				       enum sk_psock_state_bits bit)
+{
+	return test_bit(bit, &psock->state);
+}
+
+static void sock_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	kfree_skb(skb);
+}
+
+static inline void drop_sk_msg(struct sk_psock *psock, struct sk_msg *msg)
+{
+	if (msg->skb)
+		sock_drop(psock->sk, msg->skb);
+	kfree(msg);
+}
+
 static inline void sk_psock_queue_msg(struct sk_psock *psock,
 				      struct sk_msg *msg)
 {
 	spin_lock_bh(&psock->ingress_lock);
-	list_add_tail(&msg->list, &psock->ingress_msg);
+        if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+		list_add_tail(&msg->list, &psock->ingress_msg);
+	else
+		drop_sk_msg(psock, msg);
 	spin_unlock_bh(&psock->ingress_lock);
 }
 
@@ -406,24 +440,6 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 		psock->psock_update_sk_prot(sk, psock, true);
 }
 
-static inline void sk_psock_set_state(struct sk_psock *psock,
-				      enum sk_psock_state_bits bit)
-{
-	set_bit(bit, &psock->state);
-}
-
-static inline void sk_psock_clear_state(struct sk_psock *psock,
-					enum sk_psock_state_bits bit)
-{
-	clear_bit(bit, &psock->state);
-}
-
-static inline bool sk_psock_test_state(const struct sk_psock *psock,
-				       enum sk_psock_state_bits bit)
-{
-	return test_bit(bit, &psock->state);
-}
-
 static inline struct sk_psock *sk_psock_get(struct sock *sk)
 {
 	struct sk_psock *psock;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 5d956e91d05a..3ee407bed768 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -584,12 +584,6 @@ static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
 	return sk_psock_skb_ingress(psock, skb);
 }
 
-static void sock_drop(struct sock *sk, struct sk_buff *skb)
-{
-	sk_drops_add(sk, skb);
-	kfree_skb(skb);
-}
-
 static void sk_psock_skb_state(struct sk_psock *psock,
 			       struct sk_psock_work_state *state,
 			       struct sk_buff *skb,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser
  2021-07-19 21:48 ` [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser John Fastabend
@ 2021-07-20 10:27   ` Jakub Sitnicki
  2021-07-20 17:41     ` John Fastabend
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Sitnicki @ 2021-07-20 10:27 UTC (permalink / raw)
  To: John Fastabend; +Cc: daniel, xiyou.wangcong, alexei.starovoitov, bpf, netdev

On Mon, Jul 19, 2021 at 11:48 PM CEST, John Fastabend wrote:
> We don't want strparser to run and pass skbs into skmsg handlers when
> the psock is null. We just sk_drop them in this case. When removing
> a live socket from map it means extra drops that we do not need to
> incur. Move the zap below strparser close to avoid this condition.
>
> This way we stop the stream parser first stopping it from processing
> packets and then delete the psock.
>
> Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

To confirm my understanding - the extra drops can happen because
currently we are racing to clear SK_PSOCK_TX_ENABLED flag in
sk_psock_drop with sk_psock_verdict_apply, which checks the flag before
pushing skb onto psock->ingress_skb queue (or possibly straight into
psock->ingress_msg queue on no redirect).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb
  2021-07-19 21:48 ` [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb John Fastabend
  2021-07-21 10:13   ` Jakub Sitnicki
@ 2021-07-21  9:38 ` Dan Carpenter
  0 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2021-07-20 10:53 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 5474 bytes --]

CC: kbuild-all(a)lists.01.org
In-Reply-To: <20210719214834.125484-3-john.fastabend@gmail.com>
References: <20210719214834.125484-3-john.fastabend@gmail.com>
TO: John Fastabend <john.fastabend@gmail.com>
TO: jakub(a)cloudflare.com
TO: daniel(a)iogearbox.net
TO: xiyou.wangcong(a)gmail.com
TO: alexei.starovoitov(a)gmail.com
CC: john.fastabend(a)gmail.com
CC: bpf(a)vger.kernel.org
CC: netdev(a)vger.kernel.org

Hi John,

I love your patch! Perhaps something to improve:

[auto build test WARNING on bpf/master]

url:    https://github.com/0day-ci/linux/commits/John-Fastabend/sockmap-fixes-picked-up-by-stress-tests/20210720-144138
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
:::::: branch date: 4 hours ago
:::::: commit date: 4 hours ago
config: i386-randconfig-m021-20210720 (attached as .config)
compiler: gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/core/skmsg.c:627 sk_psock_backlog() error: uninitialized symbol 'skb'.
net/core/skmsg.c:639 sk_psock_backlog() error: uninitialized symbol 'off'.
net/core/skmsg.c:640 sk_psock_backlog() error: uninitialized symbol 'len'.

vim +/skb +627 net/core/skmsg.c

d1f6b1c794e27f John Fastabend  2021-07-19  608  
604326b41a6fb9 Daniel Borkmann 2018-10-13  609  static void sk_psock_backlog(struct work_struct *work)
604326b41a6fb9 Daniel Borkmann 2018-10-13  610  {
604326b41a6fb9 Daniel Borkmann 2018-10-13  611  	struct sk_psock *psock = container_of(work, struct sk_psock, work);
604326b41a6fb9 Daniel Borkmann 2018-10-13  612  	struct sk_psock_work_state *state = &psock->work_state;
604326b41a6fb9 Daniel Borkmann 2018-10-13  613  	struct sk_buff *skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  614  	bool ingress;
604326b41a6fb9 Daniel Borkmann 2018-10-13  615  	u32 len, off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  616  	int ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  617  
799aa7f98d53e0 Cong Wang       2021-03-30  618  	mutex_lock(&psock->work_mutex);
d1f6b1c794e27f John Fastabend  2021-07-19  619  	if (unlikely(state->skb)) {
d1f6b1c794e27f John Fastabend  2021-07-19  620  		spin_lock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  621  		skb = state->skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  622  		len = state->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  623  		off = state->off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  624  		state->skb = NULL;
d1f6b1c794e27f John Fastabend  2021-07-19  625  		spin_unlock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  626  	}
d1f6b1c794e27f John Fastabend  2021-07-19 @627  	if (skb)
d1f6b1c794e27f John Fastabend  2021-07-19  628  		goto start;
604326b41a6fb9 Daniel Borkmann 2018-10-13  629  
604326b41a6fb9 Daniel Borkmann 2018-10-13  630  	while ((skb = skb_dequeue(&psock->ingress_skb))) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  631  		len = skb->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  632  		off = 0;
604326b41a6fb9 Daniel Borkmann 2018-10-13  633  start:
e3526bb92a2084 Cong Wang       2021-02-23  634  		ingress = skb_bpf_ingress(skb);
e3526bb92a2084 Cong Wang       2021-02-23  635  		skb_bpf_redirect_clear(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  636  		do {
604326b41a6fb9 Daniel Borkmann 2018-10-13  637  			ret = -EIO;
799aa7f98d53e0 Cong Wang       2021-03-30  638  			if (!sock_flag(psock->sk, SOCK_DEAD))
604326b41a6fb9 Daniel Borkmann 2018-10-13 @639  				ret = sk_psock_handle_skb(psock, skb, off,
604326b41a6fb9 Daniel Borkmann 2018-10-13 @640  							  len, ingress);
604326b41a6fb9 Daniel Borkmann 2018-10-13  641  			if (ret <= 0) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  642  				if (ret == -EAGAIN) {
d1f6b1c794e27f John Fastabend  2021-07-19  643  					sk_psock_skb_state(psock, state, skb,
d1f6b1c794e27f John Fastabend  2021-07-19  644  							   len, off);
604326b41a6fb9 Daniel Borkmann 2018-10-13  645  					goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  646  				}
604326b41a6fb9 Daniel Borkmann 2018-10-13  647  				/* Hard errors break pipe and stop xmit. */
604326b41a6fb9 Daniel Borkmann 2018-10-13  648  				sk_psock_report_error(psock, ret ? -ret : EPIPE);
604326b41a6fb9 Daniel Borkmann 2018-10-13  649  				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
781dd0431eb549 Cong Wang       2021-06-14  650  				sock_drop(psock->sk, skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  651  				goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  652  			}
604326b41a6fb9 Daniel Borkmann 2018-10-13  653  			off += ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  654  			len -= ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  655  		} while (len);
604326b41a6fb9 Daniel Borkmann 2018-10-13  656  
604326b41a6fb9 Daniel Borkmann 2018-10-13  657  		if (!ingress)
604326b41a6fb9 Daniel Borkmann 2018-10-13  658  			kfree_skb(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  659  	}
604326b41a6fb9 Daniel Borkmann 2018-10-13  660  end:
799aa7f98d53e0 Cong Wang       2021-03-30  661  	mutex_unlock(&psock->work_mutex);
604326b41a6fb9 Daniel Borkmann 2018-10-13  662  }
604326b41a6fb9 Daniel Borkmann 2018-10-13  663  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 39368 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser
  2021-07-20 10:27   ` Jakub Sitnicki
@ 2021-07-20 17:41     ` John Fastabend
  0 siblings, 0 replies; 13+ messages in thread
From: John Fastabend @ 2021-07-20 17:41 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: daniel, xiyou.wangcong, alexei.starovoitov, bpf, netdev

Jakub Sitnicki wrote:
> On Mon, Jul 19, 2021 at 11:48 PM CEST, John Fastabend wrote:
> > We don't want strparser to run and pass skbs into skmsg handlers when
> > the psock is null. We just sk_drop them in this case. When removing
> > a live socket from map it means extra drops that we do not need to
> > incur. Move the zap below strparser close to avoid this condition.
> >
> > This way we stop the stream parser first stopping it from processing
> > packets and then delete the psock.
> >
> > Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> 
> To confirm my understanding - the extra drops can happen because
> currently we are racing to clear SK_PSOCK_TX_ENABLED flag in
> sk_psock_drop with sk_psock_verdict_apply, which checks the flag before
> pushing skb onto psock->ingress_skb queue (or possibly straight into
> psock->ingress_msg queue on no redirect).


Correct. If strparser hands a skb to the sk_psock_* handlers then before
they enqueue the packet the flag is checked and the packet will be
dropped in this case. I noticed this while testing. So rather than
letting packets get into sk_psock_* handlers this patch just stops the
strparser.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb
@ 2021-07-21  9:38 ` Dan Carpenter
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Carpenter @ 2021-07-21  9:38 UTC (permalink / raw)
  To: kbuild, John Fastabend, jakub, daniel, xiyou.wangcong,
	alexei.starovoitov
  Cc: lkp, kbuild-all, john.fastabend, bpf, netdev

Hi John,

url:    https://github.com/0day-ci/linux/commits/John-Fastabend/sockmap-fixes-picked-up-by-stress-tests/20210720-144138
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
config: i386-randconfig-m021-20210720 (attached as .config)
compiler: gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/core/skmsg.c:627 sk_psock_backlog() error: uninitialized symbol 'skb'.
net/core/skmsg.c:639 sk_psock_backlog() error: uninitialized symbol 'off'.
net/core/skmsg.c:640 sk_psock_backlog() error: uninitialized symbol 'len'.

vim +/skb +627 net/core/skmsg.c

604326b41a6fb9 Daniel Borkmann 2018-10-13  609  static void sk_psock_backlog(struct work_struct *work)
604326b41a6fb9 Daniel Borkmann 2018-10-13  610  {
604326b41a6fb9 Daniel Borkmann 2018-10-13  611  	struct sk_psock *psock = container_of(work, struct sk_psock, work);
604326b41a6fb9 Daniel Borkmann 2018-10-13  612  	struct sk_psock_work_state *state = &psock->work_state;
604326b41a6fb9 Daniel Borkmann 2018-10-13  613  	struct sk_buff *skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  614  	bool ingress;
604326b41a6fb9 Daniel Borkmann 2018-10-13  615  	u32 len, off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  616  	int ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  617  
799aa7f98d53e0 Cong Wang       2021-03-30  618  	mutex_lock(&psock->work_mutex);
d1f6b1c794e27f John Fastabend  2021-07-19  619  	if (unlikely(state->skb)) {
d1f6b1c794e27f John Fastabend  2021-07-19  620  		spin_lock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  621  		skb = state->skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  622  		len = state->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  623  		off = state->off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  624  		state->skb = NULL;
d1f6b1c794e27f John Fastabend  2021-07-19  625  		spin_unlock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  626  	}

skb uninitialized on else path.

d1f6b1c794e27f John Fastabend  2021-07-19 @627  	if (skb)
d1f6b1c794e27f John Fastabend  2021-07-19  628  		goto start;
604326b41a6fb9 Daniel Borkmann 2018-10-13  629  
604326b41a6fb9 Daniel Borkmann 2018-10-13  630  	while ((skb = skb_dequeue(&psock->ingress_skb))) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  631  		len = skb->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  632  		off = 0;
604326b41a6fb9 Daniel Borkmann 2018-10-13  633  start:
e3526bb92a2084 Cong Wang       2021-02-23  634  		ingress = skb_bpf_ingress(skb);
e3526bb92a2084 Cong Wang       2021-02-23  635  		skb_bpf_redirect_clear(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  636  		do {
604326b41a6fb9 Daniel Borkmann 2018-10-13  637  			ret = -EIO;
799aa7f98d53e0 Cong Wang       2021-03-30  638  			if (!sock_flag(psock->sk, SOCK_DEAD))
604326b41a6fb9 Daniel Borkmann 2018-10-13 @639  				ret = sk_psock_handle_skb(psock, skb, off,
604326b41a6fb9 Daniel Borkmann 2018-10-13 @640  							  len, ingress);
604326b41a6fb9 Daniel Borkmann 2018-10-13  641  			if (ret <= 0) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  642  				if (ret == -EAGAIN) {
d1f6b1c794e27f John Fastabend  2021-07-19  643  					sk_psock_skb_state(psock, state, skb,
d1f6b1c794e27f John Fastabend  2021-07-19  644  							   len, off);
604326b41a6fb9 Daniel Borkmann 2018-10-13  645  					goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  646  				}
604326b41a6fb9 Daniel Borkmann 2018-10-13  647  				/* Hard errors break pipe and stop xmit. */
604326b41a6fb9 Daniel Borkmann 2018-10-13  648  				sk_psock_report_error(psock, ret ? -ret : EPIPE);
604326b41a6fb9 Daniel Borkmann 2018-10-13  649  				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
781dd0431eb549 Cong Wang       2021-06-14  650  				sock_drop(psock->sk, skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  651  				goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  652  			}
604326b41a6fb9 Daniel Borkmann 2018-10-13  653  			off += ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  654  			len -= ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  655  		} while (len);
604326b41a6fb9 Daniel Borkmann 2018-10-13  656  
604326b41a6fb9 Daniel Borkmann 2018-10-13  657  		if (!ingress)
604326b41a6fb9 Daniel Borkmann 2018-10-13  658  			kfree_skb(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  659  	}
604326b41a6fb9 Daniel Borkmann 2018-10-13  660  end:
799aa7f98d53e0 Cong Wang       2021-03-30  661  	mutex_unlock(&psock->work_mutex);
604326b41a6fb9 Daniel Borkmann 2018-10-13  662  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb
@ 2021-07-21  9:38 ` Dan Carpenter
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Carpenter @ 2021-07-21  9:38 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4828 bytes --]

Hi John,

url:    https://github.com/0day-ci/linux/commits/John-Fastabend/sockmap-fixes-picked-up-by-stress-tests/20210720-144138
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git master
config: i386-randconfig-m021-20210720 (attached as .config)
compiler: gcc-10 (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

smatch warnings:
net/core/skmsg.c:627 sk_psock_backlog() error: uninitialized symbol 'skb'.
net/core/skmsg.c:639 sk_psock_backlog() error: uninitialized symbol 'off'.
net/core/skmsg.c:640 sk_psock_backlog() error: uninitialized symbol 'len'.

vim +/skb +627 net/core/skmsg.c

604326b41a6fb9 Daniel Borkmann 2018-10-13  609  static void sk_psock_backlog(struct work_struct *work)
604326b41a6fb9 Daniel Borkmann 2018-10-13  610  {
604326b41a6fb9 Daniel Borkmann 2018-10-13  611  	struct sk_psock *psock = container_of(work, struct sk_psock, work);
604326b41a6fb9 Daniel Borkmann 2018-10-13  612  	struct sk_psock_work_state *state = &psock->work_state;
604326b41a6fb9 Daniel Borkmann 2018-10-13  613  	struct sk_buff *skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  614  	bool ingress;
604326b41a6fb9 Daniel Borkmann 2018-10-13  615  	u32 len, off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  616  	int ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  617  
799aa7f98d53e0 Cong Wang       2021-03-30  618  	mutex_lock(&psock->work_mutex);
d1f6b1c794e27f John Fastabend  2021-07-19  619  	if (unlikely(state->skb)) {
d1f6b1c794e27f John Fastabend  2021-07-19  620  		spin_lock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  621  		skb = state->skb;
604326b41a6fb9 Daniel Borkmann 2018-10-13  622  		len = state->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  623  		off = state->off;
604326b41a6fb9 Daniel Borkmann 2018-10-13  624  		state->skb = NULL;
d1f6b1c794e27f John Fastabend  2021-07-19  625  		spin_unlock_bh(&psock->ingress_lock);
604326b41a6fb9 Daniel Borkmann 2018-10-13  626  	}

skb uninitialized on else path.

d1f6b1c794e27f John Fastabend  2021-07-19 @627  	if (skb)
d1f6b1c794e27f John Fastabend  2021-07-19  628  		goto start;
604326b41a6fb9 Daniel Borkmann 2018-10-13  629  
604326b41a6fb9 Daniel Borkmann 2018-10-13  630  	while ((skb = skb_dequeue(&psock->ingress_skb))) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  631  		len = skb->len;
604326b41a6fb9 Daniel Borkmann 2018-10-13  632  		off = 0;
604326b41a6fb9 Daniel Borkmann 2018-10-13  633  start:
e3526bb92a2084 Cong Wang       2021-02-23  634  		ingress = skb_bpf_ingress(skb);
e3526bb92a2084 Cong Wang       2021-02-23  635  		skb_bpf_redirect_clear(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  636  		do {
604326b41a6fb9 Daniel Borkmann 2018-10-13  637  			ret = -EIO;
799aa7f98d53e0 Cong Wang       2021-03-30  638  			if (!sock_flag(psock->sk, SOCK_DEAD))
604326b41a6fb9 Daniel Borkmann 2018-10-13 @639  				ret = sk_psock_handle_skb(psock, skb, off,
604326b41a6fb9 Daniel Borkmann 2018-10-13 @640  							  len, ingress);
604326b41a6fb9 Daniel Borkmann 2018-10-13  641  			if (ret <= 0) {
604326b41a6fb9 Daniel Borkmann 2018-10-13  642  				if (ret == -EAGAIN) {
d1f6b1c794e27f John Fastabend  2021-07-19  643  					sk_psock_skb_state(psock, state, skb,
d1f6b1c794e27f John Fastabend  2021-07-19  644  							   len, off);
604326b41a6fb9 Daniel Borkmann 2018-10-13  645  					goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  646  				}
604326b41a6fb9 Daniel Borkmann 2018-10-13  647  				/* Hard errors break pipe and stop xmit. */
604326b41a6fb9 Daniel Borkmann 2018-10-13  648  				sk_psock_report_error(psock, ret ? -ret : EPIPE);
604326b41a6fb9 Daniel Borkmann 2018-10-13  649  				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
781dd0431eb549 Cong Wang       2021-06-14  650  				sock_drop(psock->sk, skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  651  				goto end;
604326b41a6fb9 Daniel Borkmann 2018-10-13  652  			}
604326b41a6fb9 Daniel Borkmann 2018-10-13  653  			off += ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  654  			len -= ret;
604326b41a6fb9 Daniel Borkmann 2018-10-13  655  		} while (len);
604326b41a6fb9 Daniel Borkmann 2018-10-13  656  
604326b41a6fb9 Daniel Borkmann 2018-10-13  657  		if (!ingress)
604326b41a6fb9 Daniel Borkmann 2018-10-13  658  			kfree_skb(skb);
604326b41a6fb9 Daniel Borkmann 2018-10-13  659  	}
604326b41a6fb9 Daniel Borkmann 2018-10-13  660  end:
799aa7f98d53e0 Cong Wang       2021-03-30  661  	mutex_unlock(&psock->work_mutex);
604326b41a6fb9 Daniel Borkmann 2018-10-13  662  }

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb
  2021-07-19 21:48 ` [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb John Fastabend
@ 2021-07-21 10:13   ` Jakub Sitnicki
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2021-07-21 10:13 UTC (permalink / raw)
  To: John Fastabend; +Cc: daniel, xiyou.wangcong, alexei.starovoitov, bpf, netdev

On Mon, Jul 19, 2021 at 11:48 PM CEST, John Fastabend wrote:
> Its possible if a socket is closed and the receive thread is under memory
> pressure it may have cached a skb. We need to ensure these skbs are
> free'd along with the normal ingress_skb queue.
>
> Before 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()") tear
> down and backlog processing both had sock_lock for the common case of
> socket close or unhash. So it was not possible to have both running in
> parrallel so all we would need is the kfree in those kernels.
>
> But, latest kernels include the commit 799aa7f98d5e and this requires a
> bit more work. Without the ingress_lock guarding reading/writing the
> state->skb case its possible the tear down could run before the state
> update causing it to leak memory or worse when the backlog reads the state
> it could potentially run interleaved with the tear down and we might end up
> free'ing the state->skb from tear down side but already have the reference
> from backlog side. To resolve such races we wrap accesses in ingress_lock
> on both sides serializing tear down and backlog case. In both cases this
> only happens after an EAGAIN error case so having an extra lock in place
> is likely fine. The normal path will skip the locks.
>
> Note, we check state->skb before grabbing lock. This works because
> we can only enqueue with the mutex we hold already. Avoiding a race
> on adding state->skb after the check. And if tear down path is running
> that is also fine if the tear down path then removes state->skb we
> will simply set skb=NULL and the subsequent goto is skipped. This
> slight complication avoids locking in normal case.
>
> With this fix we no longer see this warning splat from tcp side on
> socket close when we hit the above case with redirect to ingress self.
>
> [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
> [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
> [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
> [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
> [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
> [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
> [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
> [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
> [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
> [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
> [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
> [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
> [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
> [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
> [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [224913.936007] Call Trace:
> [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
> [224913.936033]  __tcp_close+0x620/0x790
> [224913.936047]  tcp_close+0x20/0x80
> [224913.936056]  inet_release+0x8f/0xf0
> [224913.936070]  __sock_release+0x72/0x120
> [224913.936083]  sock_close+0x14/0x20
>
> Reported-by: Jussi Maki <joamaki@gmail.com>
> Fixes: a136678c0bdbb ("bpf: sk_msg, zap ingress queue on psock down")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

This looks fine to me.

I've played around with the idea of wrapping read & write access to
psock->work_state in helpers with set/steal semantics. Building on an
example from net/core/fib_rules.c where nla_get_kuid_range() returns a
(u32, u32) pair. But I'm not sure if what I ended up with is actually
nicer. Leaving it for your consideration, if you want to use any of it.

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 96f319099744..ed067986a7b5 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -71,12 +71,16 @@ struct sk_psock_link {
 	void				*link_raw;
 };
 
-struct sk_psock_work_state {
-	struct sk_buff			*skb;
+struct sk_psock_skb_slice {
 	u32				len;
 	u32				off;
 };
 
+struct sk_psock_work_state {
+	struct sk_buff			*skb;
+	struct sk_psock_skb_slice	slice;
+};
+
 struct sk_psock {
 	struct sock			*sk;
 	struct sock			*sk_redir;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 15d71288e741..da0542074c24 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -590,40 +590,82 @@ static void sock_drop(struct sock *sk, struct sk_buff *skb)
 	kfree_skb(skb);
 }
 
+static void __sk_psock_skb_state_set(struct sk_psock *psock,
+				     struct sk_buff *skb,
+				     const struct sk_psock_skb_slice *slice)
+{
+	struct sk_psock_work_state *state = &psock->work_state;
+
+	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+		state->skb = skb;
+		state->slice = *slice;
+	} else {
+		sock_drop(psock->sk, skb);
+	}
+}
+
+static void sk_psock_skb_state_set(struct sk_psock *psock,
+				   struct sk_buff *skb,
+				   const struct sk_psock_skb_slice *slice)
+{
+	spin_lock_bh(&psock->ingress_lock);
+	__sk_psock_skb_state_set(psock, skb, slice);
+	spin_unlock_bh(&psock->ingress_lock);
+}
+
+static struct sk_psock_skb_slice __sk_psock_skb_state_steal(struct sk_psock *psock,
+							    struct sk_buff **pskb)
+{
+	struct sk_psock_work_state *state = &psock->work_state;
+
+	*pskb = state->skb;
+	state->skb = NULL;
+
+	return state->slice;
+}
+
+static struct sk_psock_skb_slice sk_psock_skb_state_steal(struct sk_psock *psock,
+							  struct sk_buff **pskb)
+{
+	struct sk_psock_skb_slice ret;
+
+	spin_lock_bh(&psock->ingress_lock);
+	ret = __sk_psock_skb_state_steal(psock, pskb);
+	spin_unlock_bh(&psock->ingress_lock);
+
+	return ret;
+}
+
 static void sk_psock_backlog(struct work_struct *work)
 {
 	struct sk_psock *psock = container_of(work, struct sk_psock, work);
 	struct sk_psock_work_state *state = &psock->work_state;
-	struct sk_buff *skb;
+	struct sk_psock_skb_slice slice = {};
+	struct sk_buff *skb = NULL;
 	bool ingress;
-	u32 len, off;
 	int ret;
 
 	mutex_lock(&psock->work_mutex);
 	if (state->skb) {
-		skb = state->skb;
-		len = state->len;
-		off = state->off;
-		state->skb = NULL;
+		slice = sk_psock_skb_state_steal(psock, &skb);
 		goto start;
 	}
 
 	while ((skb = skb_dequeue(&psock->ingress_skb))) {
-		len = skb->len;
-		off = 0;
+		slice.len = skb->len;
+		slice.off = 0;
 start:
 		ingress = skb_bpf_ingress(skb);
 		skb_bpf_redirect_clear(skb);
 		do {
 			ret = -EIO;
 			if (!sock_flag(psock->sk, SOCK_DEAD))
-				ret = sk_psock_handle_skb(psock, skb, off,
-							  len, ingress);
+				ret = sk_psock_handle_skb(psock, skb, slice.off,
+							  slice.len, ingress);
 			if (ret <= 0) {
 				if (ret == -EAGAIN) {
-					state->skb = skb;
-					state->len = len;
-					state->off = off;
+					sk_psock_skb_state_set(psock, skb,
+							       &slice);
 					goto end;
 				}
 				/* Hard errors break pipe and stop xmit. */
@@ -632,9 +674,9 @@ static void sk_psock_backlog(struct work_struct *work)
 				sock_drop(psock->sk, skb);
 				goto end;
 			}
-			off += ret;
-			len -= ret;
-		} while (len);
+			slice.off += ret;
+			slice.len -= ret;
+		} while (slice.len);
 
 		if (!ingress)
 			kfree_skb(skb);
@@ -723,6 +765,12 @@ static void __sk_psock_zap_ingress(struct sk_psock *psock)
 		sock_drop(psock->sk, skb);
 	}
 	__sk_psock_purge_ingress_msg(psock);
+
+	/* We steal the skb here to ensure that calls to sk_psock_backlog
+	 * do not pick up the free'd skb.
+	 */
+	__sk_psock_skb_state_steal(psock, &skb);
+	kfree_skb(skb);
 }
 
 static void sk_psock_link_destroy(struct sk_psock *psock)

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue
  2021-07-19 21:48 ` [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue John Fastabend
@ 2021-07-21 16:36   ` Jakub Sitnicki
  2021-07-21 18:02   ` Martin KaFai Lau
  1 sibling, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2021-07-21 16:36 UTC (permalink / raw)
  To: John Fastabend; +Cc: daniel, xiyou.wangcong, alexei.starovoitov, bpf, netdev

On Mon, Jul 19, 2021 at 11:48 PM CEST, John Fastabend wrote:
> If backlog handler is running during a tear down operation we may enqueue
> data on the ingress msg queue while tear down is trying to free it.
>
>  sk_psock_backlog()
>    sk_psock_handle_skb()
>      skb_psock_skb_ingress()
>        sk_psock_skb_ingress_enqueue()
>          sk_psock_queue_msg(psock,msg)
>                                            spin_lock(ingress_lock)
>                                             sk_psock_zap_ingress()
>                                              _sk_psock_purge_ingerss_msg()
>                                               _sk_psock_purge_ingress_msg()
>                                             -- free ingress_msg list --
>                                            spin_unlock(ingress_lock)
>            spin_lock(ingress_lock)
>            list_add_tail(msg,ingress_msg) <- entry on list with no on
>                                              left to free it.
>            spin_unlock(ingress_lock)
>
> To fix we only enqueue from backlog if the ENABLED bit is set. The tear
> down logic clears the bit with ingress_lock set so we wont enqueue the
> msg in the last step.
>
> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h | 54 ++++++++++++++++++++++++++++---------------
>  net/core/skmsg.c      |  6 -----
>  2 files changed, 35 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 96f319099744..883638888f93 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -285,11 +285,45 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
>  	return rcu_dereference_sk_user_data(sk);
>  }
>  
> +static inline void sk_psock_set_state(struct sk_psock *psock,
> +				      enum sk_psock_state_bits bit)
> +{
> +	set_bit(bit, &psock->state);
> +}
> +
> +static inline void sk_psock_clear_state(struct sk_psock *psock,
> +					enum sk_psock_state_bits bit)
> +{
> +	clear_bit(bit, &psock->state);
> +}
> +
> +static inline bool sk_psock_test_state(const struct sk_psock *psock,
> +				       enum sk_psock_state_bits bit)
> +{
> +	return test_bit(bit, &psock->state);
> +}
> +
> +static void sock_drop(struct sock *sk, struct sk_buff *skb)
> +{
> +	sk_drops_add(sk, skb);
> +	kfree_skb(skb);
> +}
> +
> +static inline void drop_sk_msg(struct sk_psock *psock, struct sk_msg *msg)
> +{
> +	if (msg->skb)
> +		sock_drop(psock->sk, msg->skb);
> +	kfree(msg);
> +}
> +
>  static inline void sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
>  	spin_lock_bh(&psock->ingress_lock);
> -	list_add_tail(&msg->list, &psock->ingress_msg);
> +        if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))


Whitespace issue ^. Otherwise LGTM.

> +		list_add_tail(&msg->list, &psock->ingress_msg);
> +	else
> +		drop_sk_msg(psock, msg);
>  	spin_unlock_bh(&psock->ingress_lock);
>  }
>  

[...]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 0/3] sockmap fixes picked up by stress tests
  2021-07-19 21:48 [PATCH bpf 0/3] sockmap fixes picked up by stress tests John Fastabend
                   ` (2 preceding siblings ...)
  2021-07-19 21:48 ` [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue John Fastabend
@ 2021-07-21 16:38 ` Jakub Sitnicki
  3 siblings, 0 replies; 13+ messages in thread
From: Jakub Sitnicki @ 2021-07-21 16:38 UTC (permalink / raw)
  To: John Fastabend; +Cc: daniel, xiyou.wangcong, alexei.starovoitov, bpf, netdev

On Mon, Jul 19, 2021 at 11:48 PM CEST, John Fastabend wrote:
> Running stress tests with recent patch to remove an extra lock in sockmap
> resulted in a couple new issues popping up. It seems only one of them
> is actually related to the patch:
>
> 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
>
> The other two issues had existed long before, but I guess the timing
> with the serialization we had before was too tight to get any of
> our tests or deployments to hit it.
>
> With attached series stress testing sockmap+TCP with workloads that
> create lots of short-lived connections no more splats like below were
> seen on upstream bpf branch.
>
> [224913.935822] WARNING: CPU: 3 PID: 32100 at net/core/stream.c:208 sk_stream_kill_queues+0x212/0x220
> [224913.935841] Modules linked in: fuse overlay bpf_preload x86_pkg_temp_thermal intel_uncore wmi_bmof squashfs sch_fq_codel efivarfs ip_tables x_tables uas xhci_pci ixgbe mdio xfrm_algo xhci_hcd wmi
> [224913.935897] CPU: 3 PID: 32100 Comm: fgs-bench Tainted: G          I       5.14.0-rc1alu+ #181
> [224913.935908] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
> [224913.935914] RIP: 0010:sk_stream_kill_queues+0x212/0x220
> [224913.935923] Code: 8b 83 20 02 00 00 85 c0 75 20 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 df e8 2b 11 fe ff eb c3 0f 0b e9 7c ff ff ff 0f 0b eb ce <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 0f 1f 44 00 00 41 57 41
> [224913.935932] RSP: 0018:ffff88816271fd38 EFLAGS: 00010206
> [224913.935941] RAX: 0000000000000ae8 RBX: ffff88815acd5240 RCX: dffffc0000000000
> [224913.935948] RDX: 0000000000000003 RSI: 0000000000000ae8 RDI: ffff88815acd5460
> [224913.935954] RBP: ffff88815acd5460 R08: ffffffff955c0ae8 R09: fffffbfff2e6f543
> [224913.935961] R10: ffffffff9737aa17 R11: fffffbfff2e6f542 R12: ffff88815acd5390
> [224913.935967] R13: ffff88815acd5480 R14: ffffffff98d0c080 R15: ffffffff96267500
> [224913.935974] FS:  00007f86e6bd1700(0000) GS:ffff888451cc0000(0000) knlGS:0000000000000000
> [224913.935981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [224913.935988] CR2: 000000c0008eb000 CR3: 00000001020e0005 CR4: 00000000003706e0
> [224913.935994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [224913.936000] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [224913.936007] Call Trace:
> [224913.936016]  inet_csk_destroy_sock+0xba/0x1f0
> [224913.936033]  __tcp_close+0x620/0x790
> [224913.936047]  tcp_close+0x20/0x80
> [224913.936056]  inet_release+0x8f/0xf0
> [224913.936070]  __sock_release+0x72/0x120
>
> John Fastabend (3):
>   bpf, sockmap: zap ingress queues after stopping strparser
>   bpf, sockmap: on cleanup we additionally need to remove cached skb
>   bpf, sockmap: fix memleak on ingress msg enqueue
>
>  include/linux/skmsg.h | 54 ++++++++++++++++++++++++++++---------------
>  net/core/skmsg.c      | 37 +++++++++++++++++++++--------
>  2 files changed, 62 insertions(+), 29 deletions(-)

Except for the uninitialized memory read reported by 0-day CI, this
series LGTM. Feel free to add my stamp to v2:

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue
  2021-07-19 21:48 ` [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue John Fastabend
  2021-07-21 16:36   ` Jakub Sitnicki
@ 2021-07-21 18:02   ` Martin KaFai Lau
  1 sibling, 0 replies; 13+ messages in thread
From: Martin KaFai Lau @ 2021-07-21 18:02 UTC (permalink / raw)
  To: John Fastabend
  Cc: jakub, daniel, xiyou.wangcong, alexei.starovoitov, bpf, netdev

On Mon, Jul 19, 2021 at 02:48:34PM -0700, John Fastabend wrote:
> If backlog handler is running during a tear down operation we may enqueue
> data on the ingress msg queue while tear down is trying to free it.
> 
>  sk_psock_backlog()
>    sk_psock_handle_skb()
>      skb_psock_skb_ingress()
>        sk_psock_skb_ingress_enqueue()
>          sk_psock_queue_msg(psock,msg)
>                                            spin_lock(ingress_lock)
>                                             sk_psock_zap_ingress()
>                                              _sk_psock_purge_ingerss_msg()
>                                               _sk_psock_purge_ingress_msg()
>                                             -- free ingress_msg list --
>                                            spin_unlock(ingress_lock)
>            spin_lock(ingress_lock)
>            list_add_tail(msg,ingress_msg) <- entry on list with no on
s/on/one/

>                                              left to free it.
>            spin_unlock(ingress_lock)
> 
> To fix we only enqueue from backlog if the ENABLED bit is set. The tear
> down logic clears the bit with ingress_lock set so we wont enqueue the
> msg in the last step.
> 
> Fixes: 799aa7f98d53 ("skmsg: Avoid lock_sock() in sk_psock_backlog()")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h | 54 ++++++++++++++++++++++++++++---------------
>  net/core/skmsg.c      |  6 -----
>  2 files changed, 35 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 96f319099744..883638888f93 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -285,11 +285,45 @@ static inline struct sk_psock *sk_psock(const struct sock *sk)
>  	return rcu_dereference_sk_user_data(sk);
>  }
>  
> +static inline void sk_psock_set_state(struct sk_psock *psock,
> +				      enum sk_psock_state_bits bit)
> +{
> +	set_bit(bit, &psock->state);
> +}
> +
> +static inline void sk_psock_clear_state(struct sk_psock *psock,
> +					enum sk_psock_state_bits bit)
> +{
> +	clear_bit(bit, &psock->state);
> +}
> +
> +static inline bool sk_psock_test_state(const struct sk_psock *psock,
> +				       enum sk_psock_state_bits bit)
> +{
> +	return test_bit(bit, &psock->state);
> +}
> +
> +static void sock_drop(struct sock *sk, struct sk_buff *skb)
inline

> +{
> +	sk_drops_add(sk, skb);
> +	kfree_skb(skb);
> +}
> +
> +static inline void drop_sk_msg(struct sk_psock *psock, struct sk_msg *msg)
> +{
> +	if (msg->skb)
> +		sock_drop(psock->sk, msg->skb);
> +	kfree(msg);
> +}
> +
>  static inline void sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
>  	spin_lock_bh(&psock->ingress_lock);
> -	list_add_tail(&msg->list, &psock->ingress_msg);
> +        if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
Indentation is off.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-07-21 18:02 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-19 21:48 [PATCH bpf 0/3] sockmap fixes picked up by stress tests John Fastabend
2021-07-19 21:48 ` [PATCH bpf 1/3] bpf, sockmap: zap ingress queues after stopping strparser John Fastabend
2021-07-20 10:27   ` Jakub Sitnicki
2021-07-20 17:41     ` John Fastabend
2021-07-19 21:48 ` [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb John Fastabend
2021-07-21 10:13   ` Jakub Sitnicki
2021-07-19 21:48 ` [PATCH bpf 3/3] bpf, sockmap: fix memleak on ingress msg enqueue John Fastabend
2021-07-21 16:36   ` Jakub Sitnicki
2021-07-21 18:02   ` Martin KaFai Lau
2021-07-21 16:38 ` [PATCH bpf 0/3] sockmap fixes picked up by stress tests Jakub Sitnicki
2021-07-20 10:53 [PATCH bpf 2/3] bpf, sockmap: on cleanup we additionally need to remove cached skb kernel test robot
2021-07-21  9:38 ` Dan Carpenter
2021-07-21  9:38 ` Dan Carpenter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.