BPF Archive on lore.kernel.org
 help / color / Atom feed
* [bpf PATCH v2 0/3] Sockmap RCU splat fix
@ 2020-06-25 23:12 John Fastabend
  2020-06-25 23:12 ` [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS John Fastabend
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: John Fastabend @ 2020-06-25 23:12 UTC (permalink / raw)
  To: kafai, jakub, daniel, ast; +Cc: netdev, bpf, john.fastabend

Fix a splat introduced by recent changes to avoid skipping ingress policy
when kTLS is enabled. The RCU splat was introduced because in the non-TLS
case the caller is wrapped in an rcu_read_lock/unlock. But, in the TLS
case we have a reference to the psock and the caller did not wrap its
call in rcu_read_lock/unlock.

To fix extend the RCU section to include the redirect case which was
missed. From v1->v2 I changed the location a bit to simplify the code
some. See patch 1.

But, then Martin asked why it was not needed in the non-TLS case. The
answer for patch 1 was, as stated above, because the caller has the
rcu read lock. However, there was still a missing case where a BPF
user could in-theory line up a set of parameters to hit a case
where the code was entered from strparser side from a different context
then the initial caller. To hit this user would need a parser program
to return value greater than skb->len then an ENOMEM error could happen
in the strparser codepath triggering strparser to retry from a workqueue
and without rcu_read_lock original caller used. See patch 2 for details.

Finally, we don't actually have any selftests for parser returning a
value geater than skb->len so add one in patch 3. This is especially
needed because at least I don't have any code that uses the parser
to return value greater than skb->len. So I wouldn't have caught any
errors here in my own testing.

Thanks, John

v1->v2: simplify code in patch 1 some and add patches 2 and 3.

---

John Fastabend (3):
      bpf, sockmap: RCU splat with redirect and strparser error or TLS
      bpf, sockmap: RCU dereferenced psock may be used outside RCU block
      bpf, sockmap: Add ingres skb tests that utilize merge skbs


 net/core/skmsg.c                                   |   23 +++++++++++++-------
 .../selftests/bpf/progs/test_sockmap_kern.h        |    8 ++++++-
 tools/testing/selftests/bpf/test_sockmap.c         |   18 ++++++++++++++++
 3 files changed, 40 insertions(+), 9 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS
  2020-06-25 23:12 [bpf PATCH v2 0/3] Sockmap RCU splat fix John Fastabend
@ 2020-06-25 23:12 ` John Fastabend
  2020-06-26  6:10   ` Martin KaFai Lau
  2020-06-26 11:11   ` Jakub Sitnicki
  2020-06-25 23:13 ` [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block John Fastabend
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: John Fastabend @ 2020-06-25 23:12 UTC (permalink / raw)
  To: kafai, jakub, daniel, ast; +Cc: netdev, bpf, john.fastabend

There are two paths to generate the below RCU splat the first and
most obvious is the result of the BPF verdict program issuing a
redirect on a TLS socket (This is the splat shown below). Unlike
the non-TLS case the caller of the *strp_read() hooks does not
wrap the call in a rcu_read_lock/unlock. Then if the BPF program
issues a redirect action we hit the RCU splat.

However, in the non-TLS socket case the splat appears to be
relatively rare, because the skmsg caller into the strp_data_ready()
is wrapped in a rcu_read_lock/unlock. Shown here,

 static void sk_psock_strp_data_ready(struct sock *sk)
 {
	struct sk_psock *psock;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock)) {
		if (tls_sw_has_ctx_rx(sk)) {
			psock->parser.saved_data_ready(sk);
		} else {
			write_lock_bh(&sk->sk_callback_lock);
			strp_data_ready(&psock->parser.strp);
			write_unlock_bh(&sk->sk_callback_lock);
		}
	}
	rcu_read_unlock();
 }

If the above was the only way to run the verdict program we
would be safe. But, there is a case where the strparser may throw an
ENOMEM error while parsing the skb. This is a result of a failed
skb_clone, or alloc_skb_for_msg while building a new merged skb when
the msg length needed spans multiple skbs. This will in turn put the
skb on the strp_wrk workqueue in the strparser code. The skb will
later be dequeued and verdict programs run, but now from a
different context without the rcu_read_lock()/unlock() critical
section in sk_psock_strp_data_ready() shown above. In practice
I have not seen this yet, because as far as I know most users of the
verdict programs are also only working on single skbs. In this case no
merge happens which could trigger the above ENOMEM errors. In addition
the system would need to be under memory pressure. For example, we
can't hit the above case in selftests because we missed having tests
to merge skbs. (Added in later patch)

To fix the below splat extend the rcu_read_lock/unnlock block to
include the call to sk_psock_tls_verdict_apply(). This will fix both
TLS redirect case and non-TLS redirect+error case. Also remove
psock from the sk_psock_tls_verdict_apply() function signature its
not used there.

[ 1095.937597] WARNING: suspicious RCU usage
[ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G        W
[ 1095.944363] -----------------------------
[ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
[ 1095.950866]
[ 1095.950866] other info that might help us debug this:
[ 1095.950866]
[ 1095.957146]
[ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
[ 1095.961482] 1 lock held by test_sockmap/15970:
[ 1095.964501]  #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
[ 1095.968568]
[ 1095.968568] stack backtrace:
[ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G        W         5.7.0-rc7-02911-g463bac5f1ca79 #1
[ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1095.980519] Call Trace:
[ 1095.982191]  dump_stack+0x8f/0xd0
[ 1095.984040]  sk_psock_skb_redirect+0xa6/0xf0
[ 1095.986073]  sk_psock_tls_strp_read+0x1d8/0x250
[ 1095.988095]  tls_sw_recvmsg+0x714/0x840 [tls]

v2: Improve commit message to identify non-TLS redirect plus error case
    condition as well as more common TLS case. In the process I decided
    doing the rcu_read_unlock followed by the lock/unlock inside branches
    was unnecessarily complex. We can just extend the current rcu block
    and get the same effeective without the shuffling and branching.
    Thanks Martin!

Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 351afbf6bfba..c41ab6906b21 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -683,7 +683,7 @@ static struct sk_psock *sk_psock_from_strp(struct strparser *strp)
 	return container_of(parser, struct sk_psock, parser);
 }
 
-static void sk_psock_skb_redirect(struct sk_psock *psock, struct sk_buff *skb)
+static void sk_psock_skb_redirect(struct sk_buff *skb)
 {
 	struct sk_psock *psock_other;
 	struct sock *sk_other;
@@ -715,12 +715,11 @@ static void sk_psock_skb_redirect(struct sk_psock *psock, struct sk_buff *skb)
 	}
 }
 
-static void sk_psock_tls_verdict_apply(struct sk_psock *psock,
-				       struct sk_buff *skb, int verdict)
+static void sk_psock_tls_verdict_apply(struct sk_buff *skb, int verdict)
 {
 	switch (verdict) {
 	case __SK_REDIRECT:
-		sk_psock_skb_redirect(psock, skb);
+		sk_psock_skb_redirect(skb);
 		break;
 	case __SK_PASS:
 	case __SK_DROP:
@@ -741,8 +740,8 @@ int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
 		ret = sk_psock_bpf_run(psock, prog, skb);
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
 	}
+	sk_psock_tls_verdict_apply(skb, ret);
 	rcu_read_unlock();
-	sk_psock_tls_verdict_apply(psock, skb, ret);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);
@@ -770,7 +769,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 		}
 		goto out_free;
 	case __SK_REDIRECT:
-		sk_psock_skb_redirect(psock, skb);
+		sk_psock_skb_redirect(skb);
 		break;
 	case __SK_DROP:
 		/* fall-through */
@@ -794,8 +793,8 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 		ret = sk_psock_bpf_run(psock, prog, skb);
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
 	}
-	rcu_read_unlock();
 	sk_psock_verdict_apply(psock, skb, ret);
+	rcu_read_unlock();
 }
 
 static int sk_psock_strp_read_done(struct strparser *strp, int err)


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block
  2020-06-25 23:12 [bpf PATCH v2 0/3] Sockmap RCU splat fix John Fastabend
  2020-06-25 23:12 ` [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS John Fastabend
@ 2020-06-25 23:13 ` John Fastabend
  2020-06-26  6:13   ` Martin KaFai Lau
  2020-06-29  7:42   ` Jakub Sitnicki
  2020-06-25 23:13 ` [bpf PATCH v2 3/3] bpf, sockmap: Add ingres skb tests that utilize merge skbs John Fastabend
  2020-06-28 15:40 ` [bpf PATCH v2 0/3] Sockmap RCU splat fix Alexei Starovoitov
  3 siblings, 2 replies; 10+ messages in thread
From: John Fastabend @ 2020-06-25 23:13 UTC (permalink / raw)
  To: kafai, jakub, daniel, ast; +Cc: netdev, bpf, john.fastabend

If an ingress verdict program specifies message sizes greater than
skb->len and there is an ENOMEM error due to memory pressure we
may call the rcv_msg handler outside the strp_data_ready() caller
context. This is because on an ENOMEM error the strparser will
retry from a workqueue. The caller currently protects the use of
psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
block.

But, in above workqueue error case the psock is accessed outside
the read_lock/unlock block of the caller. So instead of using
psock directly we must do a look up against the sk again to
ensure the psock is available.

There is an an ugly piece here where we must handle
the case where we paused the strp and removed the psock. On
psock removal we first pause the strparser and then remove
the psock. If the strparser is paused while an skb is
scheduled on the workqueue the skb will be dropped on the
flow and kfree_skb() is called. If the workqueue manages
to get called before we pause the strparser but runs the rcvmsg
callback after the psock is removed we will hit the unlikely
case where we run the sockmap rcvmsg handler but do not have
a psock. For now we will follow strparser logic and drop the
skb on the floor with skb_kfree(). This is ugly because the
data is dropped. To date this has not caused problems in practice
because either the application controlling the sockmap is
coordinating with the datapath so that skbs are "flushed"
before removal or we simply wait for the sock to be closed before
removing it.

This patch fixes the describe RCU bug and dropping the skb doesn't
make things worse. Future patches will improve this by allowing
the normal case where skbs are not merged to skip the strparser
altogether. In practice many (most?) use cases have no need to
merge skbs so its both a code complexity hit as seen above and
a performance issue. For example, in the Cilium case we always
set the strparser up to return sbks 1:1 without any merging and
have avoided above issues.

Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index c41ab6906b21..6a32a1fd34f8 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -781,11 +781,18 @@ static void sk_psock_verdict_apply(struct sk_psock *psock,
 
 static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 {
-	struct sk_psock *psock = sk_psock_from_strp(strp);
+	struct sk_psock *psock;
 	struct bpf_prog *prog;
 	int ret = __SK_DROP;
+	struct sock *sk;
 
 	rcu_read_lock();
+	sk = strp->sk;
+	psock = sk_psock(sk);
+	if (unlikely(!psock)) {
+		kfree_skb(skb);
+		goto out;
+	}
 	prog = READ_ONCE(psock->progs.skb_verdict);
 	if (likely(prog)) {
 		skb_orphan(skb);
@@ -794,6 +801,7 @@ static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
 		ret = sk_psock_map_verd(ret, tcp_skb_bpf_redirect_fetch(skb));
 	}
 	sk_psock_verdict_apply(psock, skb, ret);
+out:
 	rcu_read_unlock();
 }
 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [bpf PATCH v2 3/3] bpf, sockmap: Add ingres skb tests that utilize merge skbs
  2020-06-25 23:12 [bpf PATCH v2 0/3] Sockmap RCU splat fix John Fastabend
  2020-06-25 23:12 ` [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS John Fastabend
  2020-06-25 23:13 ` [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block John Fastabend
@ 2020-06-25 23:13 ` John Fastabend
  2020-06-26  6:14   ` Martin KaFai Lau
  2020-06-28 15:40 ` [bpf PATCH v2 0/3] Sockmap RCU splat fix Alexei Starovoitov
  3 siblings, 1 reply; 10+ messages in thread
From: John Fastabend @ 2020-06-25 23:13 UTC (permalink / raw)
  To: kafai, jakub, daniel, ast; +Cc: netdev, bpf, john.fastabend

Add a test to check strparser merging skbs is working.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 .../selftests/bpf/progs/test_sockmap_kern.h        |    8 +++++++-
 tools/testing/selftests/bpf/test_sockmap.c         |   18 ++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_kern.h b/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
index 057036ca1111..3dca4c2e2418 100644
--- a/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
@@ -79,7 +79,7 @@ struct {
 
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
-	__uint(max_entries, 2);
+	__uint(max_entries, 3);
 	__type(key, int);
 	__type(value, int);
 } sock_skb_opts SEC(".maps");
@@ -94,6 +94,12 @@ struct {
 SEC("sk_skb1")
 int bpf_prog1(struct __sk_buff *skb)
 {
+	int *f, two = 2;
+
+	f = bpf_map_lookup_elem(&sock_skb_opts, &two);
+	if (f && *f) {
+		return *f;
+	}
 	return skb->len;
 }
 
diff --git a/tools/testing/selftests/bpf/test_sockmap.c b/tools/testing/selftests/bpf/test_sockmap.c
index 37695fc8096a..78789b27e573 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -85,6 +85,7 @@ int txmsg_ktls_skb_drop;
 int txmsg_ktls_skb_redir;
 int ktls;
 int peek_flag;
+int skb_use_parser;
 
 static const struct option long_options[] = {
 	{"help",	no_argument,		NULL, 'h' },
@@ -174,6 +175,7 @@ static void test_reset(void)
 	txmsg_apply = txmsg_cork = 0;
 	txmsg_ingress = txmsg_redir_skb = 0;
 	txmsg_ktls_skb = txmsg_ktls_skb_drop = txmsg_ktls_skb_redir = 0;
+	skb_use_parser = 0;
 }
 
 static int test_start_subtest(const struct _test *t, struct sockmap_options *o)
@@ -1211,6 +1213,11 @@ static int run_options(struct sockmap_options *options, int cg_fd,  int test)
 		}
 	}
 
+	if (skb_use_parser) {
+		i = 2;
+		err = bpf_map_update_elem(map_fd[7], &i, &skb_use_parser, BPF_ANY);
+	}
+
 	if (txmsg_drop)
 		options->drop_expected = true;
 
@@ -1650,6 +1657,16 @@ static void test_txmsg_cork(int cgrp, struct sockmap_options *opt)
 	test_send(opt, cgrp);
 }
 
+static void test_txmsg_ingress_parser(int cgrp, struct sockmap_options *opt)
+{
+	txmsg_pass = 1;
+	skb_use_parser = 512;
+	opt->iov_length = 256;
+	opt->iov_count = 1;
+	opt->rate = 2;
+	test_exec(cgrp, opt);
+}
+
 char *map_names[] = {
 	"sock_map",
 	"sock_map_txmsg",
@@ -1748,6 +1765,7 @@ struct _test test[] = {
 	{"txmsg test pull-data", test_txmsg_pull},
 	{"txmsg test pop-data", test_txmsg_pop},
 	{"txmsg test push/pop data", test_txmsg_push_pop},
+	{"txmsg text ingress parser", test_txmsg_ingress_parser},
 };
 
 static int check_whitelist(struct _test *t, struct sockmap_options *opt)


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS
  2020-06-25 23:12 ` [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS John Fastabend
@ 2020-06-26  6:10   ` Martin KaFai Lau
  2020-06-26 11:11   ` Jakub Sitnicki
  1 sibling, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2020-06-26  6:10 UTC (permalink / raw)
  To: John Fastabend; +Cc: jakub, daniel, ast, netdev, bpf

On Thu, Jun 25, 2020 at 04:12:59PM -0700, John Fastabend wrote:
> There are two paths to generate the below RCU splat the first and
> most obvious is the result of the BPF verdict program issuing a
> redirect on a TLS socket (This is the splat shown below). Unlike
> the non-TLS case the caller of the *strp_read() hooks does not
> wrap the call in a rcu_read_lock/unlock. Then if the BPF program
> issues a redirect action we hit the RCU splat.
> 
> However, in the non-TLS socket case the splat appears to be
> relatively rare, because the skmsg caller into the strp_data_ready()
> is wrapped in a rcu_read_lock/unlock. Shown here,
> 
>  static void sk_psock_strp_data_ready(struct sock *sk)
>  {
> 	struct sk_psock *psock;
> 
> 	rcu_read_lock();
> 	psock = sk_psock(sk);
> 	if (likely(psock)) {
> 		if (tls_sw_has_ctx_rx(sk)) {
> 			psock->parser.saved_data_ready(sk);
> 		} else {
> 			write_lock_bh(&sk->sk_callback_lock);
> 			strp_data_ready(&psock->parser.strp);
> 			write_unlock_bh(&sk->sk_callback_lock);
> 		}
> 	}
> 	rcu_read_unlock();
>  }
> 
> If the above was the only way to run the verdict program we
> would be safe. But, there is a case where the strparser may throw an
> ENOMEM error while parsing the skb. This is a result of a failed
> skb_clone, or alloc_skb_for_msg while building a new merged skb when
> the msg length needed spans multiple skbs. This will in turn put the
> skb on the strp_wrk workqueue in the strparser code. The skb will
> later be dequeued and verdict programs run, but now from a
> different context without the rcu_read_lock()/unlock() critical
> section in sk_psock_strp_data_ready() shown above. In practice
> I have not seen this yet, because as far as I know most users of the
> verdict programs are also only working on single skbs. In this case no
> merge happens which could trigger the above ENOMEM errors. In addition
> the system would need to be under memory pressure. For example, we
> can't hit the above case in selftests because we missed having tests
> to merge skbs. (Added in later patch)
> 
> To fix the below splat extend the rcu_read_lock/unnlock block to
> include the call to sk_psock_tls_verdict_apply(). This will fix both
> TLS redirect case and non-TLS redirect+error case. Also remove
> psock from the sk_psock_tls_verdict_apply() function signature its
> not used there.
Acked-by: Martin KaFai Lau <kafai@fb.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block
  2020-06-25 23:13 ` [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block John Fastabend
@ 2020-06-26  6:13   ` Martin KaFai Lau
  2020-06-29  7:42   ` Jakub Sitnicki
  1 sibling, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2020-06-26  6:13 UTC (permalink / raw)
  To: John Fastabend; +Cc: jakub, daniel, ast, netdev, bpf

On Thu, Jun 25, 2020 at 04:13:18PM -0700, John Fastabend wrote:
> If an ingress verdict program specifies message sizes greater than
> skb->len and there is an ENOMEM error due to memory pressure we
> may call the rcv_msg handler outside the strp_data_ready() caller
> context. This is because on an ENOMEM error the strparser will
> retry from a workqueue. The caller currently protects the use of
> psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
> block.
> 
> But, in above workqueue error case the psock is accessed outside
> the read_lock/unlock block of the caller. So instead of using
> psock directly we must do a look up against the sk again to
> ensure the psock is available.
> 
> There is an an ugly piece here where we must handle
> the case where we paused the strp and removed the psock. On
> psock removal we first pause the strparser and then remove
> the psock. If the strparser is paused while an skb is
> scheduled on the workqueue the skb will be dropped on the
> flow and kfree_skb() is called. If the workqueue manages
> to get called before we pause the strparser but runs the rcvmsg
> callback after the psock is removed we will hit the unlikely
> case where we run the sockmap rcvmsg handler but do not have
> a psock. For now we will follow strparser logic and drop the
> skb on the floor with skb_kfree(). This is ugly because the
> data is dropped. To date this has not caused problems in practice
> because either the application controlling the sockmap is
> coordinating with the datapath so that skbs are "flushed"
> before removal or we simply wait for the sock to be closed before
> removing it.
> 
> This patch fixes the describe RCU bug and dropping the skb doesn't
> make things worse. Future patches will improve this by allowing
> the normal case where skbs are not merged to skip the strparser
> altogether. In practice many (most?) use cases have no need to
> merge skbs so its both a code complexity hit as seen above and
> a performance issue. For example, in the Cilium case we always
> set the strparser up to return sbks 1:1 without any merging and
> have avoided above issues.
Thanks for the details explanation.  I have to admit that I cannot
fully comprehend the concurrency situation in skmsg and psock.
The change makes sense to me after reading the description though.

Acked-by: Martin KaFai Lau <kafai@fb.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bpf PATCH v2 3/3] bpf, sockmap: Add ingres skb tests that utilize merge skbs
  2020-06-25 23:13 ` [bpf PATCH v2 3/3] bpf, sockmap: Add ingres skb tests that utilize merge skbs John Fastabend
@ 2020-06-26  6:14   ` Martin KaFai Lau
  0 siblings, 0 replies; 10+ messages in thread
From: Martin KaFai Lau @ 2020-06-26  6:14 UTC (permalink / raw)
  To: John Fastabend; +Cc: jakub, daniel, ast, netdev, bpf

On Thu, Jun 25, 2020 at 04:13:38PM -0700, John Fastabend wrote:
> Add a test to check strparser merging skbs is working.
Acked-by: Martin KaFai Lau <kafai@fb.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS
  2020-06-25 23:12 ` [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS John Fastabend
  2020-06-26  6:10   ` Martin KaFai Lau
@ 2020-06-26 11:11   ` Jakub Sitnicki
  1 sibling, 0 replies; 10+ messages in thread
From: Jakub Sitnicki @ 2020-06-26 11:11 UTC (permalink / raw)
  To: John Fastabend; +Cc: kafai, daniel, ast, netdev, bpf

On Fri, Jun 26, 2020 at 01:12 AM CEST, John Fastabend wrote:
> There are two paths to generate the below RCU splat the first and
> most obvious is the result of the BPF verdict program issuing a
> redirect on a TLS socket (This is the splat shown below). Unlike
> the non-TLS case the caller of the *strp_read() hooks does not
> wrap the call in a rcu_read_lock/unlock. Then if the BPF program
> issues a redirect action we hit the RCU splat.
>
> However, in the non-TLS socket case the splat appears to be
> relatively rare, because the skmsg caller into the strp_data_ready()
> is wrapped in a rcu_read_lock/unlock. Shown here,
>
>  static void sk_psock_strp_data_ready(struct sock *sk)
>  {
> 	struct sk_psock *psock;
>
> 	rcu_read_lock();
> 	psock = sk_psock(sk);
> 	if (likely(psock)) {
> 		if (tls_sw_has_ctx_rx(sk)) {
> 			psock->parser.saved_data_ready(sk);
> 		} else {
> 			write_lock_bh(&sk->sk_callback_lock);
> 			strp_data_ready(&psock->parser.strp);
> 			write_unlock_bh(&sk->sk_callback_lock);
> 		}
> 	}
> 	rcu_read_unlock();
>  }
>
> If the above was the only way to run the verdict program we
> would be safe. But, there is a case where the strparser may throw an
> ENOMEM error while parsing the skb. This is a result of a failed
> skb_clone, or alloc_skb_for_msg while building a new merged skb when
> the msg length needed spans multiple skbs. This will in turn put the
> skb on the strp_wrk workqueue in the strparser code. The skb will
> later be dequeued and verdict programs run, but now from a
> different context without the rcu_read_lock()/unlock() critical
> section in sk_psock_strp_data_ready() shown above. In practice
> I have not seen this yet, because as far as I know most users of the
> verdict programs are also only working on single skbs. In this case no
> merge happens which could trigger the above ENOMEM errors. In addition
> the system would need to be under memory pressure. For example, we
> can't hit the above case in selftests because we missed having tests
> to merge skbs. (Added in later patch)
>
> To fix the below splat extend the rcu_read_lock/unnlock block to
> include the call to sk_psock_tls_verdict_apply(). This will fix both
> TLS redirect case and non-TLS redirect+error case. Also remove
> psock from the sk_psock_tls_verdict_apply() function signature its
> not used there.
>
> [ 1095.937597] WARNING: suspicious RCU usage
> [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G        W
> [ 1095.944363] -----------------------------
> [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
> [ 1095.950866]
> [ 1095.950866] other info that might help us debug this:
> [ 1095.950866]
> [ 1095.957146]
> [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
> [ 1095.961482] 1 lock held by test_sockmap/15970:
> [ 1095.964501]  #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
> [ 1095.968568]
> [ 1095.968568] stack backtrace:
> [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G        W         5.7.0-rc7-02911-g463bac5f1ca79 #1
> [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [ 1095.980519] Call Trace:
> [ 1095.982191]  dump_stack+0x8f/0xd0
> [ 1095.984040]  sk_psock_skb_redirect+0xa6/0xf0
> [ 1095.986073]  sk_psock_tls_strp_read+0x1d8/0x250
> [ 1095.988095]  tls_sw_recvmsg+0x714/0x840 [tls]
>
> v2: Improve commit message to identify non-TLS redirect plus error case
>     condition as well as more common TLS case. In the process I decided
>     doing the rcu_read_unlock followed by the lock/unlock inside branches
>     was unnecessarily complex. We can just extend the current rcu block
>     and get the same effeective without the shuffling and branching.
>     Thanks Martin!
>
> Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
> Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Thanks for the detailed explanation.

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>

[...]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bpf PATCH v2 0/3] Sockmap RCU splat fix
  2020-06-25 23:12 [bpf PATCH v2 0/3] Sockmap RCU splat fix John Fastabend
                   ` (2 preceding siblings ...)
  2020-06-25 23:13 ` [bpf PATCH v2 3/3] bpf, sockmap: Add ingres skb tests that utilize merge skbs John Fastabend
@ 2020-06-28 15:40 ` Alexei Starovoitov
  3 siblings, 0 replies; 10+ messages in thread
From: Alexei Starovoitov @ 2020-06-28 15:40 UTC (permalink / raw)
  To: John Fastabend
  Cc: Martin KaFai Lau, Jakub Sitnicki, Daniel Borkmann,
	Alexei Starovoitov, Network Development, bpf

On Thu, Jun 25, 2020 at 4:12 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Fix a splat introduced by recent changes to avoid skipping ingress policy
> when kTLS is enabled. The RCU splat was introduced because in the non-TLS
> case the caller is wrapped in an rcu_read_lock/unlock. But, in the TLS
> case we have a reference to the psock and the caller did not wrap its
> call in rcu_read_lock/unlock.
>
> To fix extend the RCU section to include the redirect case which was
> missed. From v1->v2 I changed the location a bit to simplify the code
> some. See patch 1.
>
> But, then Martin asked why it was not needed in the non-TLS case. The
> answer for patch 1 was, as stated above, because the caller has the
> rcu read lock. However, there was still a missing case where a BPF
> user could in-theory line up a set of parameters to hit a case
> where the code was entered from strparser side from a different context
> then the initial caller. To hit this user would need a parser program
> to return value greater than skb->len then an ENOMEM error could happen
> in the strparser codepath triggering strparser to retry from a workqueue
> and without rcu_read_lock original caller used. See patch 2 for details.
>
> Finally, we don't actually have any selftests for parser returning a
> value geater than skb->len so add one in patch 3. This is especially
> needed because at least I don't have any code that uses the parser
> to return value greater than skb->len. So I wouldn't have caught any
> errors here in my own testing.
>
> Thanks, John
>
> v1->v2: simplify code in patch 1 some and add patches 2 and 3.

Applied. Thanks

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block
  2020-06-25 23:13 ` [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block John Fastabend
  2020-06-26  6:13   ` Martin KaFai Lau
@ 2020-06-29  7:42   ` Jakub Sitnicki
  1 sibling, 0 replies; 10+ messages in thread
From: Jakub Sitnicki @ 2020-06-29  7:42 UTC (permalink / raw)
  To: John Fastabend; +Cc: kafai, daniel, ast, netdev, bpf

On Fri, Jun 26, 2020 at 01:13 AM CEST, John Fastabend wrote:
> If an ingress verdict program specifies message sizes greater than
> skb->len and there is an ENOMEM error due to memory pressure we
> may call the rcv_msg handler outside the strp_data_ready() caller
> context. This is because on an ENOMEM error the strparser will
> retry from a workqueue. The caller currently protects the use of
> psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
> block.
>
> But, in above workqueue error case the psock is accessed outside
> the read_lock/unlock block of the caller. So instead of using
> psock directly we must do a look up against the sk again to
> ensure the psock is available.
>
> There is an an ugly piece here where we must handle
> the case where we paused the strp and removed the psock. On
> psock removal we first pause the strparser and then remove
> the psock. If the strparser is paused while an skb is
> scheduled on the workqueue the skb will be dropped on the
> flow and kfree_skb() is called. If the workqueue manages
> to get called before we pause the strparser but runs the rcvmsg
> callback after the psock is removed we will hit the unlikely
> case where we run the sockmap rcvmsg handler but do not have
> a psock. For now we will follow strparser logic and drop the
> skb on the floor with skb_kfree(). This is ugly because the
> data is dropped. To date this has not caused problems in practice
> because either the application controlling the sockmap is
> coordinating with the datapath so that skbs are "flushed"
> before removal or we simply wait for the sock to be closed before
> removing it.
>
> This patch fixes the describe RCU bug and dropping the skb doesn't
> make things worse. Future patches will improve this by allowing
> the normal case where skbs are not merged to skip the strparser
> altogether. In practice many (most?) use cases have no need to
> merge skbs so its both a code complexity hit as seen above and
> a performance issue. For example, in the Cilium case we always
> set the strparser up to return sbks 1:1 without any merging and
> have avoided above issues.
>
> Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

LGTM. Sorry for the delay, needed to make sure I understand this.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, back to index

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-25 23:12 [bpf PATCH v2 0/3] Sockmap RCU splat fix John Fastabend
2020-06-25 23:12 ` [bpf PATCH v2 1/3] bpf, sockmap: RCU splat with redirect and strparser error or TLS John Fastabend
2020-06-26  6:10   ` Martin KaFai Lau
2020-06-26 11:11   ` Jakub Sitnicki
2020-06-25 23:13 ` [bpf PATCH v2 2/3] bpf, sockmap: RCU dereferenced psock may be used outside RCU block John Fastabend
2020-06-26  6:13   ` Martin KaFai Lau
2020-06-29  7:42   ` Jakub Sitnicki
2020-06-25 23:13 ` [bpf PATCH v2 3/3] bpf, sockmap: Add ingres skb tests that utilize merge skbs John Fastabend
2020-06-26  6:14   ` Martin KaFai Lau
2020-06-28 15:40 ` [bpf PATCH v2 0/3] Sockmap RCU splat fix Alexei Starovoitov

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git