* [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs
@ 2020-01-11  6:11 John Fastabend
  2020-01-11  6:11 ` [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:11 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

To date our usage of sockmap/tls has been fairly simple: the BPF programs
did only well-defined pop, push, pull and apply/cork operations.

Now that we have started to push more complex programs into sockmap we
have uncovered a series of issues, addressed here. Further, the OpenSSL
3.0 release with kTLS support should be out soon, so it is important to
get any remaining issues with BPF and kTLS resolved.

Additionally, I have a patch under development to allow sockmap to be
enabled/disabled at runtime for Cilium endpoints. This allows us to stress
the map insert/delete with kTLS more than before; previously Cilium only
added the socket to the map when it entered ESTABLISHED state and never
touched it from the control-path side again, relying on the socket's own
close() hook to remove it.

To test this I have a set of test cases in test_sockmap.c that expose
these issues. Once the fixes here are merged and in bpf-next I'll submit
the tests to the bpf-next tree to ensure we don't regress again. I've also
run these patches in the Cilium CI with OpenSSL (master branch), which
runs tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
testing.

I'm aware of two more issues that we are working to resolve in another
couple (probably two) patches. First, we see an auth tag corruption in
kTLS when sending small 1-byte chunks under stress. I've not pinned this
down yet, but because it shows up under the 1-byte stress tests my guess
is that some error path is being triggered. Second, we need to ensure BPF
RX programs are not skipped when the kTLS ULP is loaded. This breaks some
of the sockmap selftests when running with kTLS. I'll send a follow-up
for this.

v2: I dropped a patch that added a !0 size check in tls_push_record.
    It originated from a panic I caught a while ago with a trace in the
    crypto stack, but I can no longer reproduce it, so I will dig into
    that and send another patch later if needed. Anyway, after a bit of
    thought it would be nicer if tls/crypto/bpf didn't require
    special-case handling for the !0 size.

John Fastabend (8):
  bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
  bpf: sockmap, ensure sock lock held during tear down
  bpf: sockmap/tls, push write_space updates through ulp updates
  bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds
  bpf: sockmap/tls, msg_push_data may leave end mark in place
  bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
  bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra
    chaining
  bpf: sockmap/tls, fix pop data with SK_DROP return code

 include/linux/skmsg.h | 13 +++++++++----
 include/net/tcp.h     |  6 ++++--
 net/core/filter.c     | 11 ++++++-----
 net/core/skmsg.c      |  2 ++
 net/core/sock_map.c   |  7 ++++++-
 net/ipv4/tcp_bpf.c    |  5 +----
 net/ipv4/tcp_ulp.c    |  6 ++++--
 net/tls/tls_main.c    | 10 +++++++---
 net/tls/tls_sw.c      | 31 +++++++++++++++++++++++++++----
 9 files changed, 66 insertions(+), 25 deletions(-)

-- 
2.17.1



* [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
@ 2020-01-11  6:11 ` John Fastabend
  2020-01-13 13:29   ` Jakub Sitnicki
  2020-01-11  6:12 ` [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down John Fastabend
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:11 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

When a sockmap is free'd and a socket in the map is enabled with tls
we tear down the bpf context on the socket, the psock struct and state,
and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
the tls stack it needs to update its saved sock ops so that when the tls
socket is later destroyed it doesn't try to call the now destroyed psock
hooks.

This is about keeping stacked ULPs in good shape so they always have
the right set of stacked ops.

However, the unhash() hook was recently removed from the TLS side. But
the sockmap/bpf side is not doing any extra work to update the unhash op
when it is torn down, instead expecting the TLS side to manage it. So
both TLS and sockmap believe the other side is managing the op, and in
the end no one updates the hook, so it continues to point at
tcp_bpf_unhash(). When the unhash hook is called we call tcp_bpf_unhash(),
which detects that the psock has already been destroyed and calls
sk->sk_prot->unhash(), which calls tcp_bpf_unhash() yet again, and so on,
looping and hanging the core.

To fix this, have the sockmap tear-down logic fix up the stale pointer.
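
To make the failure mode concrete, here is a minimal userspace sketch
(toy types and names, not the kernel structures): unless the saved
unhash op is written back during tear down, sk_prot->unhash keeps
pointing at tcp_bpf_unhash() and the call recurses forever.

 /* Toy model of the stale-pointer loop; names are illustrative only. */
 #include <stdio.h>

 struct toy_sock;

 struct toy_proto {
 	void (*unhash)(struct toy_sock *sk);
 };

 struct toy_sock {
 	struct toy_proto *sk_prot;
 	void *psock;			/* NULL once the psock is dropped */
 	void (*saved_unhash)(struct toy_sock *sk);
 };

 static void toy_tcp_unhash(struct toy_sock *sk)
 {
 	(void)sk;
 	printf("plain tcp unhash, done\n");
 }

 static void toy_tcp_bpf_unhash(struct toy_sock *sk)
 {
 	if (!sk->psock)			/* psock already destroyed */
 		sk->sk_prot->unhash(sk);/* loops unless the op was restored */
 }

 int main(void)
 {
 	struct toy_proto p = { .unhash = toy_tcp_bpf_unhash };
 	struct toy_sock sk = { .sk_prot = &p, .psock = NULL,
 			       .saved_unhash = toy_tcp_unhash };

 	/* The fix: restore the saved op during sockmap tear down... */
 	sk.sk_prot->unhash = sk.saved_unhash;
 	/* ...so a later unhash call terminates instead of recursing. */
 	toy_tcp_bpf_unhash(&sk);
 	return 0;
 }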

Fixes: 5d92e631b8be ("net/tls: partially revert fix transition through disconnect with close")
Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index ef7031f8a304..b6afe01f8592 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -358,6 +358,7 @@ static inline void sk_psock_update_proto(struct sock *sk,
 static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
+	sk->sk_prot->unhash = psock->saved_unhash;
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
-- 
2.17.1



* [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
  2020-01-11  6:11 ` [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-02-05 17:55   ` Jakub Sitnicki
  2020-01-11  6:12 ` [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy psock and bpf state associated
with the socks in the map. When done the socks no longer have BPF programs
attached and will function normally. This can happen while the socks in
the map are still "live" meaning data may be sent/received during the walk.

Currently, though, we don't take the sock lock when the psock and bpf
state is removed through this path. Specifically, this means we can be
writing into the ops structure pointers such as sendmsg, sendpage,
recvmsg, etc. while they are also being called from the networking side.
This is not safe; we never used proper READ_ONCE/WRITE_ONCE semantics
here, which we would have needed if we believed it was safe. Further,
it's not clear to me it's even a good idea to try to do this on "live"
sockets while the networking side might also be using the socket.
Instead of trying to reason about using the socks from both sides, let's
recognize that every use case I'm aware of rarely deletes maps; in fact
the kubernetes/Cilium case builds the map at init and never tears it
down except on errors. So let's do the simple fix and grab the sock
lock.

This patch wraps sock deletes from maps in the sock lock and adds some
annotations so we catch any other cases more easily.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Cc: stable@vger.kernel.org
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c    | 2 ++
 net/core/sock_map.c | 7 ++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ded2d5227678..3866d7e20c07 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
+	sock_owned_by_me(sk);
+
 	sk_psock_cork_free(psock);
 	sk_psock_zap_ingress(psock);
 
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index eb114ee419b6..8998e356f423 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
 		struct sock *sk;
 
 		sk = xchg(psk, NULL);
-		if (sk)
+		if (sk) {
+			lock_sock(sk);
 			sock_map_unref(sk, psk);
+			release_sock(sk);
+		}
 	}
 	raw_spin_unlock_bh(&stab->lock);
 	rcu_read_unlock();
@@ -862,7 +865,9 @@ static void sock_hash_free(struct bpf_map *map)
 		raw_spin_lock_bh(&bucket->lock);
 		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
 			hlist_del_rcu(&elem->node);
+			lock_sock(elem->sk);
 			sock_map_unref(elem->sk, elem);
+			release_sock(elem->sk);
 		}
 		raw_spin_unlock_bh(&bucket->lock);
 	}
-- 
2.17.1



* [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
  2020-01-11  6:11 ` [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
  2020-01-11  6:12 ` [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-01-12  1:00   ` Jonathan Lemon
  2020-01-13 13:59   ` Jakub Sitnicki
  2020-01-11  6:12 ` [bpf PATCH v2 4/8] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

When a sockmap sock with TLS enabled is removed, we clean up the
bpf/psock state and call tcp_update_ulp() to push updates to the TLS ULP
on top. However, we don't push the write_space callback up and instead
simply overwrite the op with the previous op stored by the psock. This
may or may not be correct, so to ensure we don't overwrite the TLS
write_space hook, pass this field to the ULP and have it fix up the ctx.

This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
the write_space TLS hook was added around the same time.

Cc: stable@vger.kernel.org
Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h | 12 ++++++++----
 include/net/tcp.h     |  6 ++++--
 net/ipv4/tcp_ulp.c    |  6 ++++--
 net/tls/tls_main.c    | 10 +++++++---
 4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index b6afe01f8592..14d61bba0b79 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -359,17 +359,21 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
 	sk->sk_prot->unhash = psock->saved_unhash;
-	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
 		struct inet_connection_sock *icsk = inet_csk(sk);
 		bool has_ulp = !!icsk->icsk_ulp_data;
 
-		if (has_ulp)
-			tcp_update_ulp(sk, psock->sk_proto);
-		else
+		if (has_ulp) {
+			tcp_update_ulp(sk, psock->sk_proto,
+				       psock->saved_write_space);
+		} else {
 			sk->sk_prot = psock->sk_proto;
+			sk->sk_write_space = psock->saved_write_space;
+		}
 		psock->sk_proto = NULL;
+	} else {
+		sk->sk_write_space = psock->saved_write_space;
 	}
 }
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e460ea7f767b..e6f48384dc71 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2147,7 +2147,8 @@ struct tcp_ulp_ops {
 	/* initialize ulp */
 	int (*init)(struct sock *sk);
 	/* update ulp */
-	void (*update)(struct sock *sk, struct proto *p);
+	void (*update)(struct sock *sk, struct proto *p,
+		       void (*write_space)(struct sock *sk));
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
 	/* diagnostic */
@@ -2162,7 +2163,8 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
-void tcp_update_ulp(struct sock *sk, struct proto *p);
+void tcp_update_ulp(struct sock *sk, struct proto *p,
+		    void (*write_space)(struct sock *sk));
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 12ab5db2b71c..38d3ad141161 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -99,17 +99,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
 	rcu_read_unlock();
 }
 
-void tcp_update_ulp(struct sock *sk, struct proto *proto)
+void tcp_update_ulp(struct sock *sk, struct proto *proto,
+		    void (*write_space)(struct sock *sk))
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	if (!icsk->icsk_ulp_ops) {
+		sk->sk_write_space = write_space;
 		sk->sk_prot = proto;
 		return;
 	}
 
 	if (icsk->icsk_ulp_ops->update)
-		icsk->icsk_ulp_ops->update(sk, proto);
+		icsk->icsk_ulp_ops->update(sk, proto, write_space);
 }
 
 void tcp_cleanup_ulp(struct sock *sk)
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index dac24c7aa7d4..94774c0e5ff3 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -732,15 +732,19 @@ static int tls_init(struct sock *sk)
 	return rc;
 }
 
-static void tls_update(struct sock *sk, struct proto *p)
+static void tls_update(struct sock *sk, struct proto *p,
+		       void (*write_space)(struct sock *sk))
 {
 	struct tls_context *ctx;
 
 	ctx = tls_get_ctx(sk);
-	if (likely(ctx))
+	if (likely(ctx)) {
+		ctx->sk_write_space = write_space;
 		ctx->sk_proto = p;
-	else
+	} else {
 		sk->sk_prot = p;
+		sk->sk_write_space = write_space;
+	}
 }
 
 static int tls_get_info(const struct sock *sk, struct sk_buff *skb)
-- 
2.17.1



* [bpf PATCH v2 4/8] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (2 preceding siblings ...)
  2020-01-11  6:12 ` [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-01-11  6:12 ` [bpf PATCH v2 5/8] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

In the push, pull, and pop helpers operating on skmsg objects, to make
data writable or to insert/remove data, we use this bounds check to
ensure the specified data is valid:

 /* Bounds checks: start and pop must be inside message */
 if (start >= offset + l || last >= msg->sg.size)
     return -EINVAL;

The problem here is that offset already includes the length of the
current element, the 'l' above. So start can be past the end of the
message and still pass the check when it falls within one element's
length past the last skmsg element, because that element's length ends
up counted twice.

To fix this, do the accounting slightly differently: add the length of
the previous entry to offset at the start of each iteration, and ensure
the length is initialized to zero so that the first iteration adds
nothing.
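
As a concrete illustration, here is a minimal userspace sketch of the
two accounting schemes (not the kernel loop itself), assuming a single
10-byte element and start = 15: the old accounting lets the bounds
check pass, the fixed accounting rejects it.

 #include <stdio.h>

 int main(void)
 {
 	unsigned int lens[] = { 10 };	/* one scatterlist element */
 	unsigned int n = 1, start = 15;
 	unsigned int i, offset, len;

 	/* Old accounting: offset already includes the current element
 	 * when the loop falls through without a break.
 	 */
 	for (i = 0, offset = 0, len = 0; i < n; i++) {
 		len = lens[i];
 		if (start < offset + len)
 			break;
 		offset += len;
 	}
 	printf("old check: %u >= %u -> %s\n", start, offset + len,
 	       start >= offset + len ? "reject" : "accept (bug)");

 	/* Fixed accounting: add the previous element's length on entry,
 	 * starting from zero.
 	 */
 	for (i = 0, offset = 0, len = 0; i < n; i++) {
 		offset += len;
 		len = lens[i];
 		if (start < offset + len)
 			break;
 	}
 	printf("new check: %u >= %u -> %s\n", start, offset + len,
 	       start >= offset + len ? "reject" : "accept");
 	return 0;
 }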

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
CC: stable@vger.kernel.org
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/filter.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d22d108fc6e3..ffa2278020d7 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2231,10 +2231,10 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start,
 	/* First find the starting scatterlist element */
 	i = msg->sg.start;
 	do {
+		offset += len;
 		len = sk_msg_elem(msg, i)->length;
 		if (start < offset + len)
 			break;
-		offset += len;
 		sk_msg_iter_var_next(i);
 	} while (i != msg->sg.end);
 
@@ -2346,7 +2346,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 	   u32, len, u64, flags)
 {
 	struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
-	u32 new, i = 0, l, space, copy = 0, offset = 0;
+	u32 new, i = 0, l = 0, space, copy = 0, offset = 0;
 	u8 *raw, *to, *from;
 	struct page *page;
 
@@ -2356,11 +2356,11 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 	/* First find the starting scatterlist element */
 	i = msg->sg.start;
 	do {
+		offset += l;
 		l = sk_msg_elem(msg, i)->length;
 
 		if (start < offset + l)
 			break;
-		offset += l;
 		sk_msg_iter_var_next(i);
 	} while (i != msg->sg.end);
 
@@ -2506,7 +2506,7 @@ static void sk_msg_shift_right(struct sk_msg *msg, int i)
 BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
 	   u32, len, u64, flags)
 {
-	u32 i = 0, l, space, offset = 0;
+	u32 i = 0, l = 0, space, offset = 0;
 	u64 last = start + len;
 	int pop;
 
@@ -2516,11 +2516,11 @@ BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
 	/* First find the starting scatterlist element */
 	i = msg->sg.start;
 	do {
+		offset += l;
 		l = sk_msg_elem(msg, i)->length;
 
 		if (start < offset + l)
 			break;
-		offset += l;
 		sk_msg_iter_var_next(i);
 	} while (i != msg->sg.end);
 
-- 
2.17.1



* [bpf PATCH v2 5/8] bpf: sockmap/tls, msg_push_data may leave end mark in place
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (3 preceding siblings ...)
  2020-01-11  6:12 ` [bpf PATCH v2 4/8] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-01-11  6:12 ` [bpf PATCH v2 6/8] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

Leaving an incorrect end mark in place when passing data to the
crypto layer will cause the crypto layer to stop processing before
all of the data is encrypted. To fix this, clear the end mark on push
data instead of expecting users of the helper to clear the mark value
after the fact.

This happens when we push data into the middle of an skmsg and have
room for it, so we don't do the set of copies that would already
clear the end flag.
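
For illustration, here is a toy model (a hypothetical element layout,
not the scatterlist API) of why a stale end mark left on the
re-inserted remainder truncates the walk:

 #include <stdio.h>

 struct toy_elem {
 	int len;
 	int end;	/* models the scatterlist end mark */
 };

 static int walk(const struct toy_elem *e, int n)
 {
 	int i, total = 0;

 	for (i = 0; i < n; i++) {
 		total += e[i].len;
 		if (e[i].end)		/* the crypto walk stops here */
 			break;
 	}
 	return total;
 }

 int main(void)
 {
 	/* 60 + 20 + 40 bytes; the middle element is the remainder that
 	 * was put back in place and still carries a stale end mark.
 	 */
 	struct toy_elem msg[] = {
 		{ .len = 60 },
 		{ .len = 20, .end = 1 },
 		{ .len = 40, .end = 1 },
 	};

 	printf("with stale end mark: %d bytes processed\n", walk(msg, 3));

 	msg[1].end = 0;		/* what the added sg_unmark_end() does */
 	printf("after clearing it:   %d bytes processed\n", walk(msg, 3));
 	return 0;
 }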

Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
Cc: stable@vger.kernel.org
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/filter.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index ffa2278020d7..538f6a735a19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2415,6 +2415,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 
 		sk_msg_iter_var_next(i);
 		sg_unmark_end(psge);
+		sg_unmark_end(&rsge);
 		sk_msg_iter_next(msg, end);
 	}
 
-- 
2.17.1



* [bpf PATCH v2 6/8] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (4 preceding siblings ...)
  2020-01-11  6:12 ` [bpf PATCH v2 5/8] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-01-11  6:12 ` [bpf PATCH v2 7/8] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

It is possible, using the push helper, to build a plaintext buffer that
is larger than the allocated encrypt buffer. When this record is pushed
to the crypto layer this can result in a NULL pointer dereference,
because the crypto API expects the encrypt buffer to be large enough to
fit the plaintext buffer. Kernel splat below.

To resolve this, catch the cases where this can happen and split the
buffer into two records to send individually. Unfortunately, there is
still one case to handle, where the split creates a zero-sized buffer.
In this case we merge the buffers and unmark the split. This happens
when apply is zero and the user pushed data beyond the encrypt buffer.
This fixes the original case as well, because the split allocated an
encrypt buffer larger than the plaintext buffer and the merge simply
moves the pointers around so we now have a reference to the new (larger)
encrypt buffer.

Perhaps it's not ideal, but it seems the best solution for a fixes
branch and it avoids handling the two cases, (a) an apply that needs a
split and (b) the non-apply case, separately. They are edge cases
anyway, so optimizing them seems unnecessary unless someone wants to do
it later in the -next branches.

[  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]
[  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
[...]
[  306.770350] Call Trace:
[  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
[  306.772026]  gcm_enc_copy_hash+0x4b/0x50
[  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
[  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
[  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
[  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
[  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
[  306.778239]  gcm_hash_init_continue+0x8f/0xa0
[  306.779121]  gcm_hash+0x73/0x80
[  306.779762]  gcm_encrypt_continue+0x6d/0x80
[  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
[  306.781474]  crypto_aead_encrypt+0x1f/0x30
[  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
[  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
[  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
[  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]

test_sockmap test signature to trigger bug,

[TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):

Cc: stable@vger.kernel.org
Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/tls/tls_sw.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index c6803a82b769..31f6bbbc8992 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -682,12 +682,32 @@ static int tls_push_record(struct sock *sk, int flags,
 
 	split_point = msg_pl->apply_bytes;
 	split = split_point && split_point < msg_pl->sg.size;
+	if (unlikely((!split &&
+		      msg_pl->sg.size +
+		      prot->overhead_size > msg_en->sg.size) ||
+		     (split &&
+		      split_point +
+		      prot->overhead_size > msg_en->sg.size))) {
+		split = true;
+		split_point = msg_en->sg.size;
+	}
 	if (split) {
 		rc = tls_split_open_record(sk, rec, &tmp, msg_pl, msg_en,
 					   split_point, prot->overhead_size,
 					   &orig_end);
 		if (rc < 0)
 			return rc;
+		/* This can happen if above tls_split_open_record allocates
+		 * a single large encryption buffer instead of two smaller
+		 * ones. In this case adjust pointers and continue without
+		 * split.
+		 */
+		if (!msg_pl->sg.size) {
+			tls_merge_open_record(sk, rec, tmp, orig_end);
+			msg_pl = &rec->msg_plaintext;
+			msg_en = &rec->msg_encrypted;
+			split = false;
+		}
 		sk_msg_trim(sk, msg_en, msg_pl->sg.size +
 			    prot->overhead_size);
 	}
-- 
2.17.1



* [bpf PATCH v2 7/8] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (5 preceding siblings ...)
  2020-01-11  6:12 ` [bpf PATCH v2 6/8] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-01-11  6:12 ` [bpf PATCH v2 8/8] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
  2020-01-15 22:37 ` [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs Daniel Borkmann
  8 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

It's possible, through a set of push, pop, and apply helper calls, to
construct an skmsg, which is just a ring of scatterlist elements, with
the start value larger than the end value. For example,

      end       start
  |_0_|_1_| ... |_n_|_n+1_|

where end points at 1 and start points at n, so that the valid elements
are the set {n, n+1, 0, 1}.

Currently, because we don't build the correct chain, only {n, n+1} will
be sent. This adds a check and an sg_chain call to correctly submit the
above to the crypto and TLS send path.
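
To illustrate why only {n, n+1} would be submitted, here is a small
userspace toy model (made-up slot counts and sizes, not the scatterlist
API) of a wrapped ring and a linear walker that only reaches the wrapped
slots once the tail of the array is chained back to slot 0:

 #include <stdio.h>

 #define SLOTS 6

 struct toy_sg {
 	int len;
 	struct toy_sg *chain;	/* models a scatterlist chain entry */
 };

 static int walk(struct toy_sg *sg)
 {
 	int total = 0;

 	while (sg) {
 		if (sg->chain) {	/* follow the chain, like the crypto walker */
 			sg = sg->chain;
 			continue;
 		}
 		if (!sg->len)
 			break;		/* empty slot: end of the message */
 		total += sg->len;
 		sg++;
 	}
 	return total;
 }

 int main(void)
 {
 	/* start = 4, end = 2: valid elements are slots {4, 5, 0, 1}. */
 	struct toy_sg ring[SLOTS + 1] = {
 		[0] = { .len = 30 }, [1] = { .len = 40 },
 		[4] = { .len = 10 }, [5] = { .len = 20 },
 	};

 	printf("walk without chain: %d bytes\n", walk(&ring[4]));

 	/* The fix: chain the tail of the array back to slot 0 so a
 	 * linear walk also covers the wrapped part of the ring.
 	 */
 	ring[SLOTS].chain = &ring[0];
 	printf("walk with chain:    %d bytes\n", walk(&ring[4]));
 	return 0;
 }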

Cc: stable@vger.kernel.org
Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/tls/tls_sw.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 31f6bbbc8992..21c7725d17ca 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -729,6 +729,12 @@ static int tls_push_record(struct sock *sk, int flags,
 		sg_mark_end(sk_msg_elem(msg_pl, i));
 	}
 
+	if (msg_pl->sg.end < msg_pl->sg.start) {
+		sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
+			 MAX_SKB_FRAGS - msg_pl->sg.start + 1,
+			 msg_pl->sg.data);
+	}
+
 	i = msg_pl->sg.start;
 	sg_chain(rec->sg_aead_in, 2, &msg_pl->sg.data[i]);
 
-- 
2.17.1



* [bpf PATCH v2 8/8] bpf: sockmap/tls, fix pop data with SK_DROP return code
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (6 preceding siblings ...)
  2020-01-11  6:12 ` [bpf PATCH v2 7/8] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
@ 2020-01-11  6:12 ` John Fastabend
  2020-01-15 22:37 ` [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs Daniel Borkmann
  8 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-11  6:12 UTC (permalink / raw)
  To: bpf; +Cc: netdev, daniel, ast, john.fastabend, song, jonathan.lemon

When the user returns SK_DROP we need to reset the number of copied
bytes to indicate to the user that the bytes were dropped and not sent.
If we don't reset the copied arg, sendmsg will return as if those bytes
were copied, giving the user a positive return value.

This works as expected today except in the case where the user also
pops bytes. In the pop case sg.size is reduced, but we don't correctly
account for this when the copied bytes are reset. The popped bytes are
not accounted for, and we return a small positive value, potentially
confusing the user.

The reason this happens is a typo where we do the wrong comparison when
accounting for popped bytes. In this fix, notice the if/else is not
needed, and that we have a similar problem if we push data, except there
it is not visible to the user, because if delta is larger than sg.size
we return a negative value, so it appears as an error regardless.
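
To make the accounting concrete, here is a minimal userspace sketch of
the delta computation, modeled on the tls_sw.c variant of the check,
with hypothetical sizes (100 bytes queued, the verdict program pops 40
and returns SK_DROP):

 #include <stdio.h>

 int main(void)
 {
 	unsigned int size_before = 100;	/* sg.size before the verdict prog */
 	unsigned int size_after = 60;	/* the prog popped 40 bytes */
 	unsigned int delta;

 	/* Old accounting: the comparison is backwards, so the popped
 	 * bytes never make it into delta.
 	 */
 	delta = size_before;
 	if (delta < size_after)
 		delta -= size_after;
 	else
 		delta = 0;
 	printf("old delta: %u (popped bytes unaccounted)\n", delta);

 	/* Fixed accounting: delta is simply before minus after. */
 	delta = size_before - size_after;
 	printf("new delta: %u\n", delta);
 	return 0;
 }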

Cc: stable@vger.kernel.org
Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c | 5 +----
 net/tls/tls_sw.c   | 5 +----
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e6b08b5a0895..8a01428f80c1 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -315,10 +315,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 		 */
 		delta = msg->sg.size;
 		psock->eval = sk_psock_msg_verdict(sk, psock, msg);
-		if (msg->sg.size < delta)
-			delta -= msg->sg.size;
-		else
-			delta = 0;
+		delta -= msg->sg.size;
 	}
 
 	if (msg->cork_bytes &&
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 21c7725d17ca..159d49dab403 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -809,10 +809,7 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk,
 	if (psock->eval == __SK_NONE) {
 		delta = msg->sg.size;
 		psock->eval = sk_psock_msg_verdict(sk, psock, msg);
-		if (delta < msg->sg.size)
-			delta -= msg->sg.size;
-		else
-			delta = 0;
+		delta -= msg->sg.size;
 	}
 	if (msg->cork_bytes && msg->cork_bytes > msg->sg.size &&
 	    !enospc && !full_record) {
-- 
2.17.1



* Re: [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-11  6:12 ` [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
@ 2020-01-12  1:00   ` Jonathan Lemon
  2020-01-13 13:59   ` Jakub Sitnicki
  1 sibling, 0 replies; 18+ messages in thread
From: Jonathan Lemon @ 2020-01-12  1:00 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, ast, song



On 10 Jan 2020, at 22:12, John Fastabend wrote:

> When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
> and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
> don't push the write_space callback up and instead simply overwrite the
> op with the psock stored previous op. This may or may not be correct so
> to ensure we don't overwrite the TLS write space hook pass this field to
> the ULP and have it fixup the ctx.
>
> This completes a previous fix that pushed the ops through to the ULP
> but at the time missed doing this for write_space, presumably because
> write_space TLS hook was added around the same time.
>
> Cc: stable@vger.kernel.org
> Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>


* Re: [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
  2020-01-11  6:11 ` [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
@ 2020-01-13 13:29   ` Jakub Sitnicki
  2020-01-14  3:19     ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 13:29 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

On Sat, Jan 11, 2020 at 07:11 AM CET, John Fastabend wrote:
> When a sockmap is free'd and a socket in the map is enabled with tls
> we tear down the bpf context on the socket, the psock struct and state,
> and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
> the tls stack it needs to update its saved sock ops so that when the tls
> socket is later destroyed it doesn't try to call the now destroyed psock
> hooks.
>
> This is about keeping stacked ULPs in good shape so they always have
> the right set of stacked ops.
>
> However, recently unhash() hook was removed from TLS side. But, the
> sockmap/bpf side is not doing any extra work to update the unhash op
> when is torn down instead expecting TLS side to manage it. So both
> TLS and sockmap believe the other side is managing the op and instead
> no one updates the hook so it continues to point at tcp_bpf_unhash().
> When unhash hook is called we call tcp_bpf_unhash() which detects the
> psock has already been destroyed and calls sk->sk_prot_unhash() which
> calls tcp_bpf_unhash() yet again and so on looping and hanging the core.
>
> To fix have sockmap tear down logic fixup the stale pointer.
>
> Fixes: 5d92e631b8be ("net/tls: partially revert fix transition through disconnect with close")
> Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
> Cc: stable@vger.kernel.org
> Acked-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index ef7031f8a304..b6afe01f8592 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -358,6 +358,7 @@ static inline void sk_psock_update_proto(struct sock *sk,
>  static inline void sk_psock_restore_proto(struct sock *sk,
>  					  struct sk_psock *psock)
>  {
> +	sk->sk_prot->unhash = psock->saved_unhash;

We could also restore it from psock->sk_proto->unhash if we were not
NULL'ing on first call, right?

I've been wondering what is the purpose of having psock->saved_unhash
and psock->saved_close if we have the whole sk->sk_prot saved in
psock->sk_proto.

>  	sk->sk_write_space = psock->saved_write_space;
>
>  	if (psock->sk_proto) {

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-11  6:12 ` [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
  2020-01-12  1:00   ` Jonathan Lemon
@ 2020-01-13 13:59   ` Jakub Sitnicki
  1 sibling, 0 replies; 18+ messages in thread
From: Jakub Sitnicki @ 2020-01-13 13:59 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
> When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
> and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
> don't push the write_space callback up and instead simply overwrite the
> op with the psock stored previous op. This may or may not be correct so
> to ensure we don't overwrite the TLS write space hook pass this field to
> the ULP and have it fixup the ctx.
>
> This completes a previous fix that pushed the ops through to the ULP
> but at the time missed doing this for write_space, presumably because
> write_space TLS hook was added around the same time.
>
> Cc: stable@vger.kernel.org
> Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h | 12 ++++++++----
>  include/net/tcp.h     |  6 ++++--
>  net/ipv4/tcp_ulp.c    |  6 ++++--
>  net/tls/tls_main.c    | 10 +++++++---
>  4 files changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index b6afe01f8592..14d61bba0b79 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -359,17 +359,21 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>  					  struct sk_psock *psock)
>  {
>  	sk->sk_prot->unhash = psock->saved_unhash;
> -	sk->sk_write_space = psock->saved_write_space;
>  
>  	if (psock->sk_proto) {
>  		struct inet_connection_sock *icsk = inet_csk(sk);
>  		bool has_ulp = !!icsk->icsk_ulp_data;
>  
> -		if (has_ulp)
> -			tcp_update_ulp(sk, psock->sk_proto);
> -		else
> +		if (has_ulp) {
> +			tcp_update_ulp(sk, psock->sk_proto,
> +				       psock->saved_write_space);
> +		} else {
>  			sk->sk_prot = psock->sk_proto;
> +			sk->sk_write_space = psock->saved_write_space;
> +		}
>  		psock->sk_proto = NULL;
> +	} else {
> +		sk->sk_write_space = psock->saved_write_space;
>  	}
>  }
>  
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e460ea7f767b..e6f48384dc71 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2147,7 +2147,8 @@ struct tcp_ulp_ops {
>  	/* initialize ulp */
>  	int (*init)(struct sock *sk);
>  	/* update ulp */
> -	void (*update)(struct sock *sk, struct proto *p);
> +	void (*update)(struct sock *sk, struct proto *p,
> +		       void (*write_space)(struct sock *sk));
>  	/* cleanup ulp */
>  	void (*release)(struct sock *sk);
>  	/* diagnostic */
> @@ -2162,7 +2163,8 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
>  int tcp_set_ulp(struct sock *sk, const char *name);
>  void tcp_get_available_ulp(char *buf, size_t len);
>  void tcp_cleanup_ulp(struct sock *sk);
> -void tcp_update_ulp(struct sock *sk, struct proto *p);
> +void tcp_update_ulp(struct sock *sk, struct proto *p,
> +		    void (*write_space)(struct sock *sk));
>  
>  #define MODULE_ALIAS_TCP_ULP(name)				\
>  	__MODULE_INFO(alias, alias_userspace, name);		\
> diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
> index 12ab5db2b71c..38d3ad141161 100644
> --- a/net/ipv4/tcp_ulp.c
> +++ b/net/ipv4/tcp_ulp.c
> @@ -99,17 +99,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
>  	rcu_read_unlock();
>  }
>  
> -void tcp_update_ulp(struct sock *sk, struct proto *proto)
> +void tcp_update_ulp(struct sock *sk, struct proto *proto,
> +		    void (*write_space)(struct sock *sk))
>  {
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  
>  	if (!icsk->icsk_ulp_ops) {
> +		sk->sk_write_space = write_space;
>  		sk->sk_prot = proto;
>  		return;
>  	}
>  
>  	if (icsk->icsk_ulp_ops->update)
> -		icsk->icsk_ulp_ops->update(sk, proto);
> +		icsk->icsk_ulp_ops->update(sk, proto, write_space);
>  }
>  
>  void tcp_cleanup_ulp(struct sock *sk)
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index dac24c7aa7d4..94774c0e5ff3 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -732,15 +732,19 @@ static int tls_init(struct sock *sk)
>  	return rc;
>  }
>  
> -static void tls_update(struct sock *sk, struct proto *p)
> +static void tls_update(struct sock *sk, struct proto *p,
> +		       void (*write_space)(struct sock *sk))
>  {
>  	struct tls_context *ctx;
>  
>  	ctx = tls_get_ctx(sk);
> -	if (likely(ctx))
> +	if (likely(ctx)) {
> +		ctx->sk_write_space = write_space;
>  		ctx->sk_proto = p;
> -	else
> +	} else {
>  		sk->sk_prot = p;
> +		sk->sk_write_space = write_space;
> +	}
>  }
>  
>  static int tls_get_info(const struct sock *sk, struct sk_buff *skb)

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>


* Re: [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
  2020-01-13 13:29   ` Jakub Sitnicki
@ 2020-01-14  3:19     ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-01-14  3:19 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

Jakub Sitnicki wrote:
> On Sat, Jan 11, 2020 at 07:11 AM CET, John Fastabend wrote:
> > When a sockmap is free'd and a socket in the map is enabled with tls
> > we tear down the bpf context on the socket, the psock struct and state,
> > and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
> > the tls stack it needs to update its saved sock ops so that when the tls
> > socket is later destroyed it doesn't try to call the now destroyed psock
> > hooks.
> >
> > This is about keeping stacked ULPs in good shape so they always have
> > the right set of stacked ops.
> >
> > However, recently unhash() hook was removed from TLS side. But, the
> > sockmap/bpf side is not doing any extra work to update the unhash op
> > when is torn down instead expecting TLS side to manage it. So both
> > TLS and sockmap believe the other side is managing the op and instead
> > no one updates the hook so it continues to point at tcp_bpf_unhash().
> > When unhash hook is called we call tcp_bpf_unhash() which detects the
> > psock has already been destroyed and calls sk->sk_prot_unhash() which
> > calls tcp_bpf_unhash() yet again and so on looping and hanging the core.
> >
> > To fix have sockmap tear down logic fixup the stale pointer.
> >
> > Fixes: 5d92e631b8be ("net/tls: partially revert fix transition through disconnect with close")
> > Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
> > Cc: stable@vger.kernel.org
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  include/linux/skmsg.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> > index ef7031f8a304..b6afe01f8592 100644
> > --- a/include/linux/skmsg.h
> > +++ b/include/linux/skmsg.h
> > @@ -358,6 +358,7 @@ static inline void sk_psock_update_proto(struct sock *sk,
> >  static inline void sk_psock_restore_proto(struct sock *sk,
> >  					  struct sk_psock *psock)
> >  {
> > +	sk->sk_prot->unhash = psock->saved_unhash;
> 
> We could also restore it from psock->sk_proto->unhash if we were not
> NULL'ing on first call, right?
> 
> I've been wondering what is the purpose of having psock->saved_unhash
> and psock->saved_close if we have the whole sk->sk_prot saved in
> psock->sk_proto.

It's historical; we can do some cleanups in net-next to simplify this a
bit. I don't think it needs to be done in the bpf tree, though.

> 
> >  	sk->sk_write_space = psock->saved_write_space;
> >
> >  	if (psock->sk_proto) {
> 
> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>




* Re: [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs
  2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (7 preceding siblings ...)
  2020-01-11  6:12 ` [bpf PATCH v2 8/8] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
@ 2020-01-15 22:37 ` Daniel Borkmann
  8 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2020-01-15 22:37 UTC (permalink / raw)
  To: John Fastabend, bpf; +Cc: netdev, ast, song, jonathan.lemon

On 1/11/20 7:11 AM, John Fastabend wrote:
> To date our usage of sockmap/tls has been fairly simple, the BPF programs
> did only well-defined pop, push, pull and apply/cork operations.
> 
> Now that we started to push more complex programs into sockmap we uncovered
> a series of issues addressed here. Further OpenSSL3.0 version should be
> released soon with kTLS support so its important to get any remaining
> issues on BPF and kTLS support resolved.
> 
> Additionally, I have a patch under development to allow sockmap to be
> enabled/disabled at runtime for Cilium endpoints. This allows us to stress
> the map insert/delete with kTLS more than previously where Cilium only
> added the socket to the map when it entered ESTABLISHED state and never
> touched it from the control path side again relying on the sockets own
> close() hook to remove it.
> 
> To test I have a set of test cases in test_sockmap.c that expose these
> issues. Once we get fixes here merged and in bpf-next I'll submit the
> tests to bpf-next tree to ensure we don't regress again. Also I've run
> these patches in the Cilium CI with OpenSSL (master branch) this will
> run tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
> testing.
> 
> I'm aware of two more issues that we are working to resolve in another
> couple (probably two) patches. First we see an auth tag corruption in
> kTLS when sending small 1byte chunks under stress. I've not pinned this
> down yet. But, guessing because its under 1B stress tests it must be
> some error path being triggered. And second we need to ensure BPF RX
> programs are not skipped when kTLS ULP is loaded. This breaks some of
> the sockmap selftests when running with kTLS. I'll send a follow up
> for this.
> 
> v2: I dropped a patch that added !0 size check in tls_push_record
>      this originated from a panic I caught awhile ago with a trace
>      in the crypto stack. But I can not reproduce it anymore so will
>      dig into that and send another patch later if needed. Anyways
>      after a bit of thought it would be nicer if tls/crypto/bpf didn't
>      require special case handling for the !0 size.
> 
> John Fastabend (8):
>    bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
>    bpf: sockmap, ensure sock lock held during tear down
>    bpf: sockmap/tls, push write_space updates through ulp updates
>    bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds
>    bpf: sockmap/tls, msg_push_data may leave end mark in place
>    bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
>    bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra
>      chaining
>    bpf: sockmap/tls, fix pop data with SK_DROP return code
> 
>   include/linux/skmsg.h | 13 +++++++++----
>   include/net/tcp.h     |  6 ++++--
>   net/core/filter.c     | 11 ++++++-----
>   net/core/skmsg.c      |  2 ++
>   net/core/sock_map.c   |  7 ++++++-
>   net/ipv4/tcp_bpf.c    |  5 +----
>   net/ipv4/tcp_ulp.c    |  6 ++++--
>   net/tls/tls_main.c    | 10 +++++++---
>   net/tls/tls_sw.c      | 31 +++++++++++++++++++++++++++----
>   9 files changed, 66 insertions(+), 25 deletions(-)
> 

Applied, thanks!


* Re: [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
  2020-01-11  6:12 ` [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down John Fastabend
@ 2020-02-05 17:55   ` Jakub Sitnicki
  2020-02-06  5:51     ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Sitnicki @ 2020-02-05 17:55 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
> The sock_map_free() and sock_hash_free() paths used to delete sockmap
> and sockhash maps walk the maps and destroy psock and bpf state associated
> with the socks in the map. When done the socks no longer have BPF programs
> attached and will function normally. This can happen while the socks in
> the map are still "live" meaning data may be sent/received during the walk.
>
> Currently, though we don't take the sock_lock when the psock and bpf state
> is removed through this path. Specifically, this means we can be writing
> into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
> while they are also being called from the networking side. This is not
> safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
> believed it was safe. Further its not clear to me its even a good idea
> to try and do this on "live" sockets while networking side might also
> be using the socket. Instead of trying to reason about using the socks
> from both sides lets realize that every use case I'm aware of rarely
> deletes maps, in fact kubernetes/Cilium case builds map at init and
> never tears it down except on errors. So lets do the simple fix and
> grab sock lock.
>
> This patch wraps sock deletes from maps in sock lock and adds some
> annotations so we catch any other cases easier.
>
> Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
> Cc: stable@vger.kernel.org
> Acked-by: Song Liu <songliubraving@fb.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/core/skmsg.c    | 2 ++
>  net/core/sock_map.c | 7 ++++++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index ded2d5227678..3866d7e20c07 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
>
>  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>  {
> +	sock_owned_by_me(sk);
> +
>  	sk_psock_cork_free(psock);
>  	sk_psock_zap_ingress(psock);
>
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index eb114ee419b6..8998e356f423 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
>  		struct sock *sk;
>
>  		sk = xchg(psk, NULL);
> -		if (sk)
> +		if (sk) {
> +			lock_sock(sk);
>  			sock_map_unref(sk, psk);
> +			release_sock(sk);
> +		}
>  	}
>  	raw_spin_unlock_bh(&stab->lock);
>  	rcu_read_unlock();

John, I've noticed this is triggering warnings that we might sleep in
lock_sock while (1) in an RCU read-side section, and (2) holding a spin
lock:

=============================
WARNING: suspicious RCU usage
5.5.0-04012-g38c811e4cd3c #443 Not tainted
-----------------------------
include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
4 locks held by kworker/0:1/62:
 #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
 #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
 #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
 #3: ffff8881383dbdf8 (&stab->lock){+.-.}, at: sock_map_free+0x64/0x170

stack backtrace:
CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04012-g38c811e4cd3c #443
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
Workqueue: events bpf_map_free_deferred
Call Trace:
 dump_stack+0x71/0xa0
 ___might_sleep+0x105/0x190
 lock_sock_nested+0x28/0x90
 sock_map_free+0x95/0x170
 bpf_map_free_deferred+0x58/0x80
 process_one_work+0x260/0x5e0
 worker_thread+0x4d/0x3e0
 kthread+0x108/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x3a/0x50
BUG: sleeping function called from invalid context at net/core/sock.c:2942
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
4 locks held by kworker/0:1/62:
 #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
 #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
 #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
 #3: ffff8881383dbdf8 (&stab->lock){+.-.}, at: sock_map_free+0x64/0x170
CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04012-g38c811e4cd3c #443
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
Workqueue: events bpf_map_free_deferred
Call Trace:
 dump_stack+0x71/0xa0
 ___might_sleep.cold+0xa6/0xb6
 lock_sock_nested+0x28/0x90
 sock_map_free+0x95/0x170
 bpf_map_free_deferred+0x58/0x80
 process_one_work+0x260/0x5e0
 worker_thread+0x4d/0x3e0
 kthread+0x108/0x140
 ? process_one_work+0x5e0/0x5e0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x3a/0x50

Easy to trigger on a VM with 1 vCPU, reproducer below.

Here's an idea for how to change the locking. I'm still wrapping my head
around what protects what in sock_map_free, so please bear with me:

1. synchronize_rcu before we iterate over the array is not needed,
   AFAICT. We are not free'ing the map just yet, hence any readers
   accessing the map via the psock are not in danger of use-after-free.

2. rcu_read_lock is needed to protect access to psock inside
   sock_map_unref, but we can't sleep while in an RCU read-side
   section. So push it down, after we grab the sock lock.

3. Grabbing stab->lock also seems unnecessary. We get called from
   bpf_map_free_deferred, after the map refcnt dropped to 0, so we're
   not racing with any other map user to modify its contents.

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 2cbde385e1a0..7f54e0d27d32 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -259,9 +259,6 @@ static void sock_map_free(struct bpf_map *map)
        struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
        int i;

-       synchronize_rcu();
-       rcu_read_lock();
-       raw_spin_lock_bh(&stab->lock);
        for (i = 0; i < stab->map.max_entries; i++) {
                struct sock **psk = &stab->sks[i];
                struct sock *sk;
@@ -269,12 +266,12 @@ static void sock_map_free(struct bpf_map *map)
                sk = xchg(psk, NULL);
                if (sk) {
                        lock_sock(sk);
+                       rcu_read_lock();
                        sock_map_unref(sk, psk);
+                       rcu_read_unlock();
                        release_sock(sk);
                }
        }
-       raw_spin_unlock_bh(&stab->lock);
-       rcu_read_unlock();

        synchronize_rcu();

> @@ -862,7 +865,9 @@ static void sock_hash_free(struct bpf_map *map)
>  		raw_spin_lock_bh(&bucket->lock);
>  		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
>  			hlist_del_rcu(&elem->node);
> +			lock_sock(elem->sk);
>  			sock_map_unref(elem->sk, elem);
> +			release_sock(elem->sk);
>  		}
>  		raw_spin_unlock_bh(&bucket->lock);
>  	}

Similar problems here, with one extra: it seems we're missing a
synchronize_rcu *after* walking over the htable, for the same reason it
got added to sock_map_free in 2bb90e5cc90e ("bpf: sockmap,
synchronize_rcu before free'ing map"):

    We need to have a synchronize_rcu before free'ing the sockmap because
    any outstanding psock references will have a pointer to the map and
    when they use this could trigger a use after free.

WDYT?

Reproducer follows.

---8<---

/* sockmap_update.c */

#include <errno.h>
#include <error.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#include <bpf/bpf.h>

#define fail(fmt...) error_at_line(1, errno, __func__, __LINE__, fmt)

int main(void)
{
	struct sockaddr_storage addr_ = {0};
	struct sockaddr *addr = (void *) &addr_;
	socklen_t len = sizeof(addr_);
	int srv, cli, map, key;

	srv = socket(AF_INET, SOCK_STREAM, 0);
	if (srv == -1)
		fail("socket(cli)");

	if (listen(srv, SOMAXCONN))
		fail("listen");

	if (getsockname(srv, addr, &len))
		fail("getsockname");

	cli = socket(AF_INET, SOCK_STREAM, 0);
	if (cli == -1)
		fail("socket(srv)");

	if (connect(cli, addr, len))
		fail("connect");

	map = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(int), sizeof(int), 1, 0);
	if (map == -1)
		fail("bpf_create_map");

	key = 0;
	if (bpf_map_update_elem(map, &key, &cli, BPF_NOEXIST))
		fail("bpf_map_update_elem");

	if (close(map))
		fail("close(map)");

	if (close(cli))
		fail("close(cli)");

	if (close(srv))
		fail("close(srv)");

	return 0;
}

--->8---


* Re: [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
  2020-02-05 17:55   ` Jakub Sitnicki
@ 2020-02-06  5:51     ` John Fastabend
  2020-02-06 12:26       ` Jakub Sitnicki
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2020-02-06  5:51 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

Jakub Sitnicki wrote:
> On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
> > The sock_map_free() and sock_hash_free() paths used to delete sockmap
> > and sockhash maps walk the maps and destroy psock and bpf state associated
> > with the socks in the map. When done the socks no longer have BPF programs
> > attached and will function normally. This can happen while the socks in
> > the map are still "live" meaning data may be sent/received during the walk.
> >
> > Currently, though we don't take the sock_lock when the psock and bpf state
> > is removed through this path. Specifically, this means we can be writing
> > into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
> > while they are also being called from the networking side. This is not
> > safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
> > believed it was safe. Further its not clear to me its even a good idea
> > to try and do this on "live" sockets while networking side might also
> > be using the socket. Instead of trying to reason about using the socks
> > from both sides lets realize that every use case I'm aware of rarely
> > deletes maps, in fact kubernetes/Cilium case builds map at init and
> > never tears it down except on errors. So lets do the simple fix and
> > grab sock lock.
> >
> > This patch wraps sock deletes from maps in sock lock and adds some
> > annotations so we catch any other cases easier.
> >
> > Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
> > Cc: stable@vger.kernel.org
> > Acked-by: Song Liu <songliubraving@fb.com>
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  net/core/skmsg.c    | 2 ++
> >  net/core/sock_map.c | 7 ++++++-
> >  2 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index ded2d5227678..3866d7e20c07 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
> >
> >  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
> >  {
> > +	sock_owned_by_me(sk);
> > +
> >  	sk_psock_cork_free(psock);
> >  	sk_psock_zap_ingress(psock);
> >
> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > index eb114ee419b6..8998e356f423 100644
> > --- a/net/core/sock_map.c
> > +++ b/net/core/sock_map.c
> > @@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
> >  		struct sock *sk;
> >
> >  		sk = xchg(psk, NULL);
> > -		if (sk)
> > +		if (sk) {
> > +			lock_sock(sk);
> >  			sock_map_unref(sk, psk);
> > +			release_sock(sk);
> > +		}
> >  	}
> >  	raw_spin_unlock_bh(&stab->lock);
> >  	rcu_read_unlock();
> 
> John, I've noticed this is triggering warnings that we might sleep in
> lock_sock while (1) in RCU read-side section, and (2) holding a spin
> lock:

OK, this patch was misguided; it makes things worse rather than better.
The real fix is to do proper READ_ONCE/WRITE_ONCE on the sk ops. That
is coming soon. Thanks for following up on it.
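
For the curious, the rough shape I have in mind is to publish the proto
pointer with WRITE_ONCE() when we install or restore callbacks and fetch
it with READ_ONCE() wherever the networking side dereferences it. A
minimal sketch (illustrative only, not the actual patch; field and
variable names are approximate):

	/* update/restore side, e.g. when dropping psock state */
	WRITE_ONCE(sk->sk_prot, saved_proto);

	/* fast-path side, pairs with the WRITE_ONCE() above */
	struct proto *prot = READ_ONCE(sk->sk_prot);

	ret = prot->sendmsg(sk, msg, len);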

> 
> =============================
> WARNING: suspicious RCU usage
> 5.5.0-04012-g38c811e4cd3c #443 Not tainted
> -----------------------------
> include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
> 
> other info that might help us debug this:

[...]

> 
> Easy to trigger on a VM with 1 vCPU, reproducer below.

When net-next opens we should add some tests to cover this; it's not
something I test often because in our use case the maps are rarely free'd.

> 
> Here's an idea how to change the locking. I'm still wrapping my head
> around what protects what in sock_map_free, so please bear with me:
> 
> 1. synchronize_rcu before we iterate over the array is not needed,
>    AFAICT. We are not free'ing the map just yet, hence any readers
>    accessing the map via the psock are not in danger of use-after-free.

Agreed. When we added 2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before
free'ing map") we could have done this.

> 
> 2. rcu_read_lock is needed to protect access to psock inside
>    sock_map_unref, but we can't sleep while in RCU read-side.  So push
>    it down, after we grab the sock lock.

yes this looks better.

> 
> 3. Grabbing stab->lock seems not needed, either. We get called from
>    bpf_map_free_deferred, after map refcnt dropped to 0, so we're not
>    racing with any other map user to modify its contents.

This I'll need to think on a bit. We have the link-lock there so
probably should be safe to drop. But will need to trace this through
git history to be sure.

> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 2cbde385e1a0..7f54e0d27d32 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -259,9 +259,6 @@ static void sock_map_free(struct bpf_map *map)
>         struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
>         int i;
> 
> -       synchronize_rcu();
> -       rcu_read_lock();
> -       raw_spin_lock_bh(&stab->lock);
>         for (i = 0; i < stab->map.max_entries; i++) {
>                 struct sock **psk = &stab->sks[i];
>                 struct sock *sk;
> @@ -269,12 +266,12 @@ static void sock_map_free(struct bpf_map *map)
>                 sk = xchg(psk, NULL);
>                 if (sk) {
>                         lock_sock(sk);
> +                       rcu_read_lock();
>                         sock_map_unref(sk, psk);
> +                       rcu_read_unlock();
>                         release_sock(sk);
>                 }
>         }
> -       raw_spin_unlock_bh(&stab->lock);
> -       rcu_read_unlock();
> 
>         synchronize_rcu();
> 
> > @@ -862,7 +865,9 @@ static void sock_hash_free(struct bpf_map *map)
> >  		raw_spin_lock_bh(&bucket->lock);
> >  		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
> >  			hlist_del_rcu(&elem->node);
> > +			lock_sock(elem->sk);
> >  			sock_map_unref(elem->sk, elem);
> > +			release_sock(elem->sk);
> >  		}
> >  		raw_spin_unlock_bh(&bucket->lock);
> >  	}
> 
> Similar problems here, with one extra issue: it seems we're missing a
> synchronize_rcu *after* walking over the htable, for the same reason it
> got added to sock_map_free in 2bb90e5cc90e ("bpf: sockmap,
> synchronize_rcu before free'ing map"):
> 
>     We need to have a synchronize_rcu before free'ing the sockmap because
>     any outstanding psock references will have a pointer to the map and
>     when they use this could trigger a use after free.
> 
> WDYT?

Can you push the fix to bpf but leave the stab->lock for now. I think
we can do a slightly better cleanup on stab->lock in bpf-next.

> 
> Reproducer follows.

push reproducer into selftests?

Thanks.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
  2020-02-06  5:51     ` John Fastabend
@ 2020-02-06 12:26       ` Jakub Sitnicki
  2020-02-06 19:04         ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Sitnicki @ 2020-02-06 12:26 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

On Thu, Feb 06, 2020 at 06:51 AM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
>> > The sock_map_free() and sock_hash_free() paths used to delete sockmap
>> > and sockhash maps walk the maps and destroy psock and bpf state associated
>> > with the socks in the map. When done the socks no longer have BPF programs
>> > attached and will function normally. This can happen while the socks in
>> > the map are still "live" meaning data may be sent/received during the walk.
>> >
>> > Currently, though we don't take the sock_lock when the psock and bpf state
>> > is removed through this path. Specifically, this means we can be writing
>> > into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
>> > while they are also being called from the networking side. This is not
>> > safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
>> > believed it was safe. Further its not clear to me its even a good idea
>> > to try and do this on "live" sockets while networking side might also
>> > be using the socket. Instead of trying to reason about using the socks
>> > from both sides lets realize that every use case I'm aware of rarely
>> > deletes maps, in fact kubernetes/Cilium case builds map at init and
>> > never tears it down except on errors. So lets do the simple fix and
>> > grab sock lock.
>> >
>> > This patch wraps sock deletes from maps in sock lock and adds some
>> > annotations so we catch any other cases easier.
>> >
>> > Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
>> > Cc: stable@vger.kernel.org
>> > Acked-by: Song Liu <songliubraving@fb.com>
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>> >  net/core/skmsg.c    | 2 ++
>> >  net/core/sock_map.c | 7 ++++++-
>> >  2 files changed, 8 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> > index ded2d5227678..3866d7e20c07 100644
>> > --- a/net/core/skmsg.c
>> > +++ b/net/core/skmsg.c
>> > @@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
>> >
>> >  void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
>> >  {
>> > +	sock_owned_by_me(sk);
>> > +
>> >  	sk_psock_cork_free(psock);
>> >  	sk_psock_zap_ingress(psock);
>> >
>> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>> > index eb114ee419b6..8998e356f423 100644
>> > --- a/net/core/sock_map.c
>> > +++ b/net/core/sock_map.c
>> > @@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
>> >  		struct sock *sk;
>> >
>> >  		sk = xchg(psk, NULL);
>> > -		if (sk)
>> > +		if (sk) {
>> > +			lock_sock(sk);
>> >  			sock_map_unref(sk, psk);
>> > +			release_sock(sk);
>> > +		}
>> >  	}
>> >  	raw_spin_unlock_bh(&stab->lock);
>> >  	rcu_read_unlock();
>>
>> John, I've noticed this is triggering warnings that we might sleep in
>> lock_sock while (1) in RCU read-side section, and (2) holding a spin
>> lock:

[...]

>>
>> Here's an idea how to change the locking. I'm still wrapping my head
>> around what protects what in sock_map_free, so please bear with me:
>>
>> 1. synchronize_rcu before we iterate over the array is not needed,
>>    AFAICT. We are not free'ing the map just yet, hence any readers
>>    accessing the map via the psock are not in danger of use-after-free.
>
> Agreed. When we added 2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before
> free'ing map") we could have done this.
>
>>
>> 2. rcu_read_lock is needed to protect access to psock inside
>>    sock_map_unref, but we can't sleep while in RCU read-side.  So push
>>    it down, after we grab the sock lock.
>
> yes this looks better.
>
>>
>> 3. Grabbing stab->lock seems not needed, either. We get called from
>>    bpf_map_free_deferred, after map refcnt dropped to 0, so we're not
>>    racing with any other map user to modify its contents.
>
> This I'll need to think on a bit. We have the link-lock there so
> probably should be safe to drop. But will need to trace this through
> git history to be sure.
>

[...]

>> WDYT?
>
> Can you push the fix to bpf but leave the stab->lock for now. I think
> we can do a slightly better cleanup on stab->lock in bpf-next.

Here it is:

https://lore.kernel.org/bpf/20200206111652.694507-1-jakub@cloudflare.com/T/#t

I left the "extra" synchronize_rcu before walking the map. On second
thought, this isn't a bug. Just adds extra wait. bpf-next material?

>
>>
>> Reproducer follows.
>
> push reproducer into selftests?

Included the reproducer with the fixes. If it gets dropped from the
series, I'll resubmit it once bpf-next reopens.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down
  2020-02-06 12:26       ` Jakub Sitnicki
@ 2020-02-06 19:04         ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2020-02-06 19:04 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend
  Cc: bpf, netdev, daniel, ast, song, jonathan.lemon

Jakub Sitnicki wrote:
> On Thu, Feb 06, 2020 at 06:51 AM CET, John Fastabend wrote:
> > Jakub Sitnicki wrote:
> >> On Sat, Jan 11, 2020 at 07:12 AM CET, John Fastabend wrote:
> >> > The sock_map_free() and sock_hash_free() paths used to delete sockmap
> >> > and sockhash maps walk the maps and destroy psock and bpf state associated
> >> > with the socks in the map. When done the socks no longer have BPF programs
> >> > attached and will function normally. This can happen while the socks in
> >> > the map are still "live" meaning data may be sent/received during the walk.

[...]

> >>
> >> John, I've noticed this is triggering warnings that we might sleep in
> >> lock_sock while (1) in RCU read-side section, and (2) holding a spin
> >> lock:
> 
> [...]
> 
> >>
> >> Here's an idea how to change the locking. I'm still wrapping my head
> >> around what protects what in sock_map_free, so please bear with me:
> >>
> >> 1. synchronize_rcu before we iterate over the array is not needed,
> >>    AFAICT. We are not free'ing the map just yet, hence any readers
> >>    accessing the map via the psock are not in danger of use-after-free.
> >
> > Agreed. When we added 2bb90e5cc90e ("bpf: sockmap, synchronize_rcu before
> > free'ing map") we could have done this.
> >
> >>
> >> 2. rcu_read_lock is needed to protect access to psock inside
> >>    sock_map_unref, but we can't sleep while in RCU read-side.  So push
> >>    it down, after we grab the sock lock.
> >
> > yes this looks better.
> >
> >>
> >> 3. Grabbing stab->lock seems not needed, either. We get called from
> >>    bpf_map_free_deferred, after map refcnt dropped to 0, so we're not
> >>    racing with any other map user to modify its contents.
> >
> > This I'll need to think on a bit. We have the link-lock there so
> > probably should be safe to drop. But will need to trace this through
> > git history to be sure.
> >
> 
> [...]
> 
> >> WDYT?
> >
> > Can you push the fix to bpf but leave the stab->lock for now. I think
> > we can do a slightly better cleanup on stab->lock in bpf-next.
> 
> Here it is:
> 
> https://lore.kernel.org/bpf/20200206111652.694507-1-jakub@cloudflare.com/T/#t
> 
> I left the "extra" synchronize_rcu before walking the map. On second
> thought, this isn't a bug. Just adds extra wait. bpf-next material?

Agree.

> 
> >
> >>
> >> Reproducer follows.
> >
> > push reproducer into selftests?
> 
> Included the reproducer with the fixes. If it gets dropped from the
> series, I'll resubmit it once bpf-next reopens.

Yeah, I don't have a strong preference where it lands; I have a set of
tests for bpf-next once it opens as well.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-02-06 19:05 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-11  6:11 [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs John Fastabend
2020-01-11  6:11 ` [bpf PATCH v2 1/8] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
2020-01-13 13:29   ` Jakub Sitnicki
2020-01-14  3:19     ` John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 2/8] bpf: sockmap, ensure sock lock held during tear down John Fastabend
2020-02-05 17:55   ` Jakub Sitnicki
2020-02-06  5:51     ` John Fastabend
2020-02-06 12:26       ` Jakub Sitnicki
2020-02-06 19:04         ` John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 3/8] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
2020-01-12  1:00   ` Jonathan Lemon
2020-01-13 13:59   ` Jakub Sitnicki
2020-01-11  6:12 ` [bpf PATCH v2 4/8] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 5/8] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 6/8] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 7/8] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
2020-01-11  6:12 ` [bpf PATCH v2 8/8] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
2020-01-15 22:37 ` [bpf PATCH v2 0/8] Fixes for sockmap/tls from more complex BPF progs Daniel Borkmann
