BPF Archive on lore.kernel.org
* [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs
@ 2020-01-08 21:13 John Fastabend
  2020-01-08 21:14 ` [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
                   ` (8 more replies)
  0 siblings, 9 replies; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:13 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

To date our usage of sockmap/tls has been fairly simple; the BPF programs
did only well-defined pop, push, pull and apply/cork operations.

Now that we have started to push more complex programs into sockmap, we have
uncovered a series of issues, addressed here. Further, an OpenSSL 3.0 release
with kTLS support should arrive soon, so it's important to get any remaining
issues with BPF and kTLS support resolved.

Additionally, I have a patch under development to allow sockmap to be
enabled/disabled at runtime for Cilium endpoints. This allows us to stress
map insert/delete with kTLS more than before, when Cilium only added the
socket to the map once it entered the ESTABLISHED state and never touched
it from the control-path side again, relying on the socket's own
close() hook to remove it. The selftests are great, but a cluster
full of thousands of sockets finds these things fairly quickly.

To test, I have a set of test cases in test_sockmap.c that expose these
issues. Once we get the fixes here merged and in bpf-next, I'll submit the
tests to the bpf-next tree to ensure we don't regress again. I've also run
these patches in the Cilium CI with OpenSSL (master branch), which runs
tools such as netperf, ab, wrk2, curl, etc. to get a broad set of
testing.

I'm aware of two more issues that we are working to resolve in another
couple (probably two) patches. First, we see an auth tag corruption in
kTLS when sending small 1-byte chunks under stress. I've not pinned this
down yet, but guessing, because it shows up under 1-byte stress tests it
must be some error path being triggered. Second, we need to ensure BPF RX
programs are not skipped when the kTLS ULP is loaded. This breaks some of
the sockmap selftests when running with kTLS. I'll send a follow-up
for this.

Any review/comments appreciated. Thanks!

---

John Fastabend (9):
      bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
      bpf: sockmap, ensure sock lock held during tear down
      bpf: sockmap/tls, push write_space updates through ulp updates
      bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds
      bpf: sockmap/tls, msg_push_data may leave end mark in place
      bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
      bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining
      bpf: sockmap/tls, tls_push_record can not handle zero length skmsg
      bpf: sockmap/tls, fix pop data with SK_DROP return code


 include/linux/skmsg.h |   13 +++++++++----
 include/net/tcp.h     |    6 ++++--
 net/core/filter.c     |   11 ++++++-----
 net/core/skmsg.c      |    2 ++
 net/core/sock_map.c   |    7 ++++++-
 net/ipv4/tcp_bpf.c    |    5 +----
 net/ipv4/tcp_ulp.c    |    6 ++++--
 net/tls/tls_main.c    |   10 +++++++---
 net/tls/tls_sw.c      |   34 ++++++++++++++++++++++++++++++----
 9 files changed, 69 insertions(+), 25 deletions(-)

--
Signature

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
@ 2020-01-08 21:14 ` John Fastabend
  2020-01-09  1:34   ` Song Liu
  2020-01-08 21:14 ` [bpf PATCH 2/9] bpf: sockmap, ensure sock lock held during tear down John Fastabend
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:14 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

When a sockmap is freed and a socket in the map has tls enabled, we tear
down the bpf context on the socket, the psock struct and state, and then
call tcp_update_ulp(). The tcp_update_ulp() call informs the tls stack
that it needs to update its saved sock ops so that when the tls socket is
later destroyed it doesn't try to call the now-destroyed psock hooks.

This is about keeping stacked ULPs in good shape so they always have
the right set of stacked ops.

However, the unhash() hook was recently removed from the TLS side. But
the sockmap/bpf side does no extra work to update the unhash op when it
is torn down, instead expecting the TLS side to manage it. So both
TLS and sockmap believe the other side is managing the op, no one
updates the hook, and it continues to point at tcp_bpf_unhash().
When the unhash hook is called we call tcp_bpf_unhash(), which detects the
psock has already been destroyed and calls sk->sk_prot->unhash(), which
calls tcp_bpf_unhash() yet again, and so on, looping and hanging the core.

To fix, have the sockmap tear-down logic fix up the stale pointer.
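
As a stand-alone illustration (the struct and function names below are
hypothetical, not the kernel's structures), the hang comes from a stale
function pointer that is never restored; the one-line fix is to put the
saved op back during restore:

```c
#include <assert.h>

struct fake_proto { void (*unhash)(void); };
struct fake_psock { void (*saved_unhash)(void); };

static int unhash_calls;
static void tcp_unhash_stub(void) { unhash_calls++; }

/* Analogue of the one-line fix in sk_psock_restore_proto(): restore the
 * saved unhash op so later unhash() calls don't land on tcp_bpf_unhash()
 * after the psock is gone and recurse forever.
 */
static void restore_proto(struct fake_proto *prot, struct fake_psock *psock)
{
	prot->unhash = psock->saved_unhash;
}
```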

Fixes: 5d92e631b8be ("net/tls: partially revert fix transition through disconnect with close")
Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index ef7031f8a304..b6afe01f8592 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -358,6 +358,7 @@ static inline void sk_psock_update_proto(struct sock *sk,
 static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
+	sk->sk_prot->unhash = psock->saved_unhash;
 	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {



* [bpf PATCH 2/9] bpf: sockmap, ensure sock lock held during tear down
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
  2020-01-08 21:14 ` [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
@ 2020-01-08 21:14 ` John Fastabend
  2020-01-09 17:10   ` Song Liu
  2020-01-08 21:14 ` [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:14 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy the psock and bpf state
associated with the socks in the map. When done, the socks no longer have
BPF programs attached and will function normally. This can happen while the
socks in the map are still "live", meaning data may be sent/received during
the walk.

Currently, though, we don't take the sock lock when the psock and bpf state
is removed through this path. Specifically, this means we can be writing
into the ops structure pointers, such as sendmsg, sendpage, recvmsg, etc.,
while they are also being called from the networking side. This is not
safe; we never used proper READ_ONCE/WRITE_ONCE semantics here, even if we
believed it was safe. Further, it's not clear to me it's even a good idea
to try to do this on "live" sockets while the networking side might also
be using the socket. Instead of trying to reason about using the socks
from both sides, note that every use case I'm aware of rarely
deletes maps; in fact, the Kubernetes/Cilium case builds the map at init and
never tears it down except on errors. So let's do the simple fix and
grab the sock lock.

This patch wraps sock deletes from maps in the sock lock and adds some
annotations so we catch any other cases more easily.

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/skmsg.c    |    2 ++
 net/core/sock_map.c |    7 ++++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index ded2d5227678..3866d7e20c07 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -594,6 +594,8 @@ EXPORT_SYMBOL_GPL(sk_psock_destroy);
 
 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
 {
+	sock_owned_by_me(sk);
+
 	sk_psock_cork_free(psock);
 	sk_psock_zap_ingress(psock);
 
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index eb114ee419b6..8998e356f423 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -241,8 +241,11 @@ static void sock_map_free(struct bpf_map *map)
 		struct sock *sk;
 
 		sk = xchg(psk, NULL);
-		if (sk)
+		if (sk) {
+			lock_sock(sk);
 			sock_map_unref(sk, psk);
+			release_sock(sk);
+		}
 	}
 	raw_spin_unlock_bh(&stab->lock);
 	rcu_read_unlock();
@@ -862,7 +865,9 @@ static void sock_hash_free(struct bpf_map *map)
 		raw_spin_lock_bh(&bucket->lock);
 		hlist_for_each_entry_safe(elem, node, &bucket->head, node) {
 			hlist_del_rcu(&elem->node);
+			lock_sock(elem->sk);
 			sock_map_unref(elem->sk, elem);
+			release_sock(elem->sk);
 		}
 		raw_spin_unlock_bh(&bucket->lock);
 	}



* [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
  2020-01-08 21:14 ` [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
  2020-01-08 21:14 ` [bpf PATCH 2/9] bpf: sockmap, ensure sock lock held during tear down John Fastabend
@ 2020-01-08 21:14 ` John Fastabend
  2020-01-09 10:33   ` Jakub Sitnicki
  2020-01-08 21:14 ` [bpf PATCH 4/9] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:14 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

When a sockmap sock with TLS enabled is removed, we clean up bpf/psock state
and call tcp_update_ulp() to push updates to the TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the previous op stored by the psock. This may or may not be correct,
so to ensure we don't overwrite the TLS write space hook, pass this field to
the ULP and have it fix up the ctx.

This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
the write_space TLS hook was added around the same time.

Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 include/linux/skmsg.h |   12 ++++++++----
 include/net/tcp.h     |    6 ++++--
 net/ipv4/tcp_ulp.c    |    6 ++++--
 net/tls/tls_main.c    |   10 +++++++---
 4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index b6afe01f8592..14d61bba0b79 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -359,17 +359,21 @@ static inline void sk_psock_restore_proto(struct sock *sk,
 					  struct sk_psock *psock)
 {
 	sk->sk_prot->unhash = psock->saved_unhash;
-	sk->sk_write_space = psock->saved_write_space;
 
 	if (psock->sk_proto) {
 		struct inet_connection_sock *icsk = inet_csk(sk);
 		bool has_ulp = !!icsk->icsk_ulp_data;
 
-		if (has_ulp)
-			tcp_update_ulp(sk, psock->sk_proto);
-		else
+		if (has_ulp) {
+			tcp_update_ulp(sk, psock->sk_proto,
+				       psock->saved_write_space);
+		} else {
 			sk->sk_prot = psock->sk_proto;
+			sk->sk_write_space = psock->saved_write_space;
+		}
 		psock->sk_proto = NULL;
+	} else {
+		sk->sk_write_space = psock->saved_write_space;
 	}
 }
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index e460ea7f767b..e6f48384dc71 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2147,7 +2147,8 @@ struct tcp_ulp_ops {
 	/* initialize ulp */
 	int (*init)(struct sock *sk);
 	/* update ulp */
-	void (*update)(struct sock *sk, struct proto *p);
+	void (*update)(struct sock *sk, struct proto *p,
+		       void (*write_space)(struct sock *sk));
 	/* cleanup ulp */
 	void (*release)(struct sock *sk);
 	/* diagnostic */
@@ -2162,7 +2163,8 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
 int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
-void tcp_update_ulp(struct sock *sk, struct proto *p);
+void tcp_update_ulp(struct sock *sk, struct proto *p,
+		    void (*write_space)(struct sock *sk));
 
 #define MODULE_ALIAS_TCP_ULP(name)				\
 	__MODULE_INFO(alias, alias_userspace, name);		\
diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
index 12ab5db2b71c..38d3ad141161 100644
--- a/net/ipv4/tcp_ulp.c
+++ b/net/ipv4/tcp_ulp.c
@@ -99,17 +99,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
 	rcu_read_unlock();
 }
 
-void tcp_update_ulp(struct sock *sk, struct proto *proto)
+void tcp_update_ulp(struct sock *sk, struct proto *proto,
+		    void (*write_space)(struct sock *sk))
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	if (!icsk->icsk_ulp_ops) {
+		sk->sk_write_space = write_space;
 		sk->sk_prot = proto;
 		return;
 	}
 
 	if (icsk->icsk_ulp_ops->update)
-		icsk->icsk_ulp_ops->update(sk, proto);
+		icsk->icsk_ulp_ops->update(sk, proto, write_space);
 }
 
 void tcp_cleanup_ulp(struct sock *sk)
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index dac24c7aa7d4..94774c0e5ff3 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -732,15 +732,19 @@ static int tls_init(struct sock *sk)
 	return rc;
 }
 
-static void tls_update(struct sock *sk, struct proto *p)
+static void tls_update(struct sock *sk, struct proto *p,
+		       void (*write_space)(struct sock *sk))
 {
 	struct tls_context *ctx;
 
 	ctx = tls_get_ctx(sk);
-	if (likely(ctx))
+	if (likely(ctx)) {
+		ctx->sk_write_space = write_space;
 		ctx->sk_proto = p;
-	else
+	} else {
 		sk->sk_prot = p;
+		sk->sk_write_space = write_space;
+	}
 }
 
 static int tls_get_info(const struct sock *sk, struct sk_buff *skb)



* [bpf PATCH 4/9] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (2 preceding siblings ...)
  2020-01-08 21:14 ` [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
@ 2020-01-08 21:14 ` John Fastabend
  2020-01-09 18:37   ` Song Liu
  2020-01-08 21:15 ` [bpf PATCH 5/9] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:14 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

In the push, pull, and pop helpers that operate on skmsg objects to make
data writable or to insert/remove data, we use this bounds check to ensure
the specified data is valid:

 /* Bounds checks: start and pop must be inside message */
 if (start >= offset + l || last >= msg->sg.size)
     return -EINVAL;

The problem here is that offset already includes the length of the
current element, the 'l' above. So start could be past the end of
the scatterlist element in the case where start also points into an
offset on the last skmsg element.

To fix, do the accounting slightly differently: add the length of
the previous entry to offset at the start of the iteration, and
ensure it is initialized to zero so that the first iteration does
nothing.
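
A stand-alone sketch of the fixed accounting (the lengths[] array stands
in for the scatterlist element lengths; this is illustrative, not the
kernel code):

```c
#include <assert.h>

/* Model of the fixed start-element lookup used by the pull/push/pop
 * helpers. Adding the previous element's length at the top of the loop,
 * with l initialized to 0, keeps offset pointing at the first byte of
 * element i, so the "start < offset + l" bound is exact.
 */
static int find_start_elem(const unsigned int *lengths, int n,
			   unsigned int start, unsigned int *offset_out)
{
	unsigned int offset = 0, l = 0;
	int i;

	for (i = 0; i < n; i++) {
		offset += l;	/* length of the previous element */
		l = lengths[i];
		if (start < offset + l)
			break;
	}
	if (offset_out)
		*offset_out = offset;
	return i < n ? i : -1;	/* -1: start is out of bounds */
}
```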

Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/filter.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 28b3c258188c..34d8eb0823f4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2231,10 +2231,10 @@ BPF_CALL_4(bpf_msg_pull_data, struct sk_msg *, msg, u32, start,
 	/* First find the starting scatterlist element */
 	i = msg->sg.start;
 	do {
+		offset += len;
 		len = sk_msg_elem(msg, i)->length;
 		if (start < offset + len)
 			break;
-		offset += len;
 		sk_msg_iter_var_next(i);
 	} while (i != msg->sg.end);
 
@@ -2346,7 +2346,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 	   u32, len, u64, flags)
 {
 	struct scatterlist sge, nsge, nnsge, rsge = {0}, *psge;
-	u32 new, i = 0, l, space, copy = 0, offset = 0;
+	u32 new, i = 0, l = 0, space, copy = 0, offset = 0;
 	u8 *raw, *to, *from;
 	struct page *page;
 
@@ -2356,11 +2356,11 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 	/* First find the starting scatterlist element */
 	i = msg->sg.start;
 	do {
+		offset += l;
 		l = sk_msg_elem(msg, i)->length;
 
 		if (start < offset + l)
 			break;
-		offset += l;
 		sk_msg_iter_var_next(i);
 	} while (i != msg->sg.end);
 
@@ -2506,7 +2506,7 @@ static void sk_msg_shift_right(struct sk_msg *msg, int i)
 BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
 	   u32, len, u64, flags)
 {
-	u32 i = 0, l, space, offset = 0;
+	u32 i = 0, l = 0, space, offset = 0;
 	u64 last = start + len;
 	int pop;
 
@@ -2516,11 +2516,11 @@ BPF_CALL_4(bpf_msg_pop_data, struct sk_msg *, msg, u32, start,
 	/* First find the starting scatterlist element */
 	i = msg->sg.start;
 	do {
+		offset += l;
 		l = sk_msg_elem(msg, i)->length;
 
 		if (start < offset + l)
 			break;
-		offset += l;
 		sk_msg_iter_var_next(i);
 	} while (i != msg->sg.end);
 



* [bpf PATCH 5/9] bpf: sockmap/tls, msg_push_data may leave end mark in place
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (3 preceding siblings ...)
  2020-01-08 21:14 ` [bpf PATCH 4/9] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
@ 2020-01-08 21:15 ` John Fastabend
  2020-01-09 18:51   ` Song Liu
  2020-01-08 21:15 ` [bpf PATCH 6/9] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:15 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

Leaving an incorrect end mark in place when passing to the crypto
layer will cause the crypto layer to stop processing data before
all data is encrypted. To fix, clear the end mark on push
data instead of expecting users of the helper to clear the
mark value after the fact.

This happens when we push data into the middle of an skmsg and
have room for it, so we don't do the set of copies that would
already clear the end flag.

Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/filter.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 34d8eb0823f4..21d0190b5413 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2415,6 +2415,7 @@ BPF_CALL_4(bpf_msg_push_data, struct sk_msg *, msg, u32, start,
 
 		sk_msg_iter_var_next(i);
 		sg_unmark_end(psge);
+		sg_unmark_end(&rsge);
 		sk_msg_iter_next(msg, end);
 	}
 



* [bpf PATCH 6/9] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (4 preceding siblings ...)
  2020-01-08 21:15 ` [bpf PATCH 5/9] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
@ 2020-01-08 21:15 ` John Fastabend
  2020-01-09 23:04   ` Jonathan Lemon
  2020-01-08 21:15 ` [bpf PATCH 7/9] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:15 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

It is possible to build a plaintext buffer using the push helper that is
larger than the allocated encrypt buffer. When this record is pushed to the
crypto layers this can result in a NULL pointer dereference because the
crypto API expects the encrypt buffer to be large enough to fit the
plaintext buffer.

To resolve, catch the cases where this can happen and split the buffer into
two records to send individually. Unfortunately, there is still one case to
handle, where the split creates a zero-sized buffer. In this case we merge
the buffers and unmark the split. This happens when apply is zero and the
user pushed data beyond the encrypt buffer. This fixes the original case as
well, because the split allocated an encrypt buffer larger than the
plaintext buffer and the merge simply moves the pointers around so we now
have a reference to the new (larger) encrypt buffer.

Perhaps it's not ideal, but it seems the best solution for a fixes branch
and avoids handling these two cases separately: (a) apply that needs a
split and (b) the non-apply case. They are edge cases anyway, so optimizing
them seems unnecessary unless someone wants to later in the next branches.
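
The forced-split condition can be modeled stand-alone like this (all sizes
are hypothetical byte counts, not the kernel's sk_msg structs):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the overflow check added to tls_push_record(): if the
 * plaintext plus record overhead no longer fits the preallocated
 * encrypt buffer, force a split at the encrypt buffer boundary.
 */
static bool needs_forced_split(unsigned int pl_size, unsigned int split_point,
			       bool split, unsigned int overhead,
			       unsigned int en_size)
{
	if (!split)
		return pl_size + overhead > en_size;
	return split_point + overhead > en_size;
}
```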

Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/tls/tls_sw.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index c6803a82b769..31f6bbbc8992 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -682,12 +682,32 @@ static int tls_push_record(struct sock *sk, int flags,
 
 	split_point = msg_pl->apply_bytes;
 	split = split_point && split_point < msg_pl->sg.size;
+	if (unlikely((!split &&
+		      msg_pl->sg.size +
+		      prot->overhead_size > msg_en->sg.size) ||
+		     (split &&
+		      split_point +
+		      prot->overhead_size > msg_en->sg.size))) {
+		split = true;
+		split_point = msg_en->sg.size;
+	}
 	if (split) {
 		rc = tls_split_open_record(sk, rec, &tmp, msg_pl, msg_en,
 					   split_point, prot->overhead_size,
 					   &orig_end);
 		if (rc < 0)
 			return rc;
+		/* This can happen if above tls_split_open_record allocates
+		 * a single large encryption buffer instead of two smaller
+		 * ones. In this case adjust pointers and continue without
+		 * split.
+		 */
+		if (!msg_pl->sg.size) {
+			tls_merge_open_record(sk, rec, tmp, orig_end);
+			msg_pl = &rec->msg_plaintext;
+			msg_en = &rec->msg_encrypted;
+			split = false;
+		}
 		sk_msg_trim(sk, msg_en, msg_pl->sg.size +
 			    prot->overhead_size);
 	}



* [bpf PATCH 7/9] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (5 preceding siblings ...)
  2020-01-08 21:15 ` [bpf PATCH 6/9] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
@ 2020-01-08 21:15 ` John Fastabend
  2020-01-09 23:13   ` Jonathan Lemon
  2020-01-08 21:16 ` [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg John Fastabend
  2020-01-08 21:16 ` [bpf PATCH 9/9] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:15 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

It's possible, through a set of push, pop, and apply helper calls, to
construct an skmsg, which is just a ring of scatterlist elements, with the
start value larger than the end value. For example,

      end       start
  |_0_|_1_| ... |_n_|_n+1_|

where end points at 1 and start points at n, so that the set of valid
elements is {n, n+1, 0, 1}.

Currently, because we don't build the correct chain, only {n, n+1} will
be sent. This adds a check and an sg_chain() call to correctly submit the
above to the crypto and tls send path.
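
The wrapped-ring condition can be sketched stand-alone as follows (the
ring size and helpers are hypothetical, not the kernel's sk_msg layout):

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 18	/* hypothetical slot count for the element ring */

/* A wrapped skmsg ring has end < start; the live elements then span
 * [start, RING_SIZE) plus [0, end), which is why a second sg_chain()
 * call is needed to stitch the ring's tail back to its head.
 */
static bool ring_is_wrapped(unsigned int start, unsigned int end)
{
	return end < start;
}

static unsigned int ring_live_elems(unsigned int start, unsigned int end)
{
	return ring_is_wrapped(start, end) ?
	       (RING_SIZE - start) + end : end - start;
}
```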

Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/tls/tls_sw.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 31f6bbbc8992..21c7725d17ca 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -729,6 +729,12 @@ static int tls_push_record(struct sock *sk, int flags,
 		sg_mark_end(sk_msg_elem(msg_pl, i));
 	}
 
+	if (msg_pl->sg.end < msg_pl->sg.start) {
+		sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
+			 MAX_SKB_FRAGS - msg_pl->sg.start + 1,
+			 msg_pl->sg.data);
+	}
+
 	i = msg_pl->sg.start;
 	sg_chain(rec->sg_aead_in, 2, &msg_pl->sg.data[i]);
 



* [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (6 preceding siblings ...)
  2020-01-08 21:15 ` [bpf PATCH 7/9] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
@ 2020-01-08 21:16 ` John Fastabend
  2020-01-09 20:08   ` Song Liu
  2020-01-08 21:16 ` [bpf PATCH 9/9] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:16 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

When passed a zero-length skmsg, tls_push_record() causes a NULL pointer
dereference. To resolve for the fixes branch, do a simple length check at
the start of the routine.

To create this case a user can write a BPF program that pops all the
data off the message and then returns SK_PASS. It's not a very practical
or useful thing to do, so we mark it unlikely.

Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/tls/tls_sw.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 21c7725d17ca..0326e916ab01 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -680,6 +680,9 @@ static int tls_push_record(struct sock *sk, int flags,
 	msg_pl = &rec->msg_plaintext;
 	msg_en = &rec->msg_encrypted;
 
+	if (unlikely(!msg_pl->sg.size))
+		return 0;
+
 	split_point = msg_pl->apply_bytes;
 	split = split_point && split_point < msg_pl->sg.size;
 	if (unlikely((!split &&



* [bpf PATCH 9/9] bpf: sockmap/tls, fix pop data with SK_DROP return code
  2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
                   ` (7 preceding siblings ...)
  2020-01-08 21:16 ` [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg John Fastabend
@ 2020-01-08 21:16 ` John Fastabend
  2020-01-09 23:28   ` Jonathan Lemon
  8 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-08 21:16 UTC (permalink / raw)
  To: bpf; +Cc: netdev, john.fastabend, ast, daniel

When the user returns SK_DROP we need to reset the number of copied bytes
to indicate to the user that the bytes were dropped and not sent. If we
don't reset the copied arg, sendmsg will return as if those bytes were
copied, giving the user a positive return value.

This works as expected today except in the case where the user also
pops bytes. In the pop case sg.size is reduced, but we don't correctly
account for this when copied bytes is reset. The popped bytes are not
accounted for and we return a small positive value, potentially confusing
the user.

The reason this happens is a typo where we do the wrong comparison
when accounting for popped bytes. With this fix, notice the if/else is not
needed and that we have a similar problem if we push data, except it's not
visible to the user because if delta is larger than sg.size we return a
negative value, so it appears as an error regardless.
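
The corrected accounting reduces to a plain difference (sizes below are
hypothetical stand-ins for msg->sg.size before and after running the
verdict program, not the kernel code):

```c
#include <assert.h>

/* delta is the number of bytes the verdict program popped; on SK_DROP it
 * is subtracted from "copied" so dropped bytes aren't reported as sent.
 * The old code compared the wrong operands, so popped bytes were lost.
 */
static unsigned int verdict_delta(unsigned int size_before,
				  unsigned int size_after)
{
	return size_before - size_after;
}
```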

Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c |    5 +----
 net/tls/tls_sw.c   |    5 +----
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index e38705165ac9..587d55611814 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -315,10 +315,7 @@ static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
 		 */
 		delta = msg->sg.size;
 		psock->eval = sk_psock_msg_verdict(sk, psock, msg);
-		if (msg->sg.size < delta)
-			delta -= msg->sg.size;
-		else
-			delta = 0;
+		delta -= msg->sg.size;
 	}
 
 	if (msg->cork_bytes &&
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0326e916ab01..d9a757c0ba0c 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -812,10 +812,7 @@ static int bpf_exec_tx_verdict(struct sk_msg *msg, struct sock *sk,
 	if (psock->eval == __SK_NONE) {
 		delta = msg->sg.size;
 		psock->eval = sk_psock_msg_verdict(sk, psock, msg);
-		if (delta < msg->sg.size)
-			delta -= msg->sg.size;
-		else
-			delta = 0;
+		delta -= msg->sg.size;
 	}
 	if (msg->cork_bytes && msg->cork_bytes > msg->sg.size &&
 	    !enospc && !full_record) {



* Re: [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop
  2020-01-08 21:14 ` [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
@ 2020-01-09  1:34   ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2020-01-09  1:34 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

On Wed, Jan 8, 2020 at 1:14 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> When a sockmap is free'd and a socket in the map is enabled with tls
> we tear down the bpf context on the socket, the psock struct and state,
> and then call tcp_update_ulp(). The tcp_update_ulp() call is to inform
> the tls stack it needs to update its saved sock ops so that when the tls
> socket is later destroyed it doesn't try to call the now destroyed psock
> hooks.
>
> This is about keeping stacked ULPs in good shape so they always have
> the right set of stacked ops.
>
> However, recently unhash() hook was removed from TLS side. But, the
> sockmap/bpf side is not doing any extra work to update the unhash op
> when is torn down instead expecting TLS side to manage it. So both
> TLS and sockmap believe the other side is managing the op and instead
> no one updates the hook so it continues to point at tcp_bpf_unhash().
> When unhash hook is called we call tcp_bpf_unhash() which detects the
> psock has already been destroyed and calls sk->sk_prot_unhash() which
> calls tcp_bpf_unhash() yet again and so on looping and hanging the core.
>
> To fix have sockmap tear down logic fixup the stale pointer.
>
> Fixes: 5d92e631b8be ("net/tls: partially revert fix transition through disconnect with close")
> Reported-by: syzbot+83979935eb6304f8cd46@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Cc: stable@vger.kernel.org
Acked-by: Song Liu <songliubraving@fb.com>


* Re: [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-08 21:14 ` [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
@ 2020-01-09 10:33   ` Jakub Sitnicki
  2020-01-09 21:22     ` John Fastabend
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Sitnicki @ 2020-01-09 10:33 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, ast, daniel

On Wed, Jan 08, 2020 at 10:14 PM CET, John Fastabend wrote:
> When a sockmap sock with TLS enabled is removed we clean up the
> bpf/psock state and call tcp_update_ulp() to push updates to the TLS
> ULP on top. However, we don't push the write_space callback up and
> instead simply overwrite the op with the psock's stored previous op.
> This may or may not be correct, so to ensure we don't overwrite the
> TLS write_space hook, pass this field to the ULP and have it fix up
> the ctx.
>
> This completes a previous fix that pushed the ops through to the ULP
> but at the time missed doing this for write_space, presumably because
> the write_space TLS hook was added around the same time.
>
> Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  include/linux/skmsg.h |   12 ++++++++----
>  include/net/tcp.h     |    6 ++++--
>  net/ipv4/tcp_ulp.c    |    6 ++++--
>  net/tls/tls_main.c    |   10 +++++++---
>  4 files changed, 23 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index b6afe01f8592..14d61bba0b79 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -359,17 +359,21 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>  					  struct sk_psock *psock)
>  {
>  	sk->sk_prot->unhash = psock->saved_unhash;
> -	sk->sk_write_space = psock->saved_write_space;
>
>  	if (psock->sk_proto) {
>  		struct inet_connection_sock *icsk = inet_csk(sk);
>  		bool has_ulp = !!icsk->icsk_ulp_data;
>
> -		if (has_ulp)
> -			tcp_update_ulp(sk, psock->sk_proto);
> -		else
> +		if (has_ulp) {
> +			tcp_update_ulp(sk, psock->sk_proto,
> +				       psock->saved_write_space);
> +		} else {
>  			sk->sk_prot = psock->sk_proto;
> +			sk->sk_write_space = psock->saved_write_space;
> +		}

I'm wondering if we need the above fallback branch for the no-ULP case?
tcp_update_ulp() repeats the ULP check and has the same fallback. Perhaps
it can be reduced to:

	if (psock->sk_proto) {
		tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
		psock->sk_proto = NULL;
	} else {
		sk->sk_write_space = psock->saved_write_space;
	}

Then there's the question of whether it's okay to leave psock->sk_proto
set and potentially restore it more than once? Reading tls_update, the
only user of the ULP 'update' callback, it looks fine.

Can sk_psock_restore_proto be as simple as:

static inline void sk_psock_restore_proto(struct sock *sk,
					  struct sk_psock *psock)
{
	tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
}

... or am I missing something?

Asking because I have a patch [0] like this in the queue and haven't
seen issues with it during testing.

-jkbs

[0] https://github.com/jsitnicki/linux/commit/2d2152593c8e6c5f38548796501a81a6ba20b6dc

>  		psock->sk_proto = NULL;
> +	} else {
> +		sk->sk_write_space = psock->saved_write_space;
>  	}
>  }
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e460ea7f767b..e6f48384dc71 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2147,7 +2147,8 @@ struct tcp_ulp_ops {
>  	/* initialize ulp */
>  	int (*init)(struct sock *sk);
>  	/* update ulp */
> -	void (*update)(struct sock *sk, struct proto *p);
> +	void (*update)(struct sock *sk, struct proto *p,
> +		       void (*write_space)(struct sock *sk));
>  	/* cleanup ulp */
>  	void (*release)(struct sock *sk);
>  	/* diagnostic */
> @@ -2162,7 +2163,8 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
>  int tcp_set_ulp(struct sock *sk, const char *name);
>  void tcp_get_available_ulp(char *buf, size_t len);
>  void tcp_cleanup_ulp(struct sock *sk);
> -void tcp_update_ulp(struct sock *sk, struct proto *p);
> +void tcp_update_ulp(struct sock *sk, struct proto *p,
> +		    void (*write_space)(struct sock *sk));
>
>  #define MODULE_ALIAS_TCP_ULP(name)				\
>  	__MODULE_INFO(alias, alias_userspace, name);		\
> diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
> index 12ab5db2b71c..38d3ad141161 100644
> --- a/net/ipv4/tcp_ulp.c
> +++ b/net/ipv4/tcp_ulp.c
> @@ -99,17 +99,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
>  	rcu_read_unlock();
>  }
>
> -void tcp_update_ulp(struct sock *sk, struct proto *proto)
> +void tcp_update_ulp(struct sock *sk, struct proto *proto,
> +		    void (*write_space)(struct sock *sk))
>  {
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>
>  	if (!icsk->icsk_ulp_ops) {
> +		sk->sk_write_space = write_space;
>  		sk->sk_prot = proto;
>  		return;
>  	}
>
>  	if (icsk->icsk_ulp_ops->update)
> -		icsk->icsk_ulp_ops->update(sk, proto);
> +		icsk->icsk_ulp_ops->update(sk, proto, write_space);
>  }
>
>  void tcp_cleanup_ulp(struct sock *sk)
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index dac24c7aa7d4..94774c0e5ff3 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -732,15 +732,19 @@ static int tls_init(struct sock *sk)
>  	return rc;
>  }
>
> -static void tls_update(struct sock *sk, struct proto *p)
> +static void tls_update(struct sock *sk, struct proto *p,
> +		       void (*write_space)(struct sock *sk))
>  {
>  	struct tls_context *ctx;
>
>  	ctx = tls_get_ctx(sk);
> -	if (likely(ctx))
> +	if (likely(ctx)) {
> +		ctx->sk_write_space = write_space;
>  		ctx->sk_proto = p;
> -	else
> +	} else {
>  		sk->sk_prot = p;
> +		sk->sk_write_space = write_space;
> +	}
>  }
>
>  static int tls_get_info(const struct sock *sk, struct sk_buff *skb)

* Re: [bpf PATCH 2/9] bpf: sockmap, ensure sock lock held during tear down
  2020-01-08 21:14 ` [bpf PATCH 2/9] bpf: sockmap, ensure sock lock held during tear down John Fastabend
@ 2020-01-09 17:10   ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2020-01-09 17:10 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

On Wed, Jan 8, 2020 at 1:14 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> The sock_map_free() and sock_hash_free() paths used to delete sockmap
> and sockhash maps walk the maps and destroy psock and bpf state associated
> with the socks in the map. When done the socks no longer have BPF programs
> attached and will function normally. This can happen while the socks in
> the map are still "live" meaning data may be sent/received during the walk.
>
> Currently, though, we don't take the sock lock when the psock and bpf
> state is removed through this path. Specifically, this means we can be
> writing into the ops structure pointers such as sendmsg, sendpage,
> recvmsg, etc. while they are also being called from the networking
> side. This is not safe; we never used proper READ_ONCE/WRITE_ONCE
> semantics here, which we would have if we believed it was safe.
> Further, it's not clear to me it's even a good idea to try to do this
> on "live" sockets while the networking side might also be using the
> socket. Instead of trying to reason about using the socks from both
> sides, let's recognize that every use case I'm aware of rarely deletes
> maps; in fact, the kubernetes/Cilium case builds the map at init and
> never tears it down except on errors. So let's do the simple fix and
> grab the sock lock.
>
> This patch wraps sock deletes from maps in sock lock and adds some
> annotations so we catch any other cases easier.
>
> Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

* Re: [bpf PATCH 4/9] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds
  2020-01-08 21:14 ` [bpf PATCH 4/9] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
@ 2020-01-09 18:37   ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2020-01-09 18:37 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

On Wed, Jan 8, 2020 at 1:15 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> In the push, pull, and pop helpers operating on skmsg objects to make
> data writable or insert/remove data we use this bounds check to ensure
> specified data is valid,
>
>  /* Bounds checks: start and pop must be inside message */
>  if (start >= offset + l || last >= msg->sg.size)
>      return -EINVAL;
>
> The problem here is that offset already includes the length of the
> current element, the 'l' above. So start could be judged past the end
> of the scatterlist element in the case where start points into an
> offset on the last skmsg element.
>
> To fix, do the accounting slightly differently by adding the length of
> the previous entry to offset at the start of the iteration, and ensure
> it is initialized to zero so that the first iteration adds nothing.
>
> Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
> Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

This is pretty tricky... But it looks right.

Acked-by: Song Liu <songliubraving@fb.com>

* Re: [bpf PATCH 5/9] bpf: sockmap/tls, msg_push_data may leave end mark in place
  2020-01-08 21:15 ` [bpf PATCH 5/9] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
@ 2020-01-09 18:51   ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2020-01-09 18:51 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

On Wed, Jan 8, 2020 at 1:15 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Leaving an incorrect end mark in place when passing to the crypto
> layer will cause the crypto layer to stop processing data before all
> data is encrypted. To fix, clear the end mark when pushing data
> instead of expecting users of the helper to clear the mark value
> after the fact.
>
> This happens when we push data into the middle of an skmsg and have
> room for it, so we don't do the set of copies that would already
> clear the end flag.
>
> Fixes: 6fff607e2f14b ("bpf: sk_msg program helper bpf_msg_push_data")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Song Liu <songliubraving@fb.com>

* Re: [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg
  2020-01-08 21:16 ` [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg John Fastabend
@ 2020-01-09 20:08   ` Song Liu
  2020-01-09 21:25     ` John Fastabend
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2020-01-09 20:08 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

On Wed, Jan 8, 2020 at 1:17 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> When passed a zero-length skmsg, tls_push_record() causes a NULL ptr
> deref. To resolve, for the fixes branch, do a simple length check at
> the start of the routine.

Could you please include the stack dump for the NULL deref?

Thanks,
Song

>
> To create this case a user can write a BPF program that pops all the
> data off the message and then returns SK_PASS. It's not a very
> practical or useful thing to do, so we mark it unlikely.
>
> Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---
>  net/tls/tls_sw.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 21c7725d17ca..0326e916ab01 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -680,6 +680,9 @@ static int tls_push_record(struct sock *sk, int flags,
>         msg_pl = &rec->msg_plaintext;
>         msg_en = &rec->msg_encrypted;
>
> +       if (unlikely(!msg_pl->sg.size))
> +               return 0;
> +
>         split_point = msg_pl->apply_bytes;
>         split = split_point && split_point < msg_pl->sg.size;
>         if (unlikely((!split &&
>

* Re: [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-09 10:33   ` Jakub Sitnicki
@ 2020-01-09 21:22     ` John Fastabend
  2020-01-10 13:40       ` Jakub Sitnicki
  0 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-09 21:22 UTC (permalink / raw)
  To: Jakub Sitnicki, John Fastabend; +Cc: bpf, netdev, ast, daniel

Jakub Sitnicki wrote:
> On Wed, Jan 08, 2020 at 10:14 PM CET, John Fastabend wrote:
> > When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
> > and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
> > don't push the write_space callback up and instead simply overwrite the
> > op with the psock stored previous op. This may or may not be correct so
> > to ensure we don't overwrite the TLS write space hook pass this field to
> > the ULP and have it fixup the ctx.
> >
> > This completes a previous fix that pushed the ops through to the ULP
> > but at the time missed doing this for write_space, presumably because
> > write_space TLS hook was added around the same time.
> >
> > Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> > ---
> >  include/linux/skmsg.h |   12 ++++++++----
> >  include/net/tcp.h     |    6 ++++--
> >  net/ipv4/tcp_ulp.c    |    6 ++++--
> >  net/tls/tls_main.c    |   10 +++++++---
> >  4 files changed, 23 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> > index b6afe01f8592..14d61bba0b79 100644
> > --- a/include/linux/skmsg.h
> > +++ b/include/linux/skmsg.h
> > @@ -359,17 +359,21 @@ static inline void sk_psock_restore_proto(struct sock *sk,
> >  					  struct sk_psock *psock)
> >  {
> >  	sk->sk_prot->unhash = psock->saved_unhash;
> > -	sk->sk_write_space = psock->saved_write_space;
> >
> >  	if (psock->sk_proto) {
> >  		struct inet_connection_sock *icsk = inet_csk(sk);
> >  		bool has_ulp = !!icsk->icsk_ulp_data;
> >
> > -		if (has_ulp)
> > -			tcp_update_ulp(sk, psock->sk_proto);
> > -		else
> > +		if (has_ulp) {
> > +			tcp_update_ulp(sk, psock->sk_proto,
> > +				       psock->saved_write_space);
> > +		} else {
> >  			sk->sk_prot = psock->sk_proto;
> > +			sk->sk_write_space = psock->saved_write_space;
> > +		}
> 
> I'm wondering if we need the above fallback branch for no-ULP case?
> tcp_update_ulp repeats the ULP check and has the same fallback. Perhaps
> it can be reduced to:
> 
> 	if (psock->sk_proto) {
> 		tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
> 		psock->sk_proto = NULL;
> 	} else {
> 		sk->sk_write_space = psock->saved_write_space;
> 	}

Yeah that is a bit nicer. How about pushing it for bpf-next? I'm not
sure it's needed for bpf, and the patch I pushed is the minimal change
needed for the fix; it pushes the saved_write_space around.

> 
> Then there's the question if it's okay to leave psock->sk_proto set and
> potentially restore it more than once? Reading tls_update, the only user
> ULP 'update' callback, it looks fine.
> 
> Can sk_psock_restore_proto be as simple as:
> 
> static inline void sk_psock_restore_proto(struct sock *sk,
> 					  struct sk_psock *psock)
> {
> 	tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
> }
> 
> ... or am I missing something?

I think that is good. bpf-next?

> 
> Asking because I have a patch [0] like this in the queue and haven't
> seen issues with it during testing.

+1 Want to push it after we sort out this series?

> 
> -jkbs
> 
> [0] https://github.com/jsitnicki/linux/commit/2d2152593c8e6c5f38548796501a81a6ba20b6dc
> 
> >  		psock->sk_proto = NULL;
> > +	} else {
> > +		sk->sk_write_space = psock->saved_write_space;
> >  	}
> >  }
> >
> > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > index e460ea7f767b..e6f48384dc71 100644
> > --- a/include/net/tcp.h
> > +++ b/include/net/tcp.h
> > @@ -2147,7 +2147,8 @@ struct tcp_ulp_ops {
> >  	/* initialize ulp */
> >  	int (*init)(struct sock *sk);
> >  	/* update ulp */
> > -	void (*update)(struct sock *sk, struct proto *p);
> > +	void (*update)(struct sock *sk, struct proto *p,
> > +		       void (*write_space)(struct sock *sk));
> >  	/* cleanup ulp */
> >  	void (*release)(struct sock *sk);
> >  	/* diagnostic */
> > @@ -2162,7 +2163,8 @@ void tcp_unregister_ulp(struct tcp_ulp_ops *type);
> >  int tcp_set_ulp(struct sock *sk, const char *name);
> >  void tcp_get_available_ulp(char *buf, size_t len);
> >  void tcp_cleanup_ulp(struct sock *sk);
> > -void tcp_update_ulp(struct sock *sk, struct proto *p);
> > +void tcp_update_ulp(struct sock *sk, struct proto *p,
> > +		    void (*write_space)(struct sock *sk));
> >
> >  #define MODULE_ALIAS_TCP_ULP(name)				\
> >  	__MODULE_INFO(alias, alias_userspace, name);		\
> > diff --git a/net/ipv4/tcp_ulp.c b/net/ipv4/tcp_ulp.c
> > index 12ab5db2b71c..38d3ad141161 100644
> > --- a/net/ipv4/tcp_ulp.c
> > +++ b/net/ipv4/tcp_ulp.c
> > @@ -99,17 +99,19 @@ void tcp_get_available_ulp(char *buf, size_t maxlen)
> >  	rcu_read_unlock();
> >  }
> >
> > -void tcp_update_ulp(struct sock *sk, struct proto *proto)
> > +void tcp_update_ulp(struct sock *sk, struct proto *proto,
> > +		    void (*write_space)(struct sock *sk))
> >  {
> >  	struct inet_connection_sock *icsk = inet_csk(sk);
> >
> >  	if (!icsk->icsk_ulp_ops) {
> > +		sk->sk_write_space = write_space;
> >  		sk->sk_prot = proto;
> >  		return;
> >  	}
> >
> >  	if (icsk->icsk_ulp_ops->update)
> > -		icsk->icsk_ulp_ops->update(sk, proto);
> > +		icsk->icsk_ulp_ops->update(sk, proto, write_space);
> >  }
> >
> >  void tcp_cleanup_ulp(struct sock *sk)
> > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > index dac24c7aa7d4..94774c0e5ff3 100644
> > --- a/net/tls/tls_main.c
> > +++ b/net/tls/tls_main.c
> > @@ -732,15 +732,19 @@ static int tls_init(struct sock *sk)
> >  	return rc;
> >  }
> >
> > -static void tls_update(struct sock *sk, struct proto *p)
> > +static void tls_update(struct sock *sk, struct proto *p,
> > +		       void (*write_space)(struct sock *sk))
> >  {
> >  	struct tls_context *ctx;
> >
> >  	ctx = tls_get_ctx(sk);
> > -	if (likely(ctx))
> > +	if (likely(ctx)) {
> > +		ctx->sk_write_space = write_space;
> >  		ctx->sk_proto = p;
> > -	else
> > +	} else {
> >  		sk->sk_prot = p;
> > +		sk->sk_write_space = write_space;
> > +	}
> >  }
> >
> >  static int tls_get_info(const struct sock *sk, struct sk_buff *skb)



* Re: [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg
  2020-01-09 20:08   ` Song Liu
@ 2020-01-09 21:25     ` John Fastabend
  2020-01-10 23:20       ` John Fastabend
  0 siblings, 1 reply; 23+ messages in thread
From: John Fastabend @ 2020-01-09 21:25 UTC (permalink / raw)
  To: Song Liu, John Fastabend
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

Song Liu wrote:
> On Wed, Jan 8, 2020 at 1:17 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > When passed a zero length skmsg tls_push_record() causes a NULL ptr
> > deref. To resolve for fixes do a simple length check at start of
> > routine.
> 
> Could you please include the stack dump for the NULL deref?
> 
> Thanks,
> Song

Sure I'll send a v2 with the stack dump.

* Re: [bpf PATCH 6/9] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
  2020-01-08 21:15 ` [bpf PATCH 6/9] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
@ 2020-01-09 23:04   ` Jonathan Lemon
  0 siblings, 0 replies; 23+ messages in thread
From: Jonathan Lemon @ 2020-01-09 23:04 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, ast, daniel

On 8 Jan 2020, at 13:15, John Fastabend wrote:

> It is possible to build a plaintext buffer using push helper that is larger
> than the allocated encrypt buffer. When this record is pushed to crypto
> layers this can result in a NULL pointer dereference because the crypto
> API expects the encrypt buffer is large enough to fit the plaintext
> buffer.
>
> To resolve, catch the cases where this can happen and split the buffer
> into two records to send individually. Unfortunately, there is still
> one case to handle where the split creates a zero-sized buffer. In
> this case we merge the buffers and unmark the split. This happens when
> apply is zero and the user pushed data beyond the encrypt buffer. This
> fixes the original case as well, because the split allocated an
> encrypt buffer larger than the plaintext buffer and the merge simply
> moves the pointers around, so we now have a reference to the new
> (larger) encrypt buffer.
>
> Perhaps it's not ideal, but it seems the best solution for a fixes
> branch and avoids handling two cases separately: (a) an apply that
> needs a split and (b) the non-apply case. These are edge cases anyway,
> so optimizing them seems unnecessary unless someone wants to later in
> the next branches.
>
> Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

* Re: [bpf PATCH 7/9] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining
  2020-01-08 21:15 ` [bpf PATCH 7/9] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
@ 2020-01-09 23:13   ` Jonathan Lemon
  0 siblings, 0 replies; 23+ messages in thread
From: Jonathan Lemon @ 2020-01-09 23:13 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, ast, daniel



On 8 Jan 2020, at 13:15, John Fastabend wrote:

> It's possible, through a set of push, pop, and apply helper calls, to
> construct an skmsg, which is just a ring of scatterlist elements, with
> the start value larger than the end value. For example,
>
>       end       start
>   |_0_|_1_| ... |_n_|_n+1_|
>
> where end points at 1 and start points at n, so that the set of valid
> elements is {n, n+1, 0, 1}.
>
> Currently, because we don't build the correct chain, only {n, n+1}
> will be sent. This adds a check and an sg_chain() call to correctly
> submit the above to the crypto and TLS send path.
>
> Fixes: d3b18ad31f93d ("tls: add bpf support to sk_msg handling")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

* Re: [bpf PATCH 9/9] bpf: sockmap/tls, fix pop data with SK_DROP return code
  2020-01-08 21:16 ` [bpf PATCH 9/9] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
@ 2020-01-09 23:28   ` Jonathan Lemon
  0 siblings, 0 replies; 23+ messages in thread
From: Jonathan Lemon @ 2020-01-09 23:28 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, ast, daniel



On 8 Jan 2020, at 13:16, John Fastabend wrote:

> When user returns SK_DROP we need to reset the number of copied bytes
> to indicate to the user the bytes were dropped and not sent. If we
> don't reset the copied arg sendmsg will return as if those bytes were
> copied giving the user a positive return value.
>
> This works as expected today except in the case where the user also
> pops bytes. In the pop case the sg.size is reduced but we don't correctly
> account for this when copied bytes is reset. The popped bytes are not
> accounted for and we return a small positive value potentially confusing
> the user.
>
> The reason this happens is a typo where we do the wrong comparison
> when accounting for popped bytes. In this fix, notice the if/else is
> not needed and that we have a similar problem if we push data, except
> it's not visible to the user: if delta is larger than sg.size we
> return a negative value, so it appears as an error regardless.
>
> Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>

* Re: [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates
  2020-01-09 21:22     ` John Fastabend
@ 2020-01-10 13:40       ` Jakub Sitnicki
  0 siblings, 0 replies; 23+ messages in thread
From: Jakub Sitnicki @ 2020-01-10 13:40 UTC (permalink / raw)
  To: John Fastabend; +Cc: bpf, netdev, ast, daniel

On Thu, Jan 09, 2020 at 10:22 PM CET, John Fastabend wrote:
> Jakub Sitnicki wrote:
>> On Wed, Jan 08, 2020 at 10:14 PM CET, John Fastabend wrote:
>> > When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
>> > and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
>> > don't push the write_space callback up and instead simply overwrite the
>> > op with the psock stored previous op. This may or may not be correct so
>> > to ensure we don't overwrite the TLS write space hook pass this field to
>> > the ULP and have it fixup the ctx.
>> >
>> > This completes a previous fix that pushed the ops through to the ULP
>> > but at the time missed doing this for write_space, presumably because
>> > write_space TLS hook was added around the same time.
>> >
>> > Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
>> > Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> > ---
>> >  include/linux/skmsg.h |   12 ++++++++----
>> >  include/net/tcp.h     |    6 ++++--
>> >  net/ipv4/tcp_ulp.c    |    6 ++++--
>> >  net/tls/tls_main.c    |   10 +++++++---
>> >  4 files changed, 23 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
>> > index b6afe01f8592..14d61bba0b79 100644
>> > --- a/include/linux/skmsg.h
>> > +++ b/include/linux/skmsg.h
>> > @@ -359,17 +359,21 @@ static inline void sk_psock_restore_proto(struct sock *sk,
>> >  					  struct sk_psock *psock)
>> >  {
>> >  	sk->sk_prot->unhash = psock->saved_unhash;
>> > -	sk->sk_write_space = psock->saved_write_space;
>> >
>> >  	if (psock->sk_proto) {
>> >  		struct inet_connection_sock *icsk = inet_csk(sk);
>> >  		bool has_ulp = !!icsk->icsk_ulp_data;
>> >
>> > -		if (has_ulp)
>> > -			tcp_update_ulp(sk, psock->sk_proto);
>> > -		else
>> > +		if (has_ulp) {
>> > +			tcp_update_ulp(sk, psock->sk_proto,
>> > +				       psock->saved_write_space);
>> > +		} else {
>> >  			sk->sk_prot = psock->sk_proto;
>> > +			sk->sk_write_space = psock->saved_write_space;
>> > +		}
>>
>> I'm wondering if we need the above fallback branch for no-ULP case?
>> tcp_update_ulp repeats the ULP check and has the same fallback. Perhaps
>> it can be reduced to:
>>
>> 	if (psock->sk_proto) {
>> 		tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
>> 		psock->sk_proto = NULL;
>> 	} else {
>> 		sk->sk_write_space = psock->saved_write_space;
>> 	}
>
> Yeah that is a bit nicer. How about pushing it for bpf-next? I'm not
> sure its needed for bpf and the patch I pushed is the minimal change
> needed for the fix and pushes the saved_write_space around.

Yeah, this is bpf-next material.

>> Then there's the question if it's okay to leave psock->sk_proto set and
>> potentially restore it more than once? Reading tls_update, the only user
>> ULP 'update' callback, it looks fine.
>>
>> Can sk_psock_restore_proto be as simple as:
>>
>> static inline void sk_psock_restore_proto(struct sock *sk,
>> 					  struct sk_psock *psock)
>> {
>> 	tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
>> }
>>
>> ... or am I missing something?
>
> I think that is good. bpf-next?

Great, I needed to confirm my thinking.

>> Asking becuase I have a patch [0] like this in the queue and haven't
>> seen issues with it during testing.
>
> +1 Want to push it after we sort out this series?

I actually pushed it earlier today with the next iteration of "Extend
SOCKMAP to store listening sockets" to collect feedback [0]. I will
adapt it once it shows up in bpf-next (or split it out and submit it
separately).

-jkbs

[0] https://lore.kernel.org/bpf/20200110105027.257877-1-jakub@cloudflare.com/

* Re: [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg
  2020-01-09 21:25     ` John Fastabend
@ 2020-01-10 23:20       ` John Fastabend
  0 siblings, 0 replies; 23+ messages in thread
From: John Fastabend @ 2020-01-10 23:20 UTC (permalink / raw)
  To: John Fastabend, Song Liu, John Fastabend
  Cc: bpf, Networking, Alexei Starovoitov, Daniel Borkmann

John Fastabend wrote:
> Song Liu wrote:
> > On Wed, Jan 8, 2020 at 1:17 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > When passed a zero length skmsg tls_push_record() causes a NULL ptr
> > > deref. To resolve for fixes do a simple length check at start of
> > > routine.
> > 
> > Could you please include the stack dump for the NULL deref?
> > 
> > Thanks,
> > Song
> 
> Sure I'll send a v2 with the stack dump.

Hi Song, I'm having a bit of trouble reproducing this now. I'm going to
drop this patch from the series for now and see if something changed in
the crypto layers or elsewhere. I originally saw this on a somewhat
older kernel, so something might have changed. It feels a bit like a
workaround anyway, so I'll dig into it a bit more.

Either way I'll try and understand this a bit better. Thanks for the
question.

end of thread, back to index

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-08 21:13 [bpf PATCH 0/9] Fixes for sockmap/tls from more complex BPF progs John Fastabend
2020-01-08 21:14 ` [bpf PATCH 1/9] bpf: sockmap/tls, during free we may call tcp_bpf_unhash() in loop John Fastabend
2020-01-09  1:34   ` Song Liu
2020-01-08 21:14 ` [bpf PATCH 2/9] bpf: sockmap, ensure sock lock held during tear down John Fastabend
2020-01-09 17:10   ` Song Liu
2020-01-08 21:14 ` [bpf PATCH 3/9] bpf: sockmap/tls, push write_space updates through ulp updates John Fastabend
2020-01-09 10:33   ` Jakub Sitnicki
2020-01-09 21:22     ` John Fastabend
2020-01-10 13:40       ` Jakub Sitnicki
2020-01-08 21:14 ` [bpf PATCH 4/9] bpf: sockmap, skmsg helper overestimates push, pull, and pop bounds John Fastabend
2020-01-09 18:37   ` Song Liu
2020-01-08 21:15 ` [bpf PATCH 5/9] bpf: sockmap/tls, msg_push_data may leave end mark in place John Fastabend
2020-01-09 18:51   ` Song Liu
2020-01-08 21:15 ` [bpf PATCH 6/9] bpf: sockmap/tls, tls_sw can create a plaintext buf > encrypt buf John Fastabend
2020-01-09 23:04   ` Jonathan Lemon
2020-01-08 21:15 ` [bpf PATCH 7/9] bpf: sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining John Fastabend
2020-01-09 23:13   ` Jonathan Lemon
2020-01-08 21:16 ` [bpf PATCH 8/9] bpf: sockmap/tls, tls_push_record can not handle zero length skmsg John Fastabend
2020-01-09 20:08   ` Song Liu
2020-01-09 21:25     ` John Fastabend
2020-01-10 23:20       ` John Fastabend
2020-01-08 21:16 ` [bpf PATCH 9/9] bpf: sockmap/tls, fix pop data with SK_DROP return code John Fastabend
2020-01-09 23:28   ` Jonathan Lemon
