* [PATCH net-next v2 00/11] tls: rx: avoid skb_cow_data()
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

TLS calls skb_cow_data() on the skb it received from strparser
whenever it needs to hold onto the skb with the decrypted data.
(The alternative is decrypting directly to a user space buffer,
in which case the input skb is neither modified nor used afterwards.)
TLS needs the decrypted skb:
 - almost always with TLS 1.3 (unless the new NoPad is enabled);
 - when user space buffer is too small to fit the record;
 - when BPF sockmap is enabled.

Most of the time the skb we get out of strparser is a clone of
a 64kB data unit coalesced by GRO. To make things worse, skb_cow_data()
tries to output a linear skb and allocates it with GFP_ATOMIC.
This occasionally fails even under moderate memory pressure.

This patch set rejigs the TLS Rx so that we don't expect decryption
in place. The decryption handlers return an skb which may or may not
be the skb from strparser. For TLS 1.3 this results in a 20-30%
performance improvement without NoPad enabled.
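
To make the shape of the change concrete, a condensed sketch of the
before/after allocation strategy (illustrative only, not the literal
patch code):

	/* before: force the 64kB clone into one linear buffer so we
	 * can decrypt in place; the linear alloc inside skb_cow_data()
	 * is GFP_ATOMIC and prone to failure
	 */
	nsg = skb_cow_data(skb, 0, &unused);

	/* after (patch 11): decrypt into a freshly allocated frag'd
	 * skb; page-order-capped frags, sleeping allocation, and the
	 * input skb is left untouched
	 */
	clr_skb = alloc_skb_with_frags(0, full_len, TLS_PAGE_ORDER,
				       &err, sk->sk_allocation);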

v2: rebase after 3d8c51b25a23 ("net/tls: Check for errors in tls_device_init")

Jakub Kicinski (11):
  tls: rx: allow only one reader at a time
  tls: rx: don't try to keep the skbs always on the list
  tls: rx: don't keep decrypted skbs on ctx->recv_pkt
  tls: rx: remove the message decrypted tracking
  tls: rx: factor out device darg update
  tls: rx: read the input skb from ctx->recv_pkt
  tls: rx: return the decrypted skb via darg
  tls: rx: async: adjust record geometry immediately
  tls: rx: async: hold onto the input skb
  tls: rx: async: don't put async zc on the list
  tls: rx: decrypt into a fresh skb

 include/net/strparser.h |   1 -
 include/net/tls.h       |   4 +
 net/tls/Makefile        |   2 +-
 net/tls/tls.h           |  20 +-
 net/tls/tls_device.c    |  25 ++-
 net/tls/tls_strp.c      |  17 ++
 net/tls/tls_sw.c        | 458 ++++++++++++++++++++++++----------------
 7 files changed, 333 insertions(+), 194 deletions(-)
 create mode 100644 net/tls/tls_strp.c

-- 
2.36.1



* [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

recvmsg() in TLS gets data from the skb list (rx_list) or fresh
skbs we read from TCP via strparser. The former holds skbs which were
already decrypted for peek or decrypted and partially consumed.

tls_wait_data() only notices the appearance of fresh skbs coming out
of TCP (or psock). It is possible, if there are concurrent calls to
peek() and recv(), that the peek() will move the data from the input
to rx_list without recv() noticing. recv() will then read data out
of order or never wake up.

This is not a practical use case/concern, but it makes the
selftests less reliable. This patch solves the problem by allowing
only one reader in.

Because having multiple processes calling read()/peek() is not
normal, avoid adding a lock and try to fast-path the single-reader
case.
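
For illustration, a hypothetical user-space reproducer of the race
(sketch only; assumes fd is an established kTLS socket, error
handling omitted):

	#include <pthread.h>
	#include <sys/socket.h>

	static void *peeker(void *arg)
	{
		char buf[16384];

		/* may move a freshly decrypted record onto rx_list */
		recv(*(int *)arg, buf, sizeof(buf), MSG_PEEK);
		return NULL;
	}

	static void *reader(void *arg)
	{
		char buf[16384];

		/* without the reader lock this can sleep in
		 * tls_wait_data() even though the record already
		 * sits on rx_list, or see data out of order
		 */
		recv(*(int *)arg, buf, sizeof(buf), 0);
		return NULL;
	}

	/* main() would pthread_create() both against the same fd */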

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/tls.h |  3 +++
 net/tls/tls_sw.c  | 61 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 57 insertions(+), 7 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 8742e13bc362..e8935cfe0cd6 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -116,11 +116,14 @@ struct tls_sw_context_rx {
 	void (*saved_data_ready)(struct sock *sk);
 
 	struct sk_buff *recv_pkt;
+	u8 reader_present;
 	u8 async_capable:1;
 	u8 zc_capable:1;
+	u8 reader_contended:1;
 	atomic_t decrypt_pending;
 	/* protect crypto_wait with decrypt_pending*/
 	spinlock_t decrypt_compl_lock;
+	struct wait_queue_head wq;
 };
 
 struct tls_record_info {
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 68d79ee48a56..761a63751616 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1753,6 +1753,51 @@ tls_read_flush_backlog(struct sock *sk, struct tls_prot_info *prot,
 	sk_flush_backlog(sk);
 }
 
+static long tls_rx_reader_lock(struct sock *sk, struct tls_sw_context_rx *ctx,
+			       bool nonblock)
+{
+	long timeo;
+
+	lock_sock(sk);
+
+	timeo = sock_rcvtimeo(sk, nonblock);
+
+	while (unlikely(ctx->reader_present)) {
+		DEFINE_WAIT_FUNC(wait, woken_wake_function);
+
+		ctx->reader_contended = 1;
+
+		add_wait_queue(&ctx->wq, &wait);
+		sk_wait_event(sk, &timeo,
+			      !READ_ONCE(ctx->reader_present), &wait);
+		remove_wait_queue(&ctx->wq, &wait);
+
+		if (!timeo)
+			return -EAGAIN;
+		if (signal_pending(current))
+			return sock_intr_errno(timeo);
+	}
+
+	WRITE_ONCE(ctx->reader_present, 1);
+
+	return timeo;
+}
+
+static void tls_rx_reader_unlock(struct sock *sk, struct tls_sw_context_rx *ctx)
+{
+	if (unlikely(ctx->reader_contended)) {
+		if (wq_has_sleeper(&ctx->wq))
+			wake_up(&ctx->wq);
+		else
+			ctx->reader_contended = 0;
+
+		WARN_ON_ONCE(!ctx->reader_present);
+	}
+
+	WRITE_ONCE(ctx->reader_present, 0);
+	release_sock(sk);
+}
+
 int tls_sw_recvmsg(struct sock *sk,
 		   struct msghdr *msg,
 		   size_t len,
@@ -1782,7 +1827,9 @@ int tls_sw_recvmsg(struct sock *sk,
 		return sock_recv_errqueue(sk, msg, len, SOL_IP, IP_RECVERR);
 
 	psock = sk_psock_get(sk);
-	lock_sock(sk);
+	timeo = tls_rx_reader_lock(sk, ctx, flags & MSG_DONTWAIT);
+	if (timeo < 0)
+		return timeo;
 	bpf_strp_enabled = sk_psock_strp_enabled(psock);
 
 	/* If crypto failed the connection is broken */
@@ -1801,7 +1848,6 @@ int tls_sw_recvmsg(struct sock *sk,
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 	len = len - copied;
-	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 
 	zc_capable = !bpf_strp_enabled && !is_kvec && !is_peek &&
 		ctx->zc_capable;
@@ -1956,7 +2002,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	copied += decrypted;
 
 end:
-	release_sock(sk);
+	tls_rx_reader_unlock(sk, ctx);
 	if (psock)
 		sk_psock_put(sk, psock);
 	return copied ? : err;
@@ -1978,9 +2024,9 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	long timeo;
 	int chunk;
 
-	lock_sock(sk);
-
-	timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
+	timeo = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK);
+	if (timeo < 0)
+		return timeo;
 
 	from_queue = !skb_queue_empty(&ctx->rx_list);
 	if (from_queue) {
@@ -2029,7 +2075,7 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	}
 
 splice_read_end:
-	release_sock(sk);
+	tls_rx_reader_unlock(sk, ctx);
 	return copied ? : err;
 }
 
@@ -2371,6 +2417,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 	} else {
 		crypto_init_wait(&sw_ctx_rx->async_wait);
 		spin_lock_init(&sw_ctx_rx->decrypt_compl_lock);
+		init_waitqueue_head(&sw_ctx_rx->wq);
 		crypto_info = &ctx->crypto_recv.info;
 		cctx = &ctx->rx;
 		skb_queue_head_init(&sw_ctx_rx->rx_list);
-- 
2.36.1



* [PATCH net-next v2 02/11] tls: rx: don't try to keep the skbs always on the list
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

I thought that having the skb either always on the ctx->rx_list
or on ctx->recv_pkt would simplify the handling, as we would not
have to remember to flip it from one to the other on exit paths.

This became a little harder to justify with the fix for BPF
sockmaps. Subsequent changes will make the situation even worse.
Queue the skbs only when really needed.
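
A condensed sketch of the resulting flow, with the skb queued only
on the paths that actually keep it (simplified from the code below):

	if (async)
		goto put_on_rx_list;	/* decryption still in flight */

	err = skb_copy_datagram_msg(skb, rxm->offset, msg, chunk);
	if (err < 0) {
		__skb_queue_tail(&ctx->rx_list, skb);	/* keep for retry */
		goto recv_end;
	}

	if (is_peek || partially_consumed)
		goto put_on_rx_list;	/* data not consumed yet */

	consume_skb(skb);		/* fully consumed, free it */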

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_sw.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 761a63751616..acf65992aaca 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1861,8 +1861,11 @@ int tls_sw_recvmsg(struct sock *sk,
 			if (psock) {
 				chunk = sk_msg_recvmsg(sk, psock, msg, len,
 						       flags);
-				if (chunk > 0)
-					goto leave_on_list;
+				if (chunk > 0) {
+					decrypted += chunk;
+					len -= chunk;
+					continue;
+				}
 			}
 			goto recv_end;
 		}
@@ -1908,14 +1911,14 @@ int tls_sw_recvmsg(struct sock *sk,
 
 		ctx->recv_pkt = NULL;
 		__strp_unpause(&ctx->strp);
-		__skb_queue_tail(&ctx->rx_list, skb);
 
 		if (async) {
 			/* TLS 1.2-only, to_decrypt must be text length */
 			chunk = min_t(int, to_decrypt, len);
-leave_on_list:
+put_on_rx_list:
 			decrypted += chunk;
 			len -= chunk;
+			__skb_queue_tail(&ctx->rx_list, skb);
 			continue;
 		}
 		/* TLS 1.3 may have updated the length by more than overhead */
@@ -1925,8 +1928,6 @@ int tls_sw_recvmsg(struct sock *sk,
 			bool partially_consumed = chunk > len;
 
 			if (bpf_strp_enabled) {
-				/* BPF may try to queue the skb */
-				__skb_unlink(skb, &ctx->rx_list);
 				err = sk_psock_tls_strp_read(psock, skb);
 				if (err != __SK_PASS) {
 					rxm->offset = rxm->offset + rxm->full_len;
@@ -1935,7 +1936,6 @@ int tls_sw_recvmsg(struct sock *sk,
 						consume_skb(skb);
 					continue;
 				}
-				__skb_queue_tail(&ctx->rx_list, skb);
 			}
 
 			if (partially_consumed)
@@ -1943,23 +1943,24 @@ int tls_sw_recvmsg(struct sock *sk,
 
 			err = skb_copy_datagram_msg(skb, rxm->offset,
 						    msg, chunk);
-			if (err < 0)
+			if (err < 0) {
+				__skb_queue_tail(&ctx->rx_list, skb);
 				goto recv_end;
+			}
 
 			if (is_peek)
-				goto leave_on_list;
+				goto put_on_rx_list;
 
 			if (partially_consumed) {
 				rxm->offset += chunk;
 				rxm->full_len -= chunk;
-				goto leave_on_list;
+				goto put_on_rx_list;
 			}
 		}
 
 		decrypted += chunk;
 		len -= chunk;
 
-		__skb_unlink(skb, &ctx->rx_list);
 		consume_skb(skb);
 
 		/* Return full control message to userspace before trying
-- 
2.36.1



* [PATCH net-next v2 03/11] tls: rx: don't keep decrypted skbs on ctx->recv_pkt
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

Detach the skb from ctx->recv_pkt after decryption is done,
even if we can't consume it.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_sw.c | 49 +++++++++++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index acf65992aaca..f5f06d1ba024 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1648,6 +1648,12 @@ static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
 	return 1;
 }
 
+static void tls_rx_rec_done(struct tls_sw_context_rx *ctx)
+{
+	ctx->recv_pkt = NULL;
+	__strp_unpause(&ctx->strp);
+}
+
 /* This function traverses the rx_list in tls receive context to copies the
  * decrypted records into the buffer provided by caller zero copy is not
  * true. Further, the records are removed from the rx_list if it is not a peek
@@ -1902,15 +1908,20 @@ int tls_sw_recvmsg(struct sock *sk,
 		 * For tls1.3, we disable async.
 		 */
 		err = tls_record_content_type(msg, tlm, &control);
-		if (err <= 0)
+		if (err <= 0) {
+			tls_rx_rec_done(ctx);
+put_on_rx_list_err:
+			__skb_queue_tail(&ctx->rx_list, skb);
 			goto recv_end;
+		}
 
 		/* periodically flush backlog, and feed strparser */
 		tls_read_flush_backlog(sk, prot, len, to_decrypt,
 				       decrypted + copied, &flushed_at);
 
-		ctx->recv_pkt = NULL;
-		__strp_unpause(&ctx->strp);
+		/* TLS 1.3 may have updated the length by more than overhead */
+		chunk = rxm->full_len;
+		tls_rx_rec_done(ctx);
 
 		if (async) {
 			/* TLS 1.2-only, to_decrypt must be text length */
@@ -1921,8 +1932,6 @@ int tls_sw_recvmsg(struct sock *sk,
 			__skb_queue_tail(&ctx->rx_list, skb);
 			continue;
 		}
-		/* TLS 1.3 may have updated the length by more than overhead */
-		chunk = rxm->full_len;
 
 		if (!darg.zc) {
 			bool partially_consumed = chunk > len;
@@ -1943,10 +1952,8 @@ int tls_sw_recvmsg(struct sock *sk,
 
 			err = skb_copy_datagram_msg(skb, rxm->offset,
 						    msg, chunk);
-			if (err < 0) {
-				__skb_queue_tail(&ctx->rx_list, skb);
-				goto recv_end;
-			}
+			if (err < 0)
+				goto put_on_rx_list_err;
 
 			if (is_peek)
 				goto put_on_rx_list;
@@ -2020,7 +2027,6 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	struct tls_msg *tlm;
 	struct sk_buff *skb;
 	ssize_t copied = 0;
-	bool from_queue;
 	int err = 0;
 	long timeo;
 	int chunk;
@@ -2029,8 +2035,7 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	if (timeo < 0)
 		return timeo;
 
-	from_queue = !skb_queue_empty(&ctx->rx_list);
-	if (from_queue) {
+	if (!skb_queue_empty(&ctx->rx_list)) {
 		skb = __skb_dequeue(&ctx->rx_list);
 	} else {
 		struct tls_decrypt_arg darg = {};
@@ -2047,6 +2052,8 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 			tls_err_abort(sk, -EBADMSG);
 			goto splice_read_end;
 		}
+
+		tls_rx_rec_done(ctx);
 	}
 
 	rxm = strp_msg(skb);
@@ -2055,29 +2062,29 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	/* splice does not support reading control messages */
 	if (tlm->control != TLS_RECORD_TYPE_DATA) {
 		err = -EINVAL;
-		goto splice_read_end;
+		goto splice_requeue;
 	}
 
 	chunk = min_t(unsigned int, rxm->full_len, len);
 	copied = skb_splice_bits(skb, sk, rxm->offset, pipe, chunk, flags);
 	if (copied < 0)
-		goto splice_read_end;
+		goto splice_requeue;
 
-	if (!from_queue) {
-		ctx->recv_pkt = NULL;
-		__strp_unpause(&ctx->strp);
-	}
 	if (chunk < rxm->full_len) {
-		__skb_queue_head(&ctx->rx_list, skb);
 		rxm->offset += len;
 		rxm->full_len -= len;
-	} else {
-		consume_skb(skb);
+		goto splice_requeue;
 	}
 
+	consume_skb(skb);
+
 splice_read_end:
 	tls_rx_reader_unlock(sk, ctx);
 	return copied ? : err;
+
+splice_requeue:
+	__skb_queue_head(&ctx->rx_list, skb);
+	goto splice_read_end;
 }
 
 bool tls_sw_sock_is_readable(struct sock *sk)
-- 
2.36.1



* [PATCH net-next v2 04/11] tls: rx: remove the message decrypted tracking
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

We no longer allow a decrypted skb to remain linked to ctx->recv_pkt.
Anything on the list is decrypted; anything on ctx->recv_pkt still
needs to be decrypted.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/strparser.h |  1 -
 net/tls/tls_sw.c        | 10 ----------
 2 files changed, 11 deletions(-)

diff --git a/include/net/strparser.h b/include/net/strparser.h
index 88900b05443e..41e2ce9e9e10 100644
--- a/include/net/strparser.h
+++ b/include/net/strparser.h
@@ -72,7 +72,6 @@ struct sk_skb_cb {
 	/* strp users' data follows */
 	struct tls_msg {
 		u8 control;
-		u8 decrypted;
 	} tls;
 	/* temp_reg is a temporary register used for bpf_convert_data_end_access
 	 * when dst_reg == src_reg.
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f5f06d1ba024..49cfaa8119c6 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1563,21 +1563,13 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
 	struct strp_msg *rxm = strp_msg(skb);
-	struct tls_msg *tlm = tls_msg(skb);
 	int pad, err;
 
-	if (tlm->decrypted) {
-		darg->zc = false;
-		darg->async = false;
-		return 0;
-	}
-
 	if (tls_ctx->rx_conf == TLS_HW) {
 		err = tls_device_decrypted(sk, tls_ctx, skb, rxm);
 		if (err < 0)
 			return err;
 		if (err > 0) {
-			tlm->decrypted = 1;
 			darg->zc = false;
 			darg->async = false;
 			goto decrypt_done;
@@ -1610,7 +1602,6 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 	rxm->full_len -= pad;
 	rxm->offset += prot->prepend_size;
 	rxm->full_len -= prot->overhead_size;
-	tlm->decrypted = 1;
 decrypt_next:
 	tls_advance_record_sn(sk, prot, &tls_ctx->rx);
 
@@ -2130,7 +2121,6 @@ static int tls_read_size(struct strparser *strp, struct sk_buff *skb)
 	if (ret < 0)
 		goto read_failure;
 
-	tlm->decrypted = 0;
 	tlm->control = header[0];
 
 	data_len = ((header[4] & 0xFF) | (header[3] << 8));
-- 
2.36.1



* [PATCH net-next v2 05/11] tls: rx: factor out device darg update
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

I already forgot to transform darg from input to output
semantics once, on the NIC inline crypto fastpath. To
avoid this happening again create a device equivalent
of decrypt_internal(): a function responsible for decryption
and for transforming darg.

While at it rename decrypt_internal() to a hopefully slightly
more meaningful name.
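
The convention, as a sketch of the caller side (names as in this
patch; decrypt_skb_update() is renamed later in the series):

	struct tls_decrypt_arg darg = {};

	darg.zc = zc_capable;		 /* in:  zero-copy allowed */
	darg.async = ctx->async_capable; /* in:  async allowed     */

	err = decrypt_skb_update(sk, skb, &msg->msg_iter, &darg);
	/* out: darg.zc    - zero-copy actually performed
	 * out: darg.async - async crypto used / still in flight
	 */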

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_sw.c | 60 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 41 insertions(+), 19 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 49cfaa8119c6..5ef78e75c463 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1404,18 +1404,27 @@ static int tls_setup_from_iter(struct iov_iter *from,
 	return rc;
 }
 
+/* Decrypt handlers
+ *
+ * tls_decrypt_sg() and tls_decrypt_device() are decrypt handlers.
+ * They must transform the darg in/out argument as follows:
+ *       |          Input            |         Output
+ * -------------------------------------------------------------------
+ *    zc | Zero-copy decrypt allowed | Zero-copy performed
+ * async | Async decrypt allowed     | Async crypto used / in progress
+ */
+
 /* This function decrypts the input skb into either out_iov or in out_sg
- * or in skb buffers itself. The input parameter 'zc' indicates if
+ * or in skb buffers itself. The input parameter 'darg->zc' indicates if
  * zero-copy mode needs to be tried or not. With zero-copy mode, either
  * out_iov or out_sg must be non-NULL. In case both out_iov and out_sg are
  * NULL, then the decryption happens inside skb buffers itself, i.e.
- * zero-copy gets disabled and 'zc' is updated.
+ * zero-copy gets disabled and 'darg->zc' is updated.
  */
-
-static int decrypt_internal(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *out_iov,
-			    struct scatterlist *out_sg,
-			    struct tls_decrypt_arg *darg)
+static int tls_decrypt_sg(struct sock *sk, struct sk_buff *skb,
+			  struct iov_iter *out_iov,
+			  struct scatterlist *out_sg,
+			  struct tls_decrypt_arg *darg)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
@@ -1556,6 +1565,24 @@ static int decrypt_internal(struct sock *sk, struct sk_buff *skb,
 	return err;
 }
 
+static int
+tls_decrypt_device(struct sock *sk, struct tls_context *tls_ctx,
+		   struct sk_buff *skb, struct tls_decrypt_arg *darg)
+{
+	int err;
+
+	if (tls_ctx->rx_conf != TLS_HW)
+		return 0;
+
+	err = tls_device_decrypted(sk, tls_ctx, skb, strp_msg(skb));
+	if (err <= 0)
+		return err;
+
+	darg->zc = false;
+	darg->async = false;
+	return 1;
+}
+
 static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 			      struct iov_iter *dest,
 			      struct tls_decrypt_arg *darg)
@@ -1565,18 +1592,13 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 	struct strp_msg *rxm = strp_msg(skb);
 	int pad, err;
 
-	if (tls_ctx->rx_conf == TLS_HW) {
-		err = tls_device_decrypted(sk, tls_ctx, skb, rxm);
-		if (err < 0)
-			return err;
-		if (err > 0) {
-			darg->zc = false;
-			darg->async = false;
-			goto decrypt_done;
-		}
-	}
+	err = tls_decrypt_device(sk, tls_ctx, skb, darg);
+	if (err < 0)
+		return err;
+	if (err)
+		goto decrypt_done;
 
-	err = decrypt_internal(sk, skb, dest, NULL, darg);
+	err = tls_decrypt_sg(sk, skb, dest, NULL, darg);
 	if (err < 0) {
 		if (err == -EBADMSG)
 			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTERROR);
@@ -1613,7 +1635,7 @@ int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 {
 	struct tls_decrypt_arg darg = { .zc = true, };
 
-	return decrypt_internal(sk, skb, NULL, sgout, &darg);
+	return tls_decrypt_sg(sk, skb, NULL, sgout, &darg);
 }
 
 static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
-- 
2.36.1



* [PATCH net-next v2 06/11] tls: rx: read the input skb from ctx->recv_pkt
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

Callers always pass ctx->recv_pkt into decrypt_skb_update(),
and it propagates it to its callees. This may give someone
the false impression that those functions can accept any valid
skb containing a TLS record. That's not the case: the record
sequence number is read from the context, so they can only
take the next record coming out of the strp.

Let the functions get the skb from the context instead of
passing it in. This will also make it cleaner to return
a different skb than ctx->recv_pkt as the decrypted one
later on.

Since we're touching the definition of decrypt_skb_update()
use this as an opportunity to rename it.
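
At the recvmsg() call site the change boils down to (sketch):

	/* before: the skb argument suggested any record would do */
	err = decrypt_skb_update(sk, ctx->recv_pkt, &msg->msg_iter, &darg);

	/* after: the input is implicit, read via tls_strp_msg(ctx) */
	err = tls_rx_one_record(sk, &msg->msg_iter, &darg);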

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls.h        | 14 ++++++++------
 net/tls/tls_device.c | 25 ++++++++++++++++---------
 net/tls/tls_sw.c     | 37 ++++++++++++++++++-------------------
 3 files changed, 42 insertions(+), 34 deletions(-)

diff --git a/net/tls/tls.h b/net/tls/tls.h
index e0ccc96a0850..44522b221717 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -118,8 +118,7 @@ void tls_device_write_space(struct sock *sk, struct tls_context *ctx);
 
 int tls_process_cmsg(struct sock *sk, struct msghdr *msg,
 		     unsigned char *record_type);
-int decrypt_skb(struct sock *sk, struct sk_buff *skb,
-		struct scatterlist *sgout);
+int decrypt_skb(struct sock *sk, struct scatterlist *sgout);
 
 int tls_sw_fallback_init(struct sock *sk,
 			 struct tls_offload_context_tx *offload_ctx,
@@ -132,6 +131,11 @@ static inline struct tls_msg *tls_msg(struct sk_buff *skb)
 	return &scb->tls;
 }
 
+static inline struct sk_buff *tls_strp_msg(struct tls_sw_context_rx *ctx)
+{
+	return ctx->recv_pkt;
+}
+
 #ifdef CONFIG_TLS_DEVICE
 int tls_device_init(void);
 void tls_device_cleanup(void);
@@ -140,8 +144,7 @@ void tls_device_free_resources_tx(struct sock *sk);
 int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx);
 void tls_device_offload_cleanup_rx(struct sock *sk);
 void tls_device_rx_resync_new_rec(struct sock *sk, u32 rcd_len, u32 seq);
-int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx,
-			 struct sk_buff *skb, struct strp_msg *rxm);
+int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx);
 #else
 static inline int tls_device_init(void) { return 0; }
 static inline void tls_device_cleanup(void) {}
@@ -165,8 +168,7 @@ static inline void
 tls_device_rx_resync_new_rec(struct sock *sk, u32 rcd_len, u32 seq) {}
 
 static inline int
-tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx,
-		     struct sk_buff *skb, struct strp_msg *rxm)
+tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx)
 {
 	return 0;
 }
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 6f9dff1c11f6..6abbe3c2520c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -889,14 +889,19 @@ static void tls_device_core_ctrl_rx_resync(struct tls_context *tls_ctx,
 	}
 }
 
-static int tls_device_reencrypt(struct sock *sk, struct sk_buff *skb)
+static int
+tls_device_reencrypt(struct sock *sk, struct tls_sw_context_rx *sw_ctx)
 {
-	struct strp_msg *rxm = strp_msg(skb);
-	int err = 0, offset = rxm->offset, copy, nsg, data_len, pos;
-	struct sk_buff *skb_iter, *unused;
+	int err = 0, offset, copy, nsg, data_len, pos;
+	struct sk_buff *skb, *skb_iter, *unused;
 	struct scatterlist sg[1];
+	struct strp_msg *rxm;
 	char *orig_buf, *buf;
 
+	skb = tls_strp_msg(sw_ctx);
+	rxm = strp_msg(skb);
+	offset = rxm->offset;
+
 	orig_buf = kmalloc(rxm->full_len + TLS_HEADER_SIZE +
 			   TLS_CIPHER_AES_GCM_128_IV_SIZE, sk->sk_allocation);
 	if (!orig_buf)
@@ -919,7 +924,7 @@ static int tls_device_reencrypt(struct sock *sk, struct sk_buff *skb)
 		goto free_buf;
 
 	/* We are interested only in the decrypted data not the auth */
-	err = decrypt_skb(sk, skb, sg);
+	err = decrypt_skb(sk, sg);
 	if (err != -EBADMSG)
 		goto free_buf;
 	else
@@ -974,10 +979,12 @@ static int tls_device_reencrypt(struct sock *sk, struct sk_buff *skb)
 	return err;
 }
 
-int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx,
-			 struct sk_buff *skb, struct strp_msg *rxm)
+int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx)
 {
 	struct tls_offload_context_rx *ctx = tls_offload_ctx_rx(tls_ctx);
+	struct tls_sw_context_rx *sw_ctx = tls_sw_ctx_rx(tls_ctx);
+	struct sk_buff *skb = tls_strp_msg(sw_ctx);
+	struct strp_msg *rxm = strp_msg(skb);
 	int is_decrypted = skb->decrypted;
 	int is_encrypted = !is_decrypted;
 	struct sk_buff *skb_iter;
@@ -1000,7 +1007,7 @@ int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx,
 		 * likely have initial fragments decrypted, and final ones not
 		 * decrypted. We need to reencrypt that single SKB.
 		 */
-		return tls_device_reencrypt(sk, skb);
+		return tls_device_reencrypt(sk, sw_ctx);
 	}
 
 	/* Return immediately if the record is either entirely plaintext or
@@ -1017,7 +1024,7 @@ int tls_device_decrypted(struct sock *sk, struct tls_context *tls_ctx,
 	}
 
 	ctx->resync_nh_reset = 1;
-	return tls_device_reencrypt(sk, skb);
+	return tls_device_reencrypt(sk, sw_ctx);
 }
 
 static void tls_device_attach(struct tls_context *ctx, struct sock *sk,
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 5ef78e75c463..6205ad1a84c7 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1421,8 +1421,7 @@ static int tls_setup_from_iter(struct iov_iter *from,
  * NULL, then the decryption happens inside skb buffers itself, i.e.
  * zero-copy gets disabled and 'darg->zc' is updated.
  */
-static int tls_decrypt_sg(struct sock *sk, struct sk_buff *skb,
-			  struct iov_iter *out_iov,
+static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 			  struct scatterlist *out_sg,
 			  struct tls_decrypt_arg *darg)
 {
@@ -1430,6 +1429,7 @@ static int tls_decrypt_sg(struct sock *sk, struct sk_buff *skb,
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
 	int n_sgin, n_sgout, aead_size, err, pages = 0;
+	struct sk_buff *skb = tls_strp_msg(ctx);
 	struct strp_msg *rxm = strp_msg(skb);
 	struct tls_msg *tlm = tls_msg(skb);
 	struct aead_request *aead_req;
@@ -1567,14 +1567,14 @@ static int tls_decrypt_sg(struct sock *sk, struct sk_buff *skb,
 
 static int
 tls_decrypt_device(struct sock *sk, struct tls_context *tls_ctx,
-		   struct sk_buff *skb, struct tls_decrypt_arg *darg)
+		   struct tls_decrypt_arg *darg)
 {
 	int err;
 
 	if (tls_ctx->rx_conf != TLS_HW)
 		return 0;
 
-	err = tls_device_decrypted(sk, tls_ctx, skb, strp_msg(skb));
+	err = tls_device_decrypted(sk, tls_ctx);
 	if (err <= 0)
 		return err;
 
@@ -1583,22 +1583,22 @@ tls_decrypt_device(struct sock *sk, struct tls_context *tls_ctx,
 	return 1;
 }
 
-static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
-			      struct iov_iter *dest,
-			      struct tls_decrypt_arg *darg)
+static int tls_rx_one_record(struct sock *sk, struct iov_iter *dest,
+			     struct tls_decrypt_arg *darg)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
-	struct strp_msg *rxm = strp_msg(skb);
+	struct strp_msg *rxm;
 	int pad, err;
 
-	err = tls_decrypt_device(sk, tls_ctx, skb, darg);
+	err = tls_decrypt_device(sk, tls_ctx, darg);
 	if (err < 0)
 		return err;
 	if (err)
 		goto decrypt_done;
 
-	err = tls_decrypt_sg(sk, skb, dest, NULL, darg);
+	err = tls_decrypt_sg(sk, dest, NULL, darg);
 	if (err < 0) {
 		if (err == -EBADMSG)
 			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTERROR);
@@ -1613,14 +1613,15 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 		if (!darg->tail)
 			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSRXNOPADVIOL);
 		TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTRETRY);
-		return decrypt_skb_update(sk, skb, dest, darg);
+		return tls_rx_one_record(sk, dest, darg);
 	}
 
 decrypt_done:
-	pad = tls_padding_length(prot, skb, darg);
+	pad = tls_padding_length(prot, ctx->recv_pkt, darg);
 	if (pad < 0)
 		return pad;
 
+	rxm = strp_msg(ctx->recv_pkt);
 	rxm->full_len -= pad;
 	rxm->offset += prot->prepend_size;
 	rxm->full_len -= prot->overhead_size;
@@ -1630,12 +1631,11 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 	return 0;
 }
 
-int decrypt_skb(struct sock *sk, struct sk_buff *skb,
-		struct scatterlist *sgout)
+int decrypt_skb(struct sock *sk, struct scatterlist *sgout)
 {
 	struct tls_decrypt_arg darg = { .zc = true, };
 
-	return tls_decrypt_sg(sk, skb, NULL, sgout, &darg);
+	return tls_decrypt_sg(sk, NULL, sgout, &darg);
 }
 
 static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
@@ -1905,7 +1905,7 @@ int tls_sw_recvmsg(struct sock *sk,
 		else
 			darg.async = false;
 
-		err = decrypt_skb_update(sk, skb, &msg->msg_iter, &darg);
+		err = tls_rx_one_record(sk, &msg->msg_iter, &darg);
 		if (err < 0) {
 			tls_err_abort(sk, -EBADMSG);
 			goto recv_end;
@@ -2058,14 +2058,13 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 		if (err <= 0)
 			goto splice_read_end;
 
-		skb = ctx->recv_pkt;
-
-		err = decrypt_skb_update(sk, skb, NULL, &darg);
+		err = tls_rx_one_record(sk, NULL, &darg);
 		if (err < 0) {
 			tls_err_abort(sk, -EBADMSG);
 			goto splice_read_end;
 		}
 
+		skb = ctx->recv_pkt;
 		tls_rx_rec_done(ctx);
 	}
 
-- 
2.36.1



* [PATCH net-next v2 07/11] tls: rx: return the decrypted skb via darg
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

Instead of using ctx->recv_pkt after decryption, read the skb
from darg.skb. This moves the decision of what the "output skb"
is into the decrypt handlers. For now, after a decrypt handler
returns successfully, ctx->recv_pkt is simply moved to darg.skb,
but that will change soon.

Note that tls_decrypt_sg() cannot clear ctx->recv_pkt itself,
because it also gets called to re-encrypt (i.e. by the device
offload), so we need an awkward temporary if() in
tls_rx_one_record().
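
struct_group() is what keeps the memset in the diff safe as fields
get added; roughly, the macro expands to (simplified sketch):

	struct tls_decrypt_arg {
		union {
			struct {
				bool zc;
				bool async;
				u8 tail;
			};
			struct {	/* same members, but      */
				bool zc;	/* addressable as */
				bool async;	/* a single group */
				u8 tail;
			} inargs;
		};
		struct sk_buff *skb;	/* output only, not cleared */
	};

	/* callers zero just the inputs, leaving darg.skb alone: */
	memset(&darg.inargs, 0, sizeof(darg.inargs));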

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_sw.c | 49 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 6205ad1a84c7..6a9875456f84 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -47,9 +47,13 @@
 #include "tls.h"
 
 struct tls_decrypt_arg {
+	struct_group(inargs,
 	bool zc;
 	bool async;
 	u8 tail;
+	);
+
+	struct sk_buff *skb;
 };
 
 struct tls_decrypt_ctx {
@@ -1412,6 +1416,7 @@ static int tls_setup_from_iter(struct iov_iter *from,
  * -------------------------------------------------------------------
  *    zc | Zero-copy decrypt allowed | Zero-copy performed
  * async | Async decrypt allowed     | Async crypto used / in progress
+ *   skb |            *              | Output skb
  */
 
 /* This function decrypts the input skb into either out_iov or in out_sg
@@ -1551,12 +1556,17 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 	/* Prepare and submit AEAD request */
 	err = tls_do_decryption(sk, skb, sgin, sgout, dctx->iv,
 				data_len + prot->tail_size, aead_req, darg);
+	if (err)
+		goto exit_free_pages;
+
+	darg->skb = tls_strp_msg(ctx);
 	if (darg->async)
 		return 0;
 
 	if (prot->tail_size)
 		darg->tail = dctx->tail;
 
+exit_free_pages:
 	/* Release the pages in case iov was mapped to pages */
 	for (; pages > 0; pages--)
 		put_page(sg_page(&sgout[pages]));
@@ -1569,6 +1579,7 @@ static int
 tls_decrypt_device(struct sock *sk, struct tls_context *tls_ctx,
 		   struct tls_decrypt_arg *darg)
 {
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	int err;
 
 	if (tls_ctx->rx_conf != TLS_HW)
@@ -1580,6 +1591,8 @@ tls_decrypt_device(struct sock *sk, struct tls_context *tls_ctx,
 
 	darg->zc = false;
 	darg->async = false;
+	darg->skb = tls_strp_msg(ctx);
+	ctx->recv_pkt = NULL;
 	return 1;
 }
 
@@ -1604,8 +1617,11 @@ static int tls_rx_one_record(struct sock *sk, struct iov_iter *dest,
 			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTERROR);
 		return err;
 	}
-	if (darg->async)
+	if (darg->async) {
+		if (darg->skb == ctx->recv_pkt)
+			ctx->recv_pkt = NULL;
 		goto decrypt_next;
+	}
 	/* If opportunistic TLS 1.3 ZC failed retry without ZC */
 	if (unlikely(darg->zc && prot->version == TLS_1_3_VERSION &&
 		     darg->tail != TLS_RECORD_TYPE_DATA)) {
@@ -1616,12 +1632,17 @@ static int tls_rx_one_record(struct sock *sk, struct iov_iter *dest,
 		return tls_rx_one_record(sk, dest, darg);
 	}
 
+	if (darg->skb == ctx->recv_pkt)
+		ctx->recv_pkt = NULL;
+
 decrypt_done:
-	pad = tls_padding_length(prot, ctx->recv_pkt, darg);
-	if (pad < 0)
+	pad = tls_padding_length(prot, darg->skb, darg);
+	if (pad < 0) {
+		consume_skb(darg->skb);
 		return pad;
+	}
 
-	rxm = strp_msg(ctx->recv_pkt);
+	rxm = strp_msg(darg->skb);
 	rxm->full_len -= pad;
 	rxm->offset += prot->prepend_size;
 	rxm->full_len -= prot->overhead_size;
@@ -1663,6 +1684,7 @@ static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
 
 static void tls_rx_rec_done(struct tls_sw_context_rx *ctx)
 {
+	consume_skb(ctx->recv_pkt);
 	ctx->recv_pkt = NULL;
 	__strp_unpause(&ctx->strp);
 }
@@ -1872,7 +1894,7 @@ int tls_sw_recvmsg(struct sock *sk,
 		ctx->zc_capable;
 	decrypted = 0;
 	while (len && (decrypted + copied < target || ctx->recv_pkt)) {
-		struct tls_decrypt_arg darg = {};
+		struct tls_decrypt_arg darg;
 		int to_decrypt, chunk;
 
 		err = tls_rx_rec_wait(sk, psock, flags & MSG_DONTWAIT, timeo);
@@ -1889,9 +1911,10 @@ int tls_sw_recvmsg(struct sock *sk,
 			goto recv_end;
 		}
 
-		skb = ctx->recv_pkt;
-		rxm = strp_msg(skb);
-		tlm = tls_msg(skb);
+		memset(&darg.inargs, 0, sizeof(darg.inargs));
+
+		rxm = strp_msg(ctx->recv_pkt);
+		tlm = tls_msg(ctx->recv_pkt);
 
 		to_decrypt = rxm->full_len - prot->overhead_size;
 
@@ -1911,6 +1934,10 @@ int tls_sw_recvmsg(struct sock *sk,
 			goto recv_end;
 		}
 
+		skb = darg.skb;
+		rxm = strp_msg(skb);
+		tlm = tls_msg(skb);
+
 		async |= darg.async;
 
 		/* If the type of records being processed is not known yet,
@@ -2051,21 +2078,23 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 	if (!skb_queue_empty(&ctx->rx_list)) {
 		skb = __skb_dequeue(&ctx->rx_list);
 	} else {
-		struct tls_decrypt_arg darg = {};
+		struct tls_decrypt_arg darg;
 
 		err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK,
 				      timeo);
 		if (err <= 0)
 			goto splice_read_end;
 
+		memset(&darg.inargs, 0, sizeof(darg.inargs));
+
 		err = tls_rx_one_record(sk, NULL, &darg);
 		if (err < 0) {
 			tls_err_abort(sk, -EBADMSG);
 			goto splice_read_end;
 		}
 
-		skb = ctx->recv_pkt;
 		tls_rx_rec_done(ctx);
+		skb = darg.skb;
 	}
 
 	rxm = strp_msg(skb);
-- 
2.36.1



* [PATCH net-next v2 08/11] tls: rx: async: adjust record geometry immediately
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

Async crypto TLS Rx currently waits for crypto to be done
in order to strip the TLS header and trailer. Simplify
the code by moving the pointers immediately; since only
TLS 1.2 is supported here, there is no message padding.

This simplifies decrypting into a new skb in the next
patch, as we no longer have to worry about input vs output
skb in the decrypt_done() handler.
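
The geometry adjustment itself is simple pointer math; for TLS 1.2
with AES-GCM-128 the numbers work out as follows (worked example):

	/* on-wire record:
	 *   5B header | 8B explicit nonce | ciphertext | 16B tag
	 * prot->prepend_size  =  5 + 8  = 13
	 * prot->overhead_size = 13 + 16 = 29
	 */
	rxm->offset   += prot->prepend_size;	/* skip header + nonce */
	rxm->full_len -= prot->overhead_size;	/* drop header and tag */
	/* no trailing padding to trim - async is TLS 1.2 only */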

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_sw.c | 49 ++++++++++--------------------------------------
 1 file changed, 10 insertions(+), 39 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 6a9875456f84..09fe2cfff51a 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -184,39 +184,22 @@ static void tls_decrypt_done(struct crypto_async_request *req, int err)
 	struct scatterlist *sgin = aead_req->src;
 	struct tls_sw_context_rx *ctx;
 	struct tls_context *tls_ctx;
-	struct tls_prot_info *prot;
 	struct scatterlist *sg;
-	struct sk_buff *skb;
 	unsigned int pages;
+	struct sock *sk;
 
-	skb = (struct sk_buff *)req->data;
-	tls_ctx = tls_get_ctx(skb->sk);
+	sk = (struct sock *)req->data;
+	tls_ctx = tls_get_ctx(sk);
 	ctx = tls_sw_ctx_rx(tls_ctx);
-	prot = &tls_ctx->prot_info;
 
 	/* Propagate if there was an err */
 	if (err) {
 		if (err == -EBADMSG)
-			TLS_INC_STATS(sock_net(skb->sk),
-				      LINUX_MIB_TLSDECRYPTERROR);
+			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTERROR);
 		ctx->async_wait.err = err;
-		tls_err_abort(skb->sk, err);
-	} else {
-		struct strp_msg *rxm = strp_msg(skb);
-
-		/* No TLS 1.3 support with async crypto */
-		WARN_ON(prot->tail_size);
-
-		rxm->offset += prot->prepend_size;
-		rxm->full_len -= prot->overhead_size;
+		tls_err_abort(sk, err);
 	}
 
-	/* After using skb->sk to propagate sk through crypto async callback
-	 * we need to NULL it again.
-	 */
-	skb->sk = NULL;
-
-
 	/* Free the destination pages if skb was not decrypted inplace */
 	if (sgout != sgin) {
 		/* Skip the first S/G entry as it points to AAD */
@@ -236,7 +219,6 @@ static void tls_decrypt_done(struct crypto_async_request *req, int err)
 }
 
 static int tls_do_decryption(struct sock *sk,
-			     struct sk_buff *skb,
 			     struct scatterlist *sgin,
 			     struct scatterlist *sgout,
 			     char *iv_recv,
@@ -256,16 +238,9 @@ static int tls_do_decryption(struct sock *sk,
 			       (u8 *)iv_recv);
 
 	if (darg->async) {
-		/* Using skb->sk to push sk through to crypto async callback
-		 * handler. This allows propagating errors up to the socket
-		 * if needed. It _must_ be cleared in the async handler
-		 * before consume_skb is called. We _know_ skb->sk is NULL
-		 * because it is a clone from strparser.
-		 */
-		skb->sk = sk;
 		aead_request_set_callback(aead_req,
 					  CRYPTO_TFM_REQ_MAY_BACKLOG,
-					  tls_decrypt_done, skb);
+					  tls_decrypt_done, sk);
 		atomic_inc(&ctx->decrypt_pending);
 	} else {
 		aead_request_set_callback(aead_req,
@@ -1554,7 +1529,7 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 	}
 
 	/* Prepare and submit AEAD request */
-	err = tls_do_decryption(sk, skb, sgin, sgout, dctx->iv,
+	err = tls_do_decryption(sk, sgin, sgout, dctx->iv,
 				data_len + prot->tail_size, aead_req, darg);
 	if (err)
 		goto exit_free_pages;
@@ -1617,11 +1592,8 @@ static int tls_rx_one_record(struct sock *sk, struct iov_iter *dest,
 			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSDECRYPTERROR);
 		return err;
 	}
-	if (darg->async) {
-		if (darg->skb == ctx->recv_pkt)
-			ctx->recv_pkt = NULL;
-		goto decrypt_next;
-	}
+	if (darg->async)
+		goto decrypt_done;
 	/* If opportunistic TLS 1.3 ZC failed retry without ZC */
 	if (unlikely(darg->zc && prot->version == TLS_1_3_VERSION &&
 		     darg->tail != TLS_RECORD_TYPE_DATA)) {
@@ -1632,10 +1604,10 @@ static int tls_rx_one_record(struct sock *sk, struct iov_iter *dest,
 		return tls_rx_one_record(sk, dest, darg);
 	}
 
+decrypt_done:
 	if (darg->skb == ctx->recv_pkt)
 		ctx->recv_pkt = NULL;
 
-decrypt_done:
 	pad = tls_padding_length(prot, darg->skb, darg);
 	if (pad < 0) {
 		consume_skb(darg->skb);
@@ -1646,7 +1618,6 @@ static int tls_rx_one_record(struct sock *sk, struct iov_iter *dest,
 	rxm->full_len -= pad;
 	rxm->offset += prot->prepend_size;
 	rxm->full_len -= prot->overhead_size;
-decrypt_next:
 	tls_advance_record_sn(sk, prot, &tls_ctx->rx);
 
 	return 0;
-- 
2.36.1



* [PATCH net-next v2 09/11] tls: rx: async: hold onto the input skb
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

Async crypto currently benefits from the fact that we decrypt
in place. When we allow input and output to be different skbs,
we will have to hang onto the input while we move on to the next
record. Clone the inputs and keep them on a list.
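
A clone is sufficient because it shares the data buffer with the
original; sketch of the lifecycle using the helper this patch adds:

	/* submit: pin the ciphertext pages referenced by the
	 * in-flight AEAD request
	 */
	err = tls_strp_msg_hold(sk, skb, &ctx->async_hold);

	/* ... recvmsg() moves on to further records ... */

	/* once decrypt_pending drains to zero, drop all held clones */
	__skb_queue_purge(&ctx->async_hold);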

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/tls.h  |  1 +
 net/tls/Makefile   |  2 +-
 net/tls/tls.h      |  3 +++
 net/tls/tls_strp.c | 17 +++++++++++++++++
 net/tls/tls_sw.c   | 26 +++++++++++++++++---------
 5 files changed, 39 insertions(+), 10 deletions(-)
 create mode 100644 net/tls/tls_strp.c

diff --git a/include/net/tls.h b/include/net/tls.h
index e8935cfe0cd6..181c496b01b8 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -123,6 +123,7 @@ struct tls_sw_context_rx {
 	atomic_t decrypt_pending;
 	/* protect crypto_wait with decrypt_pending*/
 	spinlock_t decrypt_compl_lock;
+	struct sk_buff_head async_hold;
 	struct wait_queue_head wq;
 };
 
diff --git a/net/tls/Makefile b/net/tls/Makefile
index f1ffbfe8968d..e41c800489ac 100644
--- a/net/tls/Makefile
+++ b/net/tls/Makefile
@@ -7,7 +7,7 @@ CFLAGS_trace.o := -I$(src)
 
 obj-$(CONFIG_TLS) += tls.o
 
-tls-y := tls_main.o tls_sw.o tls_proc.o trace.o
+tls-y := tls_main.o tls_sw.o tls_proc.o trace.o tls_strp.o
 
 tls-$(CONFIG_TLS_TOE) += tls_toe.o
 tls-$(CONFIG_TLS_DEVICE) += tls_device.o tls_device_fallback.o
diff --git a/net/tls/tls.h b/net/tls/tls.h
index 44522b221717..c818dc68955d 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -124,6 +124,9 @@ int tls_sw_fallback_init(struct sock *sk,
 			 struct tls_offload_context_tx *offload_ctx,
 			 struct tls_crypto_info *crypto_info);
 
+int tls_strp_msg_hold(struct sock *sk, struct sk_buff *skb,
+		      struct sk_buff_head *dst);
+
 static inline struct tls_msg *tls_msg(struct sk_buff *skb)
 {
 	struct sk_skb_cb *scb = (struct sk_skb_cb *)skb->cb;
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
new file mode 100644
index 000000000000..9ccab79a6e1e
--- /dev/null
+++ b/net/tls/tls_strp.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/skbuff.h>
+
+#include "tls.h"
+
+int tls_strp_msg_hold(struct sock *sk, struct sk_buff *skb,
+		      struct sk_buff_head *dst)
+{
+	struct sk_buff *clone;
+
+	clone = skb_clone(skb, sk->sk_allocation);
+	if (!clone)
+		return -ENOMEM;
+	__skb_queue_tail(dst, clone);
+	return 0;
+}
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 09fe2cfff51a..f767501e178d 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1535,8 +1535,13 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 		goto exit_free_pages;
 
 	darg->skb = tls_strp_msg(ctx);
-	if (darg->async)
-		return 0;
+
+	if (unlikely(darg->async)) {
+		err = tls_strp_msg_hold(sk, skb, &ctx->async_hold);
+		if (err)
+			__skb_queue_tail(&ctx->async_hold, darg->skb);
+		return err;
+	}
 
 	if (prot->tail_size)
 		darg->tail = dctx->tail;
@@ -1998,14 +2003,16 @@ int tls_sw_recvmsg(struct sock *sk,
 		reinit_completion(&ctx->async_wait.completion);
 		pending = atomic_read(&ctx->decrypt_pending);
 		spin_unlock_bh(&ctx->decrypt_compl_lock);
-		if (pending) {
+		ret = 0;
+		if (pending)
 			ret = crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
-			if (ret) {
-				if (err >= 0 || err == -EINPROGRESS)
-					err = ret;
-				decrypted = 0;
-				goto end;
-			}
+		__skb_queue_purge(&ctx->async_hold);
+
+		if (ret) {
+			if (err >= 0 || err == -EINPROGRESS)
+				err = ret;
+			decrypted = 0;
+			goto end;
 		}
 
 		/* Drain records from the rx_list & copy if required */
@@ -2440,6 +2447,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		crypto_info = &ctx->crypto_recv.info;
 		cctx = &ctx->rx;
 		skb_queue_head_init(&sw_ctx_rx->rx_list);
+		skb_queue_head_init(&sw_ctx_rx->async_hold);
 		aead = &sw_ctx_rx->aead_recv;
 	}
 
-- 
2.36.1



* [PATCH net-next v2 10/11] tls: rx: async: don't put async zc on the list
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

The "zero-copy" path in SW TLS will engage either for no skbs or
for all but last. If the recvmsg parameters are right and the
socket can do ZC we'll ZC until the iterator can't fit a full
record at which point we'll decrypt one more record and copy
over the necessary bits to fill up the request.

The only reason we hold onto the ZC skbs which went thru the async
path until the end of recvmsg() is to count bytes. We need an accurate
count of zc'ed bytes so that we can calculate how much of the non-zc'd
data to copy. To allow freeing input skbs on the ZC path count only
how much of the list we'll need to consume.
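
A worked example of the accounting (illustrative numbers):

	/* recvmsg() for 10k, three 4k records arrive, ZC possible:
	 *   rec 1: 4k zero-copied straight into msg, not queued
	 *   rec 2: 4k zero-copied straight into msg, not queued
	 *   rec 3: only 2k of iterator left -> non-ZC, queued on
	 *          rx_list; chunk = min(4k, 2k), async_copy_bytes = 2k
	 * after async completion process_rx_list() copies exactly
	 * async_copy_bytes (2k); the 8k that was ZC'd needs no copy,
	 * so those input skbs no longer have to be kept around.
	 */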

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls_sw.c | 40 +++++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f767501e178d..1c9a0705ee63 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1675,7 +1675,6 @@ static int process_rx_list(struct tls_sw_context_rx *ctx,
 			   u8 *control,
 			   size_t skip,
 			   size_t len,
-			   bool zc,
 			   bool is_peek)
 {
 	struct sk_buff *skb = skb_peek(&ctx->rx_list);
@@ -1709,12 +1708,10 @@ static int process_rx_list(struct tls_sw_context_rx *ctx,
 		if (err <= 0)
 			goto out;
 
-		if (!zc || (rxm->full_len - skip) > len) {
-			err = skb_copy_datagram_msg(skb, rxm->offset + skip,
-						    msg, chunk);
-			if (err < 0)
-				goto out;
-		}
+		err = skb_copy_datagram_msg(skb, rxm->offset + skip,
+					    msg, chunk);
+		if (err < 0)
+			goto out;
 
 		len = len - chunk;
 		copied = copied + chunk;
@@ -1824,9 +1821,9 @@ int tls_sw_recvmsg(struct sock *sk,
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
+	ssize_t decrypted = 0, async_copy_bytes = 0;
 	struct sk_psock *psock;
 	unsigned char control = 0;
-	ssize_t decrypted = 0;
 	size_t flushed_at = 0;
 	struct strp_msg *rxm;
 	struct tls_msg *tlm;
@@ -1855,7 +1852,7 @@ int tls_sw_recvmsg(struct sock *sk,
 		goto end;
 
 	/* Process pending decrypted records. It must be non-zero-copy */
-	err = process_rx_list(ctx, msg, &control, 0, len, false, is_peek);
+	err = process_rx_list(ctx, msg, &control, 0, len, is_peek);
 	if (err < 0)
 		goto end;
 
@@ -1939,19 +1936,20 @@ int tls_sw_recvmsg(struct sock *sk,
 		chunk = rxm->full_len;
 		tls_rx_rec_done(ctx);
 
-		if (async) {
-			/* TLS 1.2-only, to_decrypt must be text length */
-			chunk = min_t(int, to_decrypt, len);
-put_on_rx_list:
-			decrypted += chunk;
-			len -= chunk;
-			__skb_queue_tail(&ctx->rx_list, skb);
-			continue;
-		}
-
 		if (!darg.zc) {
 			bool partially_consumed = chunk > len;
 
+			if (async) {
+				/* TLS 1.2-only, to_decrypt must be text len */
+				chunk = min_t(int, to_decrypt, len);
+				async_copy_bytes += chunk;
+put_on_rx_list:
+				decrypted += chunk;
+				len -= chunk;
+				__skb_queue_tail(&ctx->rx_list, skb);
+				continue;
+			}
+
 			if (bpf_strp_enabled) {
 				err = sk_psock_tls_strp_read(psock, skb);
 				if (err != __SK_PASS) {
@@ -2018,10 +2016,10 @@ int tls_sw_recvmsg(struct sock *sk,
 		/* Drain records from the rx_list & copy if required */
 		if (is_peek || is_kvec)
 			err = process_rx_list(ctx, msg, &control, copied,
-					      decrypted, false, is_peek);
+					      decrypted, is_peek);
 		else
 			err = process_rx_list(ctx, msg, &control, 0,
-					      decrypted, true, is_peek);
+					      async_copy_bytes, is_peek);
 		decrypted = max(err, 0);
 	}
 
-- 
2.36.1



* [PATCH net-next v2 11/11] tls: rx: decrypt into a fresh skb
From: Jakub Kicinski @ 2022-07-15  5:22 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko, Jakub Kicinski

We currently CoW Rx skbs whenever we can't decrypt to a user
space buffer. The skbs can be enormous (64kB) and CoW does
a linear alloc which has a strong chance of failing under
memory pressure. Even without memory pressure, skb_cow_data()
assumes GFP_ATOMIC.

Allocate a new frag'd skb and decrypt into it. We finally
take advantage of the decrypted skb getting returned via
darg.
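
On a typical 4kB-page system the TLS_PAGE_ORDER cap works out to
(worked example):

	/* TLS_MAX_PAYLOAD_SIZE = 16384, PAGE_SHIFT = 12:
	 *   16384 >> 12             = 4
	 *   PAGE_ALLOC_COSTLY_ORDER = 3
	 *   TLS_PAGE_ORDER = min(3, 4) = 3, i.e. frags up to 32kB
	 * so a full record is assembled from a few modest frags
	 * instead of one huge linear GFP_ATOMIC allocation.
	 */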

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/tls/tls.h    |   3 ++
 net/tls/tls_sw.c | 106 ++++++++++++++++++++++++++++++-----------------
 2 files changed, 72 insertions(+), 37 deletions(-)

diff --git a/net/tls/tls.h b/net/tls/tls.h
index c818dc68955d..3740740504e3 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -39,6 +39,9 @@
 #include <linux/skmsg.h>
 #include <net/tls.h>
 
+#define TLS_PAGE_ORDER	(min_t(unsigned int, PAGE_ALLOC_COSTLY_ORDER,	\
+			       TLS_MAX_PAYLOAD_SIZE >> PAGE_SHIFT))
+
 #define __TLS_INC_STATS(net, field)				\
 	__SNMP_INC_STATS((net)->mib.tls_statistics, field)
 #define TLS_INC_STATS(net, field)				\
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 1c9a0705ee63..859ea02022c0 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1383,6 +1383,29 @@ static int tls_setup_from_iter(struct iov_iter *from,
 	return rc;
 }
 
+static struct sk_buff *
+tls_alloc_clrtxt_skb(struct sock *sk, struct sk_buff *skb,
+		     unsigned int full_len)
+{
+	struct strp_msg *clr_rxm;
+	struct sk_buff *clr_skb;
+	int err;
+
+	clr_skb = alloc_skb_with_frags(0, full_len, TLS_PAGE_ORDER,
+				       &err, sk->sk_allocation);
+	if (!clr_skb)
+		return NULL;
+
+	skb_copy_header(clr_skb, skb);
+	clr_skb->len = full_len;
+	clr_skb->data_len = full_len;
+
+	clr_rxm = strp_msg(clr_skb);
+	clr_rxm->offset = 0;
+
+	return clr_skb;
+}
+
 /* Decrypt handlers
  *
  * tls_decrypt_sg() and tls_decrypt_device() are decrypt handlers.
@@ -1410,34 +1433,40 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
 	int n_sgin, n_sgout, aead_size, err, pages = 0;
 	struct sk_buff *skb = tls_strp_msg(ctx);
-	struct strp_msg *rxm = strp_msg(skb);
-	struct tls_msg *tlm = tls_msg(skb);
+	const struct strp_msg *rxm = strp_msg(skb);
+	const struct tls_msg *tlm = tls_msg(skb);
 	struct aead_request *aead_req;
-	struct sk_buff *unused;
 	struct scatterlist *sgin = NULL;
 	struct scatterlist *sgout = NULL;
 	const int data_len = rxm->full_len - prot->overhead_size;
 	int tail_pages = !!prot->tail_size;
 	struct tls_decrypt_ctx *dctx;
+	struct sk_buff *clear_skb;
 	int iv_offset = 0;
 	u8 *mem;
 
+	n_sgin = skb_nsg(skb, rxm->offset + prot->prepend_size,
+			 rxm->full_len - prot->prepend_size);
+	if (n_sgin < 1)
+		return n_sgin ?: -EBADMSG;
+
 	if (darg->zc && (out_iov || out_sg)) {
+		clear_skb = NULL;
+
 		if (out_iov)
 			n_sgout = 1 + tail_pages +
 				iov_iter_npages_cap(out_iov, INT_MAX, data_len);
 		else
 			n_sgout = sg_nents(out_sg);
-		n_sgin = skb_nsg(skb, rxm->offset + prot->prepend_size,
-				 rxm->full_len - prot->prepend_size);
 	} else {
-		n_sgout = 0;
 		darg->zc = false;
-		n_sgin = skb_cow_data(skb, 0, &unused);
-	}
 
-	if (n_sgin < 1)
-		return -EBADMSG;
+		clear_skb = tls_alloc_clrtxt_skb(sk, skb, rxm->full_len);
+		if (!clear_skb)
+			return -ENOMEM;
+
+		n_sgout = 1 + skb_shinfo(clear_skb)->nr_frags;
+	}
 
 	/* Increment to accommodate AAD */
 	n_sgin = n_sgin + 1;
@@ -1449,8 +1478,10 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 	aead_size = sizeof(*aead_req) + crypto_aead_reqsize(ctx->aead_recv);
 	mem = kmalloc(aead_size + struct_size(dctx, sg, n_sgin + n_sgout),
 		      sk->sk_allocation);
-	if (!mem)
-		return -ENOMEM;
+	if (!mem) {
+		err = -ENOMEM;
+		goto exit_free_skb;
+	}
 
 	/* Segment the allocated memory */
 	aead_req = (struct aead_request *)mem;
@@ -1499,33 +1530,31 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 	if (err < 0)
 		goto exit_free;
 
-	if (n_sgout) {
-		if (out_iov) {
-			sg_init_table(sgout, n_sgout);
-			sg_set_buf(&sgout[0], dctx->aad, prot->aad_size);
+	if (clear_skb) {
+		sg_init_table(sgout, n_sgout);
+		sg_set_buf(&sgout[0], dctx->aad, prot->aad_size);
 
-			err = tls_setup_from_iter(out_iov, data_len,
-						  &pages, &sgout[1],
-						  (n_sgout - 1 - tail_pages));
-			if (err < 0)
-				goto fallback_to_reg_recv;
+		err = skb_to_sgvec(clear_skb, &sgout[1], prot->prepend_size,
+				   data_len + prot->tail_size);
+		if (err < 0)
+			goto exit_free;
+	} else if (out_iov) {
+		sg_init_table(sgout, n_sgout);
+		sg_set_buf(&sgout[0], dctx->aad, prot->aad_size);
 
-			if (prot->tail_size) {
-				sg_unmark_end(&sgout[pages]);
-				sg_set_buf(&sgout[pages + 1], &dctx->tail,
-					   prot->tail_size);
-				sg_mark_end(&sgout[pages + 1]);
-			}
-		} else if (out_sg) {
-			memcpy(sgout, out_sg, n_sgout * sizeof(*sgout));
-		} else {
-			goto fallback_to_reg_recv;
+		err = tls_setup_from_iter(out_iov, data_len, &pages, &sgout[1],
+					  (n_sgout - 1 - tail_pages));
+		if (err < 0)
+			goto exit_free_pages;
+
+		if (prot->tail_size) {
+			sg_unmark_end(&sgout[pages]);
+			sg_set_buf(&sgout[pages + 1], &dctx->tail,
+				   prot->tail_size);
+			sg_mark_end(&sgout[pages + 1]);
 		}
-	} else {
-fallback_to_reg_recv:
-		sgout = sgin;
-		pages = 0;
-		darg->zc = false;
+	} else if (out_sg) {
+		memcpy(sgout, out_sg, n_sgout * sizeof(*sgout));
 	}
 
 	/* Prepare and submit AEAD request */
@@ -1534,7 +1563,8 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 	if (err)
 		goto exit_free_pages;
 
-	darg->skb = tls_strp_msg(ctx);
+	darg->skb = clear_skb ?: tls_strp_msg(ctx);
+	clear_skb = NULL;
 
 	if (unlikely(darg->async)) {
 		err = tls_strp_msg_hold(sk, skb, &ctx->async_hold);
@@ -1552,6 +1582,8 @@ static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
 		put_page(sg_page(&sgout[pages]));
 exit_free:
 	kfree(mem);
+exit_free_skb:
+	consume_skb(clear_skb);
 	return err;
 }
 
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread
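
A quick worked example of the TLS_PAGE_ORDER cap introduced above,
assuming 4kB pages (PAGE_SHIFT == 12), a TLS_MAX_PAYLOAD_SIZE of 16kB
and the usual PAGE_ALLOC_COSTLY_ORDER of 3:

    TLS_MAX_PAYLOAD_SIZE >> PAGE_SHIFT  =  16384 >> 12  =  4
    min(PAGE_ALLOC_COSTLY_ORDER, 4)     =  min(3, 4)    =  3

so each frag of the clear-text skb comes from an allocation of at
most order 3 (8 pages, 32kB), which the page allocator does not yet
treat as costly. alloc_skb_with_frags() also falls back to lower
orders when a high-order allocation fails, which is what makes this
path much more robust than one linear 64kB allocation.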

* Re: [PATCH net-next v2 00/11] tls: rx: avoid skb_cow_data()
  2022-07-15  5:22 [PATCH net-next v2 00/11] tls: rx: avoid skb_cow_data() Jakub Kicinski
                   ` (10 preceding siblings ...)
  2022-07-15  5:22 ` [PATCH net-next v2 11/11] tls: rx: decrypt into a fresh skb Jakub Kicinski
@ 2022-07-18 10:40 ` patchwork-bot+netdevbpf
  11 siblings, 0 replies; 17+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-07-18 10:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, borisp, john.fastabend, maximmi,
	tariqt, vfedorenko

Hello:

This series was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Thu, 14 Jul 2022 22:22:24 -0700 you wrote:
> TLS calls skb_cow_data() on the skb it received from strparser
> whenever it needs to hold onto the skb with the decrypted data.
> (The alternative being decrypting directly to a user space buffer
> in which case the input skb doesn't get modified or used after.)
> TLS needs the decrypted skb:
>  - almost always with TLS 1.3 (unless the new NoPad is enabled);
>  - when user space buffer is too small to fit the record;
>  - when BPF sockmap is enabled.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,01/11] tls: rx: allow only one reader at a time
    https://git.kernel.org/netdev/net-next/c/4cbc325ed6b4
  - [net-next,v2,02/11] tls: rx: don't try to keep the skbs always on the list
    https://git.kernel.org/netdev/net-next/c/008141de8557
  - [net-next,v2,03/11] tls: rx: don't keep decrypted skbs on ctx->recv_pkt
    https://git.kernel.org/netdev/net-next/c/abb47dc95dc6
  - [net-next,v2,04/11] tls: rx: remove the message decrypted tracking
    https://git.kernel.org/netdev/net-next/c/53d57999fe02
  - [net-next,v2,05/11] tls: rx: factor out device darg update
    https://git.kernel.org/netdev/net-next/c/8a958732818b
  - [net-next,v2,06/11] tls: rx: read the input skb from ctx->recv_pkt
    https://git.kernel.org/netdev/net-next/c/541cc48be3b1
  - [net-next,v2,07/11] tls: rx: return the decrypted skb via darg
    https://git.kernel.org/netdev/net-next/c/6bd116c8c654
  - [net-next,v2,08/11] tls: rx: async: adjust record geometry immediately
    https://git.kernel.org/netdev/net-next/c/6ececdc51369
  - [net-next,v2,09/11] tls: rx: async: hold onto the input skb
    https://git.kernel.org/netdev/net-next/c/c618db2afe7c
  - [net-next,v2,10/11] tls: rx: async: don't put async zc on the list
    https://git.kernel.org/netdev/net-next/c/cbbdee9918a2
  - [net-next,v2,11/11] tls: rx: decrypt into a fresh skb
    https://git.kernel.org/netdev/net-next/c/fd31f3996af2

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time
  2022-07-15  5:22 ` [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time Jakub Kicinski
@ 2022-07-20  8:37   ` Eric Dumazet
  2022-07-20 16:59     ` Jakub Kicinski
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2022-07-20  8:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, netdev, Paolo Abeni, Boris Pismenny,
	John Fastabend, Maxim Mikityanskiy, Tariq Toukan, vfedorenko

On Fri, Jul 15, 2022 at 7:22 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> recvmsg() in TLS gets data from the skb list (rx_list) or fresh
> skbs we read from TCP via strparser. The former holds skbs which were
> already decrypted for peek or decrypted and partially consumed.
>
> tls_wait_data() only notices appearance of fresh skbs coming out
> of TCP (or psock). It is possible, if there is a concurrent call
> to peek() and recv(), that the peek() will move the data from input
> to rx_list without recv() noticing. recv() will then read data out
> of order or never wake up.
>
> This is not a practical use case/concern, but it makes the self
> tests less reliable. This patch solves the problem by allowing
> only one reader in.
>
> Because having multiple processes calling read()/peek() is not
> normal, avoid adding a lock and try to fast-path the single reader
> case.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/tls.h |  3 +++
>  net/tls/tls_sw.c  | 61 +++++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 57 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 8742e13bc362..e8935cfe0cd6 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -116,11 +116,14 @@ struct tls_sw_context_rx {
>         void (*saved_data_ready)(struct sock *sk);
>
>         struct sk_buff *recv_pkt;
> +       u8 reader_present;
>         u8 async_capable:1;
>         u8 zc_capable:1;
> +       u8 reader_contended:1;
>         atomic_t decrypt_pending;
>         /* protect crypto_wait with decrypt_pending*/
>         spinlock_t decrypt_compl_lock;
> +       struct wait_queue_head wq;
>  };
>
>  struct tls_record_info {
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 68d79ee48a56..761a63751616 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -1753,6 +1753,51 @@ tls_read_flush_backlog(struct sock *sk, struct tls_prot_info *prot,
>         sk_flush_backlog(sk);
>  }
>
> +static long tls_rx_reader_lock(struct sock *sk, struct tls_sw_context_rx *ctx,
> +                              bool nonblock)
> +{
> +       long timeo;
> +
> +       lock_sock(sk);
> +
> +       timeo = sock_rcvtimeo(sk, nonblock);
> +
> +       while (unlikely(ctx->reader_present)) {
> +               DEFINE_WAIT_FUNC(wait, woken_wake_function);
> +
> +               ctx->reader_contended = 1;
> +
> +               add_wait_queue(&ctx->wq, &wait);
> +               sk_wait_event(sk, &timeo,
> +                             !READ_ONCE(ctx->reader_present), &wait);
> +               remove_wait_queue(&ctx->wq, &wait);
> +
> +               if (!timeo)
> +                       return -EAGAIN;

We return with socket lock held, and callers seem to not release the lock.

> +               if (signal_pending(current))
> +                       return sock_intr_errno(timeo);

same here.

Let's wait for syzbot to catch up :)

> +       }
> +
> +       WRITE_ONCE(ctx->reader_present, 1);
> +
> +       return timeo;
> +}
> +
> +static void tls_rx_reader_unlock(struct sock *sk, struct tls_sw_context_rx *ctx)
> +{
> +       if (unlikely(ctx->reader_contended)) {
> +               if (wq_has_sleeper(&ctx->wq))
> +                       wake_up(&ctx->wq);
> +               else
> +                       ctx->reader_contended = 0;
> +
> +               WARN_ON_ONCE(!ctx->reader_present);
> +       }
> +
> +       WRITE_ONCE(ctx->reader_present, 0);
> +       release_sock(sk);
> +}
> +
>  int tls_sw_recvmsg(struct sock *sk,
>                    struct msghdr *msg,
>                    size_t len,
> @@ -1782,7 +1827,9 @@ int tls_sw_recvmsg(struct sock *sk,
>                 return sock_recv_errqueue(sk, msg, len, SOL_IP, IP_RECVERR);
>
>         psock = sk_psock_get(sk);
> -       lock_sock(sk);
> +       timeo = tls_rx_reader_lock(sk, ctx, flags & MSG_DONTWAIT);
> +       if (timeo < 0)
> +               return timeo;
>         bpf_strp_enabled = sk_psock_strp_enabled(psock);
>
>         /* If crypto failed the connection is broken */
> @@ -1801,7 +1848,6 @@ int tls_sw_recvmsg(struct sock *sk,
>
>         target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
>         len = len - copied;
> -       timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
>
>         zc_capable = !bpf_strp_enabled && !is_kvec && !is_peek &&
>                 ctx->zc_capable;
> @@ -1956,7 +2002,7 @@ int tls_sw_recvmsg(struct sock *sk,
>         copied += decrypted;
>
>  end:
> -       release_sock(sk);
> +       tls_rx_reader_unlock(sk, ctx);
>         if (psock)
>                 sk_psock_put(sk, psock);
>         return copied ? : err;
> @@ -1978,9 +2024,9 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
>         long timeo;
>         int chunk;
>
> -       lock_sock(sk);
> -
> -       timeo = sock_rcvtimeo(sk, flags & SPLICE_F_NONBLOCK);
> +       timeo = tls_rx_reader_lock(sk, ctx, flags & SPLICE_F_NONBLOCK);
> +       if (timeo < 0)
> +               return timeo;
>
>         from_queue = !skb_queue_empty(&ctx->rx_list);
>         if (from_queue) {
> @@ -2029,7 +2075,7 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
>         }
>
>  splice_read_end:
> -       release_sock(sk);
> +       tls_rx_reader_unlock(sk, ctx);
>         return copied ? : err;
>  }
>
> @@ -2371,6 +2417,7 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
>         } else {
>                 crypto_init_wait(&sw_ctx_rx->async_wait);
>                 spin_lock_init(&sw_ctx_rx->decrypt_compl_lock);
> +               init_waitqueue_head(&sw_ctx_rx->wq);
>                 crypto_info = &ctx->crypto_recv.info;
>                 cctx = &ctx->rx;
>                 skb_queue_head_init(&sw_ctx_rx->rx_list);
> --
> 2.36.1
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
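
For reference, a minimal sketch of the kind of fix Eric's review
calls for, releasing the socket lock on both early-return paths. It
is based on the quoted patch above; the actual follow-up commit may
differ in detail:

    static long tls_rx_reader_lock(struct sock *sk,
                                   struct tls_sw_context_rx *ctx,
                                   bool nonblock)
    {
            long timeo, err;

            lock_sock(sk);
            timeo = sock_rcvtimeo(sk, nonblock);

            while (unlikely(ctx->reader_present)) {
                    DEFINE_WAIT_FUNC(wait, woken_wake_function);

                    ctx->reader_contended = 1;

                    add_wait_queue(&ctx->wq, &wait);
                    sk_wait_event(sk, &timeo,
                                  !READ_ONCE(ctx->reader_present), &wait);
                    remove_wait_queue(&ctx->wq, &wait);

                    if (!timeo) {
                            err = -EAGAIN;
                            goto err_unlock;
                    }
                    if (signal_pending(current)) {
                            err = sock_intr_errno(timeo);
                            goto err_unlock;
                    }
            }

            WRITE_ONCE(ctx->reader_present, 1);

            return timeo;

    err_unlock:
            /* don't leave with the socket lock held on error */
            release_sock(sk);
            return err;
    }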

* Re: [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time
  2022-07-20  8:37   ` Eric Dumazet
@ 2022-07-20 16:59     ` Jakub Kicinski
  2022-07-20 17:09       ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Jakub Kicinski @ 2022-07-20 16:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Paolo Abeni, Boris Pismenny,
	John Fastabend, Maxim Mikityanskiy, Tariq Toukan, vfedorenko

On Wed, 20 Jul 2022 10:37:02 +0200 Eric Dumazet wrote:
> > +               if (!timeo)
> > +                       return -EAGAIN;  
> 
> We return with socket lock held, and callers seem to not release the lock.
>  
> > +               if (signal_pending(current))
> > +                       return sock_intr_errno(timeo);  
> 
> same here.

Thanks a lot for catching these.
 
> Let's wait for syzbot to catch up :)

I'll send the fixes later today. This is just a passing comment, right?
There isn't a report you know is coming? Otherwise I can wait to give
syzbot credit, too.

I have two additional questions while I have you :)

Is the timeo supposed to be for the entire operation? Right now TLS
seems to use a fresh timeo every time it goes to wait, so the cumulative
wait can be much longer, as long as some data keeps coming in :/

Last one - I posted a bit of a disemboweling patch for TCP, LMK if it's
no bueno:

https://lore.kernel.org/all/20220719231129.1870776-6-kuba@kernel.org/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time
  2022-07-20 16:59     ` Jakub Kicinski
@ 2022-07-20 17:09       ` Eric Dumazet
  2022-07-20 17:19         ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2022-07-20 17:09 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, netdev, Paolo Abeni, Boris Pismenny,
	John Fastabend, Maxim Mikityanskiy, Tariq Toukan, vfedorenko

On Wed, Jul 20, 2022 at 6:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 20 Jul 2022 10:37:02 +0200 Eric Dumazet wrote:
> > > +               if (!timeo)
> > > +                       return -EAGAIN;
> >
> > We return with socket lock held, and callers seem to not release the lock.
> >
> > > +               if (signal_pending(current))
> > > +                       return sock_intr_errno(timeo);
> >
> > same here.
>
> Thanks a lot for catching these.
>
> > Let's wait for syzbot to catch up :)
>
> I'll send the fixes later today. This is just a passing comment, right?
> There isn't a report you know is coming? Otherwise I can wait to give
> syzbot credit, too.

I now have a full syzbot report, with a repro and bisection; I am
releasing it now.

>
> I have two additional questions while I have you :)
>
> Is the timeo supposed to be for the entire operation? Right now TLS
> seems to use a fresh timeo every time it goes to wait, so the cumulative
> wait can be much longer, as long as some data keeps coming in :/

Good question. I am not sure how this timeout is used in applications, but
I would think it serves as a way to make sure a stall is detected.
So restarting the timeout every time there is progress would make sense.

Applications needing a different behavior can still use a regular timer,
independent of networking (alarm() being the simplest one).

>
> Last one - I posted a bit of a disemboweling patch for TCP, LMK if it's
> no bueno:
>
> https://lore.kernel.org/all/20220719231129.1870776-6-kuba@kernel.org/

^ permalink raw reply	[flat|nested] 17+ messages in thread
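
For readers wondering how these timeout semantics look from user
space, a small illustration using the standard SO_RCVTIMEO socket
option (the 5s value is arbitrary; "fd" is assumed to be a connected
socket and "buf" a large buffer):

    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    /* With the stall-detector semantics described above, this read
     * only gives up after 5s with no progress; a large MSG_WAITALL
     * read over a slow but live connection may take longer than 5s
     * in total.  An application that wants a hard deadline for the
     * whole operation can arm an independent timer, e.g. alarm(30). */
    ssize_t n = recv(fd, buf, sizeof(buf), MSG_WAITALL);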

* Re: [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time
  2022-07-20 17:09       ` Eric Dumazet
@ 2022-07-20 17:19         ` Eric Dumazet
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2022-07-20 17:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, netdev, Paolo Abeni, Boris Pismenny,
	John Fastabend, Maxim Mikityanskiy, Tariq Toukan, vfedorenko

On Wed, Jul 20, 2022 at 7:09 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Jul 20, 2022 at 6:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Wed, 20 Jul 2022 10:37:02 +0200 Eric Dumazet wrote:
> > > > +               if (!timeo)
> > > > +                       return -EAGAIN;
> > >
> > > We return with socket lock held, and callers seem to not release the lock.
> > >
> > > > +               if (signal_pending(current))
> > > > +                       return sock_intr_errno(timeo);
> > >
> > > same here.
> >
> > Thanks a lot for catching these.
> >
> > > Let's wait for syzbot to catch up :)
> >
> > I'll send the fixes later today. This is just a passing comment, right?
> > There isn't a report you know is coming? Otherwise I can wait to give
> > syzbot credit, too.
>
> I now have a full syzbot report, with a repro and bisection; I am
> releasing it now.

( [syzbot] WARNING: still has locks held in tls_rx_reader_lock )

>
> >
> > I have two additional questions while I have you :)
> >
> > Is the timeo supposed to be for the entire operation? Right now TLS
> > seems to use a fresh timeo every time it goes to wait, so the cumulative
> > wait can be much longer, as long as some data keeps coming in :/
>
> Good question. I am not sure how this timeout is used in applications, but
> I would think it serves as a way to make sure a stall is detected.
> So restarting the timeout every time there is progress would make sense.
>
> Applications needing a different behavior can still use a regular timer,
> independent of networking (alarm() being the simplest one).
>
> >
> > Last one - I posted a bit of a disemboweling patch for TCP, LMK if it's
> > no bueno:
> >
> > https://lore.kernel.org/all/20220719231129.1870776-6-kuba@kernel.org/

Ah sorry, I have no comments. I guess we might try to factor out all
these similar functions at some point.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-07-20 17:19 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-15  5:22 [PATCH net-next v2 00/11] tls: rx: avoid skb_cow_data() Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 01/11] tls: rx: allow only one reader at a time Jakub Kicinski
2022-07-20  8:37   ` Eric Dumazet
2022-07-20 16:59     ` Jakub Kicinski
2022-07-20 17:09       ` Eric Dumazet
2022-07-20 17:19         ` Eric Dumazet
2022-07-15  5:22 ` [PATCH net-next v2 02/11] tls: rx: don't try to keep the skbs always on the list Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 03/11] tls: rx: don't keep decrypted skbs on ctx->recv_pkt Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 04/11] tls: rx: remove the message decrypted tracking Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 05/11] tls: rx: factor out device darg update Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 06/11] tls: rx: read the input skb from ctx->recv_pkt Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 07/11] tls: rx: return the decrypted skb via darg Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 08/11] tls: rx: async: adjust record geometry immediately Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 09/11] tls: rx: async: hold onto the input skb Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 10/11] tls: rx: async: don't put async zc on the list Jakub Kicinski
2022-07-15  5:22 ` [PATCH net-next v2 11/11] tls: rx: decrypt into a fresh skb Jakub Kicinski
2022-07-18 10:40 ` [PATCH net-next v2 00/11] tls: rx: avoid skb_cow_data() patchwork-bot+netdevbpf
