netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/4] net/tls: small TX offload optimizations
@ 2019-09-07  5:29 Jakub Kicinski
  2019-09-07  5:29 ` [PATCH net-next 1/4] net/tls: unref frags in order Jakub Kicinski
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Jakub Kicinski @ 2019-09-07  5:29 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, davejwatson, borisp, aviadye,
	john.fastabend, daniel, Jakub Kicinski

Hi!

This set brings small TLS TX device optimizations. The biggest
gain comes from fixing a misuse of non-temporal copy instructions.
On a synthetic workload modelled after a customer's RFC application
I see a 3-5% gain.

Jakub Kicinski (4):
  net/tls: unref frags in order
  net/tls: use RCU for the adder to the offload record list
  net/tls: remove the record tail optimization
  net/tls: align non temporal copy to cache lines

 net/tls/tls_device.c | 121 ++++++++++++++++++++++++++++++-------------
 1 file changed, 84 insertions(+), 37 deletions(-)

-- 
2.21.0



* [PATCH net-next 1/4] net/tls: unref frags in order
  2019-09-07  5:29 [PATCH net-next 0/4] net/tls: small TX offload optimizations Jakub Kicinski
@ 2019-09-07  5:29 ` Jakub Kicinski
  2019-09-07  5:29 ` [PATCH net-next 2/4] net/tls: use RCU for the adder to the offload record list Jakub Kicinski
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2019-09-07  5:29 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, davejwatson, borisp, aviadye,
	john.fastabend, daniel, Jakub Kicinski, Dirk van der Merwe

It's generally more cache friendly to walk arrays in order,
especially those which are likely not in cache.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 net/tls/tls_device.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 41c106e45f01..285c9f9e94e4 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -122,13 +122,10 @@ static struct net_device *get_netdev_for_sock(struct sock *sk)
 
 static void destroy_record(struct tls_record_info *record)
 {
-	int nr_frags = record->num_frags;
-	skb_frag_t *frag;
+	int i;
 
-	while (nr_frags-- > 0) {
-		frag = &record->frags[nr_frags];
-		__skb_frag_unref(frag);
-	}
+	for (i = 0; i < record->num_frags; i++)
+		__skb_frag_unref(&record->frags[i]);
 	kfree(record);
 }
 
-- 
2.21.0



* [PATCH net-next 2/4] net/tls: use RCU for the adder to the offload record list
  2019-09-07  5:29 [PATCH net-next 0/4] net/tls: small TX offload optimizations Jakub Kicinski
  2019-09-07  5:29 ` [PATCH net-next 1/4] net/tls: unref frags in order Jakub Kicinski
@ 2019-09-07  5:29 ` Jakub Kicinski
  2019-09-07  5:29 ` [PATCH net-next 3/4] net/tls: remove the record tail optimization Jakub Kicinski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2019-09-07  5:29 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, davejwatson, borisp, aviadye,
	john.fastabend, daniel, Jakub Kicinski, Dirk van der Merwe

All modifications to the TLS record list happen under the socket
lock. Since records form an ordered queue, readers are only
concerned about elements being removed; additions can happen
concurrently.

Use RCU primitives to ensure the correct access types
(READ_ONCE/WRITE_ONCE).
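
To illustrate the pattern, here is a minimal, self-contained sketch
with made-up names (it is not the tls_device.c code or its locking):
a single writer publishes entries with list_add_tail_rcu() under its
own lock, while a lockless reader walks the list under rcu_read_lock()
and relies on the _rcu list helpers for the correctly annotated
accesses to the ->next pointers.

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct ex_rec {
	struct list_head list;
	u64 end_seq;
};

static LIST_HEAD(ex_rec_list);
static DEFINE_SPINLOCK(ex_rec_lock);	/* writer-side exclusion only */

static void ex_rec_add(struct ex_rec *r)
{
	spin_lock(&ex_rec_lock);
	/* __list_add_rcu() publishes the entry with rcu_assign_pointer() */
	list_add_tail_rcu(&r->list, &ex_rec_list);
	spin_unlock(&ex_rec_lock);
}

static struct ex_rec *ex_rec_find(u64 seq)
{
	struct ex_rec *r, *found = NULL;

	rcu_read_lock();
	/* the _rcu walk uses rcu_dereference()/READ_ONCE() on ->next */
	list_for_each_entry_rcu(r, &ex_rec_list, list) {
		if (seq < r->end_seq) {
			found = r;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}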

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 net/tls/tls_device.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 285c9f9e94e4..b11355e00514 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -280,9 +280,7 @@ static int tls_push_record(struct sock *sk,
 
 	tls_append_frag(record, &dummy_tag_frag, prot->tag_size);
 	record->end_seq = tp->write_seq + record->len;
-	spin_lock_irq(&offload_ctx->lock);
-	list_add_tail(&record->list, &offload_ctx->records_list);
-	spin_unlock_irq(&offload_ctx->lock);
+	list_add_tail_rcu(&record->list, &offload_ctx->records_list);
 	offload_ctx->open_record = NULL;
 
 	if (test_bit(TLS_TX_SYNC_SCHED, &ctx->flags))
@@ -535,12 +533,16 @@ struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
 		/* if retransmit_hint is irrelevant start
 		 * from the beggining of the list
 		 */
-		info = list_first_entry(&context->records_list,
-					struct tls_record_info, list);
+		info = list_first_entry_or_null(&context->records_list,
+						struct tls_record_info, list);
+		if (!info)
+			return NULL;
 		record_sn = context->unacked_record_sn;
 	}
 
-	list_for_each_entry_from(info, &context->records_list, list) {
+	/* We just need the _rcu for the READ_ONCE() */
+	rcu_read_lock();
+	list_for_each_entry_from_rcu(info, &context->records_list, list) {
 		if (before(seq, info->end_seq)) {
 			if (!context->retransmit_hint ||
 			    after(info->end_seq,
@@ -549,12 +551,15 @@ struct tls_record_info *tls_get_record(struct tls_offload_context_tx *context,
 				context->retransmit_hint = info;
 			}
 			*p_record_sn = record_sn;
-			return info;
+			goto exit_rcu_unlock;
 		}
 		record_sn++;
 	}
+	info = NULL;
 
-	return NULL;
+exit_rcu_unlock:
+	rcu_read_unlock();
+	return info;
 }
 EXPORT_SYMBOL(tls_get_record);
 
-- 
2.21.0



* [PATCH net-next 3/4] net/tls: remove the record tail optimization
  2019-09-07  5:29 [PATCH net-next 0/4] net/tls: small TX offload optimizations Jakub Kicinski
  2019-09-07  5:29 ` [PATCH net-next 1/4] net/tls: unref frags in order Jakub Kicinski
  2019-09-07  5:29 ` [PATCH net-next 2/4] net/tls: use RCU for the adder to the offload record list Jakub Kicinski
@ 2019-09-07  5:29 ` Jakub Kicinski
  2019-09-07  5:30 ` [PATCH net-next 4/4] net/tls: align non temporal copy to cache lines Jakub Kicinski
  2019-09-07 16:11 ` [PATCH net-next 0/4] net/tls: small TX offload optimizations David Miller
  4 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2019-09-07  5:29 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, davejwatson, borisp, aviadye,
	john.fastabend, daniel, Jakub Kicinski, Dirk van der Merwe

For TLS device offload the tag/message authentication code is
filled in by the device. The kernel merely reserves space for
it. Because the device overwrites it, the contents of the tag do
not matter. Current code tries to save space by reusing the
header as the tag. This, however, leads to an additional frag
being created and defeats buffer coalescing (which trickles
all the way down to the drivers).

Remove this optimization and try to allocate the space for
the tag in the usual way, leaving the memory uninitialized.
If memory allocation fails, rewind the record pointer so that
the already copied user data is used as the tag.

Note that the optimization was actually buggy, as the tag
for TLS 1.2 is 16 bytes, but the header is just 13, so the reuse
may have looked past the end of the page.
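
For reference, a rough sketch of the arithmetic (TLS 1.2 with
AES-GCM-128 sizes; the constants are spelled out for the example
rather than taken from the kernel headers):

#include <assert.h>

enum {
	EX_TLS_HEADER_SIZE    = 5,	/* type + version + length */
	EX_TLS12_GCM_IV_SIZE  = 8,	/* explicit nonce */
	EX_TLS12_GCM_TAG_SIZE = 16,	/* AEAD tag */
};

int main(void)
{
	int prepend = EX_TLS_HEADER_SIZE + EX_TLS12_GCM_IV_SIZE; /* 13 */

	/* Pointing the 16-byte tag placeholder at the 13-byte prepend
	 * area can reference memory past it:
	 */
	assert(EX_TLS12_GCM_TAG_SIZE - prepend == 3);
	return 0;
}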

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 net/tls/tls_device.c | 67 +++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 20 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index b11355e00514..916c3c0a99f0 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -256,29 +256,13 @@ static int tls_push_record(struct sock *sk,
 			   struct tls_context *ctx,
 			   struct tls_offload_context_tx *offload_ctx,
 			   struct tls_record_info *record,
-			   struct page_frag *pfrag,
-			   int flags,
-			   unsigned char record_type)
+			   int flags)
 {
 	struct tls_prot_info *prot = &ctx->prot_info;
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct page_frag dummy_tag_frag;
 	skb_frag_t *frag;
 	int i;
 
-	/* fill prepend */
-	frag = &record->frags[0];
-	tls_fill_prepend(ctx,
-			 skb_frag_address(frag),
-			 record->len - prot->prepend_size,
-			 record_type,
-			 prot->version);
-
-	/* HW doesn't care about the data in the tag, because it fills it. */
-	dummy_tag_frag.page = skb_frag_page(frag);
-	dummy_tag_frag.offset = 0;
-
-	tls_append_frag(record, &dummy_tag_frag, prot->tag_size);
 	record->end_seq = tp->write_seq + record->len;
 	list_add_tail_rcu(&record->list, &offload_ctx->records_list);
 	offload_ctx->open_record = NULL;
@@ -302,6 +286,38 @@ static int tls_push_record(struct sock *sk,
 	return tls_push_sg(sk, ctx, offload_ctx->sg_tx_data, 0, flags);
 }
 
+static int tls_device_record_close(struct sock *sk,
+				   struct tls_context *ctx,
+				   struct tls_record_info *record,
+				   struct page_frag *pfrag,
+				   unsigned char record_type)
+{
+	struct tls_prot_info *prot = &ctx->prot_info;
+	int ret;
+
+	/* append tag
+	 * device will fill in the tag, we just need to append a placeholder
+	 * use socket memory to improve coalescing (re-using a single buffer
+	 * increases frag count)
+	 * if we can't allocate memory now, steal some back from data
+	 */
+	if (likely(skb_page_frag_refill(prot->tag_size, pfrag,
+					sk->sk_allocation))) {
+		ret = 0;
+		tls_append_frag(record, pfrag, prot->tag_size);
+	} else {
+		ret = prot->tag_size;
+		if (record->len <= prot->overhead_size)
+			return -ENOMEM;
+	}
+
+	/* fill prepend */
+	tls_fill_prepend(ctx, skb_frag_address(&record->frags[0]),
+			 record->len - prot->overhead_size,
+			 record_type, prot->version);
+	return ret;
+}
+
 static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx,
 				 struct page_frag *pfrag,
 				 size_t prepend_size)
@@ -452,13 +468,24 @@ static int tls_push_data(struct sock *sk,
 
 		if (done || record->len >= max_open_record_len ||
 		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
+			rc = tls_device_record_close(sk, tls_ctx, record,
+						     pfrag, record_type);
+			if (rc) {
+				if (rc > 0) {
+					size += rc;
+				} else {
+					size = orig_size;
+					destroy_record(record);
+					ctx->open_record = NULL;
+					break;
+				}
+			}
+
 			rc = tls_push_record(sk,
 					     tls_ctx,
 					     ctx,
 					     record,
-					     pfrag,
-					     tls_push_record_flags,
-					     record_type);
+					     tls_push_record_flags);
 			if (rc < 0)
 				break;
 		}
-- 
2.21.0



* [PATCH net-next 4/4] net/tls: align non temporal copy to cache lines
  2019-09-07  5:29 [PATCH net-next 0/4] net/tls: small TX offload optimizations Jakub Kicinski
                   ` (2 preceding siblings ...)
  2019-09-07  5:29 ` [PATCH net-next 3/4] net/tls: remove the record tail optimization Jakub Kicinski
@ 2019-09-07  5:30 ` Jakub Kicinski
  2019-09-07 16:11 ` [PATCH net-next 0/4] net/tls: small TX offload optimizations David Miller
  4 siblings, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2019-09-07  5:30 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, davejwatson, borisp, aviadye,
	john.fastabend, daniel, Jakub Kicinski, Dirk van der Merwe

Unlike normal TCP code, TLS has to touch the cache lines
it copies into in order to fill in header info. On memory-heavy
workloads, having non-temporal stores and normal accesses targeting
the same cache line leads to significant overhead.

Measured a 3% overhead when running 3600 round-robin connections
alongside an additional memory-heavy workload.
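
A quick userspace sketch of the split the new helper performs
(assuming 64-byte cache lines; this only illustrates the arithmetic
and is not the kernel code itself):

#include <assert.h>
#include <stddef.h>

#define EX_CACHE_BYTES 64UL	/* stand-in for SMP_CACHE_BYTES */

/* Bytes needed to reach the next cache-line boundary from addr,
 * 0 if addr is already aligned (same expression as the patch).
 */
static size_t bytes_to_boundary(unsigned long addr)
{
	return ~(addr - 1) & (EX_CACHE_BYTES - 1);
}

int main(void)
{
	/* Copying 200 bytes starting at offset 0x41 within a page:
	 * 63 bytes of cached copy up to the 0x80 boundary, 128 bytes
	 * (two full lines) with non-temporal stores, then 9 trailing
	 * bytes with a cached copy again.
	 */
	unsigned long addr = 0x41;
	size_t bytes = 200;
	size_t pre  = bytes_to_boundary(addr);			  /* 63 */
	size_t nt   = (bytes - pre) / EX_CACHE_BYTES * EX_CACHE_BYTES; /* 128 */
	size_t post = bytes - pre - nt;				  /* 9 */

	assert(pre == 63 && nt == 128 && post == 9);
	return 0;
}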

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
---
 net/tls/tls_device.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 916c3c0a99f0..f959487c5cd1 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -372,6 +372,31 @@ static int tls_do_allocation(struct sock *sk,
 	return 0;
 }
 
+static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
+{
+	size_t pre_copy, nocache;
+
+	pre_copy = ~((unsigned long)addr - 1) & (SMP_CACHE_BYTES - 1);
+	if (pre_copy) {
+		pre_copy = min(pre_copy, bytes);
+		if (copy_from_iter(addr, pre_copy, i) != pre_copy)
+			return -EFAULT;
+		bytes -= pre_copy;
+		addr += pre_copy;
+	}
+
+	nocache = round_down(bytes, SMP_CACHE_BYTES);
+	if (copy_from_iter_nocache(addr, nocache, i) != nocache)
+		return -EFAULT;
+	bytes -= nocache;
+	addr += nocache;
+
+	if (bytes && copy_from_iter(addr, bytes, i) != bytes)
+		return -EFAULT;
+
+	return 0;
+}
+
 static int tls_push_data(struct sock *sk,
 			 struct iov_iter *msg_iter,
 			 size_t size, int flags,
@@ -445,12 +470,10 @@ static int tls_push_data(struct sock *sk,
 		copy = min_t(size_t, size, (pfrag->size - pfrag->offset));
 		copy = min_t(size_t, copy, (max_open_record_len - record->len));
 
-		if (copy_from_iter_nocache(page_address(pfrag->page) +
-					       pfrag->offset,
-					   copy, msg_iter) != copy) {
-			rc = -EFAULT;
+		rc = tls_device_copy_data(page_address(pfrag->page) +
+					  pfrag->offset, copy, msg_iter);
+		if (rc)
 			goto handle_error;
-		}
 		tls_append_frag(record, pfrag, copy);
 
 		size -= copy;
-- 
2.21.0



* Re: [PATCH net-next 0/4] net/tls: small TX offload optimizations
  2019-09-07  5:29 [PATCH net-next 0/4] net/tls: small TX offload optimizations Jakub Kicinski
                   ` (3 preceding siblings ...)
  2019-09-07  5:30 ` [PATCH net-next 4/4] net/tls: align non temporal copy to cache lines Jakub Kicinski
@ 2019-09-07 16:11 ` David Miller
  4 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2019-09-07 16:11 UTC (permalink / raw)
  To: jakub.kicinski
  Cc: netdev, oss-drivers, davejwatson, borisp, aviadye,
	john.fastabend, daniel

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Fri,  6 Sep 2019 22:29:56 -0700

> Hi!
> 
> This set brings small TLS TX device optimizations. The biggest
> gain comes from fixing a misuse of non-temporal copy instructions.
> On a synthetic workload modelled after a customer's RFC application
> I see a 3-5% gain.

Series applied.

But if history is any indication, I'd watch for how much this actually
helps or hurts universally.  We once tried to use non-temporal stores
for sendmsg/recvmsg copies and had to turn that off because it only
helped in certain situations on certain CPUs and hurt in others.

