netdev.vger.kernel.org archive mirror
* [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
@ 2022-05-11  3:54 Joe Damato
  2022-05-11  3:54 ` [RFC,net-next,x86 1/6] arch, x86, uaccess: Add nontemporal copy functions Joe Damato
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86; +Cc: Joe Damato

Greetings:

The purpose of this RFC is to gauge initial reactions to adding a path in
af_unix for nontemporal copies in the write path. The network stack
supports something similar, but it is enabled for the entire NIC via the
NETIF_F_NOCACHE_COPY bit. As far as I can tell, it cannot be controlled per
socket or per write, and it does not affect unix sockets.

This work seeks to build on the existing nontemporal (NT) copy work in the
kernel by adding support in the unix socket write path via a new sendmsg
flag: MSG_NTCOPY. This could also be accomplished via a setsockopt flag,
but this initial implementation adds MSG_NTCOPY for ease of use and to
save an extra system call or two.

In the future, MSG_NTCOPY could be supported by other protocols, and
perhaps used in place of NETIF_F_NOCACHE_COPY to allow user programs to
enable this functionality on a per-write (or per-socket) basis.

If supporting NT copies in the unix write path is acceptable in principle,
I am open to making whatever modifications are requested or needed to get
this RFC closer to a v1. I am sure there will be many; this is just a PoC
in its current form.

As you'll see below, NT copies in the unix write path have a large
measurable impact on certain application architectures and CPUs.

Initial benchmarks are extremely encouraging. I wrote a simple C program to
benchmark this patchset. The program:
  - Creates a unix socket pair
  - Forks a child process
  - The parent process writes to the unix socket using MSG_NTCOPY - or not -
    depending on the command line flags
  - The child process uses splice to move the data from the unix socket to
    a pipe buffer, followed by a second splice call to move the data from
    the pipe buffer to a file descriptor opened on /dev/null.
  - taskset is used when launching the benchmark to ensure the parent and
    child run on appropriate CPUs for various scenarios
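
The steps above can be sketched roughly as follows. This is not the author's
benchmark (that is linked as [1]); it is a minimal illustration that assumes a
SOCK_STREAM socket pair and leaves out timing, CPU pinning, and the send
buffer and pipe size settings. MSG_NTCOPY is defined here with the value
proposed in patch 6/6; unpatched kernel headers do not provide it, and that
bit may carry a different meaning on other kernels. The names run_bench and
drain_with_splice are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Value proposed by patch 6/6; an assumption on any unpatched kernel. */
#ifndef MSG_NTCOPY
#define MSG_NTCOPY 0x2000000
#endif

/* Child: splice socket -> pipe -> /dev/null until the writer closes. */
static void drain_with_splice(int sock)
{
	int p[2];
	int devnull = open("/dev/null", O_WRONLY);

	if (devnull < 0 || pipe(p) < 0)
		_exit(1);
	for (;;) {
		ssize_t n = splice(sock, NULL, p[1], NULL, 65536, SPLICE_F_MOVE);

		if (n < 0)
			_exit(1);
		if (n == 0)
			break;		/* writer closed its end */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, devnull, NULL, n,
					   SPLICE_F_MOVE);

			if (m <= 0)
				_exit(1);
			n -= m;
		}
	}
	_exit(0);
}

/*
 * Write obj_size bytes iters times to a unix socket pair while a forked
 * child drains the other end with splice(). Returns the number of bytes
 * written, or -1 on error.
 */
static long long run_bench(size_t obj_size, int iters, int use_ntcopy)
{
	int sv[2], status;
	int flags = MSG_NOSIGNAL | (use_ntcopy ? MSG_NTCOPY : 0);
	long long total = 0;
	char *buf;
	pid_t pid;

	assert(obj_size > 0 && iters > 0);
	buf = malloc(obj_size);
	if (!buf || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	memset(buf, 'x', obj_size);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		close(sv[0]);
		drain_with_splice(sv[1]);	/* does not return */
	}
	close(sv[1]);

	for (int i = 0; i < iters; i++) {
		size_t off = 0;

		while (off < obj_size) {
			ssize_t n = send(sv[0], buf + off, obj_size - off, flags);

			if (n < 0) {
				total = -1;
				goto out;
			}
			off += n;
			total += n;
		}
	}
out:
	close(sv[0]);			/* signals EOF to the child */
	free(buf);
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0)
		return -1;
	return total;
}
```

A call such as run_bench(1048576, 100000, 0) would correspond to the "normal"
copy mode; passing 1 as the last argument only has the intended effect on a
kernel carrying this series.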

The source of the test program is available for examination [1] and results
for three benchmarks I ran are provided below.

Test system: AMD EPYC 7662 64-Core Processor,
	     64 cores / 128 threads,
	     512 KB L2 per core shared by sibling CPUs,
	     16 MB L3 per NUMA zone,
	     AMD-specific settings: NPS=1 and L3 as NUMA enabled

Test: 1048576 byte object,
      100,000 iterations,
      512 KB pipe buffer size,
      512 KB unix socket send buffer size

Sample command lines for running the tests are provided below. Note that
each command line shows how to run a "normal" copy benchmark. To run the
benchmark in MSG_NTCOPY mode, change command line argument 3 from 0 to 1.

Test pinned to CPUs 1 and 2 which do *not* share an L2 cache, but do share
an L3.

Command line for "normal" copy:
% time taskset -ac 1,2 ./unix-nt-bench 1048576 100000 0 524288 524288

Mode             real time (sec.)    throughput (Mb/s)
"Normal" copy    10.630              78,928
MSG_NTCOPY       7.429               112,935

Same test as above, but pinned to CPUs 1 and 65, which share an L2 (512 KB)
and an L3 (16 MB) cache.

Command line for "normal" copy:
% time taskset -ac 1,65 ./unix-nt-bench 1048576 100000 0 524288 524288

Mode             real time (sec.)    throughput (Mb/s)
"Normal" copy    12.532              66,941
MSG_NTCOPY       9.445               88,826

Same test as above, pinned to CPUs 1 and 65, but with 128 KB unix send
buffer and pipe buffer sizes (to avoid spilling L2).

Command line for "normal" copy:
% time taskset -ac 1,65 ./unix-nt-bench 1048576 100000 0 131072 131072

Mode             real time (sec.)    throughput (Mb/s)
"Normal" copy    12.451              67,377
MSG_NTCOPY       9.451               88,768
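
The throughput columns above appear consistent with object size x iterations
x 8 bits divided by the real time, expressed in units of 10^6 bits per
second. The small helper below is a sketch of that arithmetic, not code from
the benchmark; the function names are hypothetical, and the remaining
difference from the table (about 0.02%) is presumably because the program
times itself rather than using the output of time.

```c
#include <assert.h>

/* Throughput in megabits per second (10^6 bits/s) for one benchmark run. */
static double throughput_mbps(double obj_bytes, double iters, double seconds)
{
	assert(seconds > 0.0);
	return obj_bytes * iters * 8.0 / seconds / 1e6;
}

/* Relative gain of the improved run over the baseline, in percent. */
static double gain_percent(double baseline, double improved)
{
	assert(baseline > 0.0);
	return (improved / baseline - 1.0) * 100.0;
}
```

For the first test, throughput_mbps(1048576, 100000, 10.630) gives about
78,914 against the reported 78,928. Applying gain_percent to the reported
throughputs gives roughly 43% for CPUs 1 and 2, 33% for CPUs 1 and 65 with
512 KB buffers, and 32% for CPUs 1 and 65 with 128 KB buffers.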

Thanks,
Joe

[1]: https://gist.githubusercontent.com/jdamato-fsly/03a2f0cd4e71ebe0fef97f7f2980d9e5/raw/19cfd3aca59109ebf5b03871d952ea1360f3e982/unix-nt-copy-bench.c

Joe Damato (6):
  arch, x86, uaccess: Add nontemporal copy functions
  iov_iter: Allow custom copyin function
  iov_iter: Add a nocache copy iov iterator
  net: Add a struct for managing copy functions
  net: Add a way to copy skbs without affect cache
  net: unix: Add MSG_NTCOPY

 arch/x86/include/asm/uaccess_64.h |  6 ++++
 include/linux/skbuff.h            |  2 ++
 include/linux/socket.h            |  1 +
 include/linux/uaccess.h           |  6 ++++
 include/linux/uio.h               |  2 ++
 lib/iov_iter.c                    | 34 ++++++++++++++++++----
 net/core/datagram.c               | 61 ++++++++++++++++++++++++++++-----------
 net/unix/af_unix.c                | 13 +++++++--
 8 files changed, 100 insertions(+), 25 deletions(-)

-- 
2.7.4



* [RFC,net-next,x86 1/6] arch, x86, uaccess: Add nontemporal copy functions
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
@ 2022-05-11  3:54 ` Joe Damato
  2022-05-11  3:54 ` [RFC,net-next 2/6] iov_iter: Allow custom copyin function Joe Damato
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin
  Cc: Joe Damato

Add a generic non-temporal wrapper to uaccess which can be overridden by
arches that support non-temporal copies.

An implementation is added for x86 which wraps an existing non-temporal
copy in the kernel.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 arch/x86/include/asm/uaccess_64.h | 6 ++++++
 include/linux/uaccess.h           | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 45697e0..ed41dba 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -65,6 +65,12 @@ extern long __copy_user_flushcache(void *dst, const void __user *src, unsigned s
 extern void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 			   size_t len);
 
+static inline unsigned long
+__copy_from_user_nocache(void *dst, const void __user *src, unsigned long size)
+{
+	return (unsigned long)__copy_user_nocache(dst, src, (unsigned int) size, 0);
+}
+
 static inline int
 __copy_from_user_inatomic_nocache(void *dst, const void __user *src,
 				  unsigned size)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 5461794..d1f57a1 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -234,6 +234,12 @@ static inline bool pagefault_disabled(void)
 #ifndef ARCH_HAS_NOCACHE_UACCESS
 
 static inline __must_check unsigned long
+__copy_from_user_nocache(void *to, const void __user *from, unsigned long n)
+{
+	return __copy_from_user(to, from, n);
+}
+
+static inline __must_check unsigned long
 __copy_from_user_inatomic_nocache(void *to, const void __user *from,
 				  unsigned long n)
 {
-- 
2.7.4
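
The shape of patch 1/6 (a generic fallback that an architecture can replace
with a nontemporal version) can be illustrated in userspace. The sketch below
is not the kernel's __copy_user_nocache; it assumes SSE2 streaming stores
where the compiler advertises them and falls back to memcpy elsewhere. The
names copy_nocache and HAVE_NT_COPY are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define HAVE_NT_COPY 1	/* hypothetical stand-in for ARCH_HAS_NOCACHE_UACCESS */
#endif

#ifdef HAVE_NT_COPY
/* "Arch" version: write 16-byte blocks with nontemporal (streaming) stores. */
static void copy_nocache(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t head;

	assert(dst != NULL && src != NULL);
	head = (16 - ((uintptr_t)d & 15)) & 15;
	if (head > n)
		head = n;
	memcpy(d, s, head);		/* align the destination to 16 bytes */
	d += head;
	s += head;
	n -= head;

	for (; n >= 16; d += 16, s += 16, n -= 16)
		_mm_stream_si128((__m128i *)d,
				 _mm_loadu_si128((const __m128i *)s));
	_mm_sfence();			/* order the streaming stores */
	memcpy(d, s, n);		/* tail shorter than one block */
}
#else
/* Generic fallback: an ordinary cached copy, like the uaccess.h hunk. */
static void copy_nocache(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, n);
}
#endif
```

Callers use copy_nocache the same way on every architecture; only the build
decides whether the stores bypass the cache, which mirrors how the series
keeps one name, __copy_from_user_nocache, for both cases.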



* [RFC,net-next 2/6] iov_iter: Allow custom copyin function
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
  2022-05-11  3:54 ` [RFC,net-next,x86 1/6] arch, x86, uaccess: Add nontemporal copy functions Joe Damato
@ 2022-05-11  3:54 ` Joe Damato
  2022-05-11  3:54 ` [RFC,net-next 3/6] iov_iter: Add a nocache copy iov iterator Joe Damato
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86, Alexander Viro; +Cc: Joe Damato

When calling copy_page_from_iter_iovec, allow callers to specify the copy
function they'd like to use.

The only caller is updated to pass raw_copy_from_user.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 lib/iov_iter.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330..ef22ec1 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -253,7 +253,9 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
 }
 
 static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
+			 struct iov_iter *i,
+			 unsigned long (*_copyin)(void *to, const void __user *from,
+						 unsigned long n))
 {
 	size_t skip, copy, left, wanted;
 	const struct iovec *iov;
@@ -278,7 +280,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 		to = kaddr + offset;
 
 		/* first chunk, usually the only one */
-		left = copyin(to, buf, copy);
+		left = _copyin(to, buf, copy);
 		copy -= left;
 		skip += copy;
 		to += copy;
@@ -288,7 +290,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 			iov++;
 			buf = iov->iov_base;
 			copy = min(bytes, iov->iov_len);
-			left = copyin(to, buf, copy);
+			left = _copyin(to, buf, copy);
 			copy -= left;
 			skip = copy;
 			to += copy;
@@ -307,7 +309,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 
 	kaddr = kmap(page);
 	to = kaddr + offset;
-	left = copyin(to, buf, copy);
+	left = _copyin(to, buf, copy);
 	copy -= left;
 	skip += copy;
 	to += copy;
@@ -316,7 +318,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
 		iov++;
 		buf = iov->iov_base;
 		copy = min(bytes, iov->iov_len);
-		left = copyin(to, buf, copy);
+		left = _copyin(to, buf, copy);
 		copy -= left;
 		skip = copy;
 		to += copy;
@@ -899,7 +901,7 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 	if (unlikely(!page_copy_sane(page, offset, bytes)))
 		return 0;
 	if (likely(iter_is_iovec(i)))
-		return copy_page_from_iter_iovec(page, offset, bytes, i);
+		return copy_page_from_iter_iovec(page, offset, bytes, i, raw_copy_from_user);
 	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
 		void *kaddr = kmap_local_page(page);
 		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
-- 
2.7.4



* [RFC,net-next 3/6] iov_iter: Add a nocache copy iov iterator
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
  2022-05-11  3:54 ` [RFC,net-next,x86 1/6] arch, x86, uaccess: Add nontemporal copy functions Joe Damato
  2022-05-11  3:54 ` [RFC,net-next 2/6] iov_iter: Allow custom copyin function Joe Damato
@ 2022-05-11  3:54 ` Joe Damato
  2022-05-11  3:54 ` [RFC,net-next 4/6] net: Add a struct for managing copy functions Joe Damato
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86, Alexander Viro; +Cc: Joe Damato

Add copy_page_from_iter_nocache, which wraps copy_page_from_iter_iovec and
passes in a custom copyin function: __copy_from_user_nocache.

This allows callers of copy_page_from_iter_nocache to copy data without
disturbing the CPU cache.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285f..58c7946 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -142,6 +142,8 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+size_t copy_page_from_iter_nocache(struct page *page, size_t offset, size_t bytes,
+			 struct iov_iter *i);
 
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ef22ec1..985bf58 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -895,6 +895,26 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_to_iter);
 
+size_t copy_page_from_iter_nocache(struct page *page, size_t offset, size_t
+		bytes, struct iov_iter *i)
+{
+	if (unlikely(!page_copy_sane(page, offset, bytes)))
+		return 0;
+	if (unlikely(iov_iter_is_pipe(i) || iov_iter_is_discard(i))) {
+		WARN_ON(1);
+		return 0;
+	}
+	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
+		void *kaddr = kmap_atomic(page);
+		size_t wanted = _copy_from_iter_nocache(kaddr + offset, bytes, i);
+
+		kunmap_atomic(kaddr);
+		return wanted;
+	} else
+		return copy_page_from_iter_iovec(page, offset, bytes, i,
+				__copy_from_user_nocache);
+}
+
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-- 
2.7.4



* [RFC,net-next 4/6] net: Add a struct for managing copy functions
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
                   ` (2 preceding siblings ...)
  2022-05-11  3:54 ` [RFC,net-next 3/6] iov_iter: Add a nocache copy iov iterator Joe Damato
@ 2022-05-11  3:54 ` Joe Damato
  2022-05-11  3:54 ` [RFC,net-next 5/6] net: Add a way to copy skbs without affect cache Joe Damato
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86, Eric Dumazet, Paolo Abeni
  Cc: Joe Damato

Add struct skb_copier which encapsulates two functions for copying data,
and provide a default copier, skb_copier.

Separate skb_copy_datagram_from_iter into a helper function,
do_skb_copy_datagram, which takes a struct skb_copier.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 net/core/datagram.c | 49 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 17 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 50f4fae..a87c41b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -532,18 +532,19 @@ int skb_copy_datagram_iter(const struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_datagram_iter);
 
-/**
- *	skb_copy_datagram_from_iter - Copy a datagram from an iov_iter.
- *	@skb: buffer to copy
- *	@offset: offset in the buffer to start copying to
- *	@from: the copy source
- *	@len: amount of data to copy to buffer from iovec
- *
- *	Returns 0 or -EFAULT.
- */
-int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
-				 struct iov_iter *from,
-				 int len)
+struct skb_copier {
+	size_t (*copy_from_iter)(void *addr, size_t bytes, struct iov_iter *i);
+	size_t (*copy_page_from_iter)(struct page *page, size_t offset, size_t bytes,
+				      struct iov_iter *i);
+};
+
+struct skb_copier skb_copier = {
+	.copy_from_iter = copy_from_iter,
+	.copy_page_from_iter = copy_page_from_iter
+};
+
+static int do_skb_copy_datagram(struct sk_buff *skb, int offset,
+				struct iov_iter *from, int len, struct skb_copier copier)
 {
 	int start = skb_headlen(skb);
 	int i, copy = start - offset;
@@ -553,7 +554,7 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 	if (copy > 0) {
 		if (copy > len)
 			copy = len;
-		if (copy_from_iter(skb->data + offset, copy, from) != copy)
+		if (copier.copy_from_iter(skb->data + offset, copy, from) != copy)
 			goto fault;
 		if ((len -= copy) == 0)
 			return 0;
@@ -573,7 +574,7 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 
 			if (copy > len)
 				copy = len;
-			copied = copy_page_from_iter(skb_frag_page(frag),
+			copied = copier.copy_page_from_iter(skb_frag_page(frag),
 					  skb_frag_off(frag) + offset - start,
 					  copy, from);
 			if (copied != copy)
@@ -595,9 +596,7 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 		if ((copy = end - offset) > 0) {
 			if (copy > len)
 				copy = len;
-			if (skb_copy_datagram_from_iter(frag_iter,
-							offset - start,
-							from, copy))
+			if (do_skb_copy_datagram(frag_iter, offset - start, from, copy, copier))
 				goto fault;
 			if ((len -= copy) == 0)
 				return 0;
@@ -611,6 +610,22 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 fault:
 	return -EFAULT;
 }
+
+/**
+ *	skb_copy_datagram_from_iter - Copy a datagram from an iov_iter.
+ *	@skb: buffer to copy
+ *	@offset: offset in the buffer to start copying to
+ *	@from: the copy source
+ *	@len: amount of data to copy to buffer from iovec
+ *
+ *	Returns 0 or -EFAULT.
+ */
+int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
+				struct iov_iter *from,
+				 int len)
+{
+	return do_skb_copy_datagram(skb, offset, from, len, skb_copier);
+}
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 
 int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-- 
2.7.4
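
The refactoring in patch 4/6 follows a common pattern: move the body into a
helper that takes a table of copy routines, and keep the public function as a
thin wrapper that passes the default table. The userspace sketch below shows
only that pattern; struct copier, plain_copy, do_copy, and copy_default are
hypothetical names, and the chunked loop stands in for the walk over skb
fragments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for struct skb_copier: one pluggable copy routine. */
struct copier {
	size_t (*copy)(void *dst, const void *src, size_t n);
};

static size_t plain_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return n;			/* bytes copied */
}

static const struct copier default_copier = { .copy = plain_copy };

/*
 * Shared helper, analogous to do_skb_copy_datagram(): copy len bytes in
 * chunks of at most chunk bytes through whichever copier was passed in.
 * Returns 0 on success or -1 if a chunk came up short.
 */
static int do_copy(void *dst, const void *src, size_t len, size_t chunk,
		   struct copier c)
{
	char *d = dst;
	const char *s = src;

	assert(chunk > 0);
	while (len > 0) {
		size_t n = len < chunk ? len : chunk;

		if (c.copy(d, s, n) != n)
			return -1;
		d += n;
		s += n;
		len -= n;
	}
	return 0;
}

/* Public entry point keeps its old signature and picks the default. */
static int copy_default(void *dst, const void *src, size_t len)
{
	return do_copy(dst, src, len, 4096, default_copier);
}
```

A second table with a different copy routine can then reuse do_copy
unchanged, which is what patch 5/6 does with skb_nocache_copier. The cost is
that every chunk goes through a function pointer.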



* [RFC,net-next 5/6] net: Add a way to copy skbs without affect cache
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
                   ` (3 preceding siblings ...)
  2022-05-11  3:54 ` [RFC,net-next 4/6] net: Add a struct for managing copy functions Joe Damato
@ 2022-05-11  3:54 ` Joe Damato
  2022-05-11  3:54 ` [RFC,net-next 6/6] net: unix: Add MSG_NTCOPY Joe Damato
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86, Eric Dumazet, Paolo Abeni
  Cc: Joe Damato

Add an skb_copier, skb_nocache_copier, which contains function pointers to
nontemporal copy routines.

Using skb_nocache_copier and do_skb_copy_datagram, implement
skb_copy_datagram_from_iter_nocache. This function is intended to be used
by callers which would like to copy data into SKBs using nontemporal
instructions to avoid the CPU cache.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/datagram.c    | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 97de40b..32c0cba 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3918,6 +3918,8 @@ int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset,
 			   struct ahash_request *hash);
 int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 				 struct iov_iter *from, int len);
+int skb_copy_datagram_from_iter_nocache(struct sk_buff *skb, int offset,
+				 struct iov_iter *from, int len);
 int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
 void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
 void __skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb, int len);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index a87c41b..da8557b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -543,6 +543,11 @@ struct skb_copier skb_copier = {
 	.copy_page_from_iter = copy_page_from_iter
 };
 
+struct skb_copier skb_nocache_copier = {
+	.copy_from_iter = copy_from_iter_nocache,
+	.copy_page_from_iter = copy_page_from_iter_nocache
+};
+
 static int do_skb_copy_datagram(struct sk_buff *skb, int offset,
 				struct iov_iter *from, int len, struct skb_copier copier)
 {
@@ -611,6 +616,13 @@ static int do_skb_copy_datagram(struct sk_buff *skb, int offset,
 	return -EFAULT;
 }
 
+int skb_copy_datagram_from_iter_nocache(struct sk_buff *skb, int offset,
+					struct iov_iter *from, int len)
+{
+	return do_skb_copy_datagram(skb, offset, from, len, skb_nocache_copier);
+}
+EXPORT_SYMBOL(skb_copy_datagram_from_iter_nocache);
+
 /**
  *	skb_copy_datagram_from_iter - Copy a datagram from an iov_iter.
  *	@skb: buffer to copy
-- 
2.7.4



* [RFC,net-next 6/6] net: unix: Add MSG_NTCOPY
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
                   ` (4 preceding siblings ...)
  2022-05-11  3:54 ` [RFC,net-next 5/6] net: Add a way to copy skbs without affect cache Joe Damato
@ 2022-05-11  3:54 ` Joe Damato
  2022-05-11 23:25 ` [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Jakub Kicinski
  2022-05-31  6:04 ` Christoph Hellwig
  7 siblings, 0 replies; 13+ messages in thread
From: Joe Damato @ 2022-05-11  3:54 UTC (permalink / raw)
  To: netdev, davem, kuba, linux-kernel, x86, Eric Dumazet, Paolo Abeni
  Cc: Joe Damato

Add a new sendmsg flag, MSG_NTCOPY, which user programs can use to ask the
kernel to copy data into the kernel during sendmsg using nontemporal
copies, if the architecture supports them.

Signed-off-by: Joe Damato <jdamato@fastly.com>
---
 include/linux/socket.h |  1 +
 net/unix/af_unix.c     | 13 +++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 12085c9..c9b10aa 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -318,6 +318,7 @@ struct ucred {
 					  * plain text and require encryption
 					  */
 
+#define MSG_NTCOPY	0x2000000	/* Use a non-temporal copy */
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e1dd9e9..ccbd643 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1907,7 +1907,11 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	skb_put(skb, len - data_len);
 	skb->data_len = data_len;
 	skb->len = len;
-	err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, len);
+	if (msg->msg_flags & MSG_NTCOPY)
+		err = skb_copy_datagram_from_iter_nocache(skb, 0, &msg->msg_iter, len);
+	else
+		err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, len);
+
 	if (err)
 		goto out_free;
 
@@ -2167,7 +2171,12 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		skb_put(skb, size - data_len);
 		skb->data_len = data_len;
 		skb->len = size;
-		err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+
+		if (msg->msg_flags & MSG_NTCOPY)
+			err = skb_copy_datagram_from_iter_nocache(skb, 0, &msg->msg_iter, size);
+		else
+			err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+
 		if (err) {
 			kfree_skb(skb);
 			goto out_err;
-- 
2.7.4



* Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
                   ` (5 preceding siblings ...)
  2022-05-11  3:54 ` [RFC,net-next 6/6] net: unix: Add MSG_NTCOPY Joe Damato
@ 2022-05-11 23:25 ` Jakub Kicinski
  2022-05-12  1:01   ` Joe Damato
  2022-05-31  6:04 ` Christoph Hellwig
  7 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2022-05-11 23:25 UTC (permalink / raw)
  To: Joe Damato; +Cc: netdev, davem, linux-kernel, x86

On Tue, 10 May 2022 20:54:21 -0700 Joe Damato wrote:
> Initial benchmarks are extremely encouraging. I wrote a simple C program to
> benchmark this patchset, the program:
>   - Creates a unix socket pair
>   - Forks a child process
>   - The parent process writes to the unix socket using MSG_NTCOPY - or not -
>     depending on the command line flags
>   - The child process uses splice to move the data from the unix socket to
>     a pipe buffer, followed by a second splice call to move the data from
>     the pipe buffer to a file descriptor opened on /dev/null.
>   - taskset is used when launching the benchmark to ensure the parent and
>     child run on appropriate CPUs for various scenarios

Is there a practical use case?

The patches look like a lot of extra indirect calls.


* Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
  2022-05-11 23:25 ` [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Jakub Kicinski
@ 2022-05-12  1:01   ` Joe Damato
  2022-05-12 19:46     ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Damato @ 2022-05-12  1:01 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, linux-kernel, x86

On Wed, May 11, 2022 at 04:25:20PM -0700, Jakub Kicinski wrote:
> On Tue, 10 May 2022 20:54:21 -0700 Joe Damato wrote:
> > Initial benchmarks are extremely encouraging. I wrote a simple C program to
> > benchmark this patchset, the program:
> >   - Creates a unix socket pair
> >   - Forks a child process
> >   - The parent process writes to the unix socket using MSG_NTCOPY - or not -
> >     depending on the command line flags
> >   - The child process uses splice to move the data from the unix socket to
> >     a pipe buffer, followed by a second splice call to move the data from
> >     the pipe buffer to a file descriptor opened on /dev/null.
> >   - taskset is used when launching the benchmark to ensure the parent and
> >     child run on appropriate CPUs for various scenarios
> 
> Is there a practical use case?

Yes; for us there seems to be - especially with AMD Zen2. I'll try to
describe such a setup and my synthetic HTTP benchmark results.

Imagine a program, call it storageD, which is responsible for storing and
retrieving data from a data store. Other programs can request data from
storageD via communicating with it on a Unix socket.

One such program that could request data via the Unix socket is an HTTP
daemon. For some client connections that the HTTP daemon receives, the
daemon may determine that responses can be sent in plain text.

In this case, the HTTP daemon can use splice to move data from the unix
socket connection with storageD directly to the client TCP socket via a
pipe. splice saves CPU cycles and avoids incurring any memory access
latency since the data itself is not accessed.

Because we'll use splice (instead of accessing the data and potentially
affecting the CPU cache) it is advantageous for storageD to use NT copies
when it writes to the Unix socket to avoid evicting hot data from the CPU
cache. After all, once the data is copied into the kernel on the unix
socket write path, it won't be touched again; only spliced.

In my synthetic HTTP benchmarks for this setup, we've been able to increase
network throughput of the HTTP daemon by roughly 30% while reducing
the system time of storageD. We're still collecting data on production
workloads.

The motivation, IMHO, is very similar to the motivation for
NETIF_F_NOCACHE_COPY, as far as I understand.

In some cases, when an application writes to a network socket the data
written to the socket won't be accessed again once it is copied into the
kernel. In these cases, NETIF_F_NOCACHE_COPY can improve performance and
helps to preserve the CPU cache and avoid evicting hot data.

We get a sizable benefit from this option, too, in situations where we
can't use splice and have to call write to transmit data to client
connections. We want to get the same benefit of NETIF_F_NOCACHE_COPY, but
when writing to Unix sockets as well.

Let me know if that makes it more clear.

> The patches look like a lot of extra indirect calls.

Yup. As I mentioned in the cover letter this was mostly a PoC that seems to
work and increases network throughput in a real world scenario.

If this general line of thinking (NT copies on write to a Unix socket) is
acceptable, I'm happy to refactor the code however you (and others) would
like to get it to an acceptable state.

Thanks for taking a look,
Joe


* Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
  2022-05-12  1:01   ` Joe Damato
@ 2022-05-12 19:46     ` Jakub Kicinski
  2022-05-12 22:53       ` Joe Damato
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2022-05-12 19:46 UTC (permalink / raw)
  To: Joe Damato; +Cc: netdev, davem, linux-kernel, x86

On Wed, 11 May 2022 18:01:54 -0700 Joe Damato wrote:
> > Is there a practical use case?  
> 
> Yes; for us there seems to be - especially with AMD Zen2. I'll try to
> describe such a setup and my synthetic HTTP benchmark results.
> 
> Imagine a program, call it storageD, which is responsible for storing and
> retrieving data from a data store. Other programs can request data from
> storageD via communicating with it on a Unix socket.
> 
> One such program that could request data via the Unix socket is an HTTP
> daemon. For some client connections that the HTTP daemon receives, the
> daemon may determine that responses can be sent in plain text.
> 
> In this case, the HTTP daemon can use splice to move data from the unix
> socket connection with storageD directly to the client TCP socket via a
> pipe. splice saves CPU cycles and avoids incurring any memory access
> latency since the data itself is not accessed.
> 
> Because we'll use splice (instead of accessing the data and potentially
> affecting the CPU cache) it is advantageous for storageD to use NT copies
> when it writes to the Unix socket to avoid evicting hot data from the CPU
> cache. After all, once the data is copied into the kernel on the unix
> socket write path, it won't be touched again; only spliced.
> 
> In my synthetic HTTP benchmarks for this setup, we've been able to increase
> network throughput of the HTTP daemon by roughly 30% while reducing
> the system time of storageD. We're still collecting data on production
> workloads.
> 
> The motivation, IMHO, is very similar to the motivation for
> NETIF_F_NOCACHE_COPY, as far as I understand.
> 
> In some cases, when an application writes to a network socket the data
> written to the socket won't be accessed again once it is copied into the
> kernel. In these cases, NETIF_F_NOCACHE_COPY can improve performance and
> helps to preserve the CPU cache and avoid evicting hot data.
> 
> We get a sizable benefit from this option, too, in situations where we
> can't use splice and have to call write to transmit data to client
> connections. We want to get the same benefit of NETIF_F_NOCACHE_COPY, but
> when writing to Unix sockets as well.
> 
> Let me know if that makes it more clear.

Makes sense, thanks for the explainer.

> > The patches look like a lot of extra indirect calls.  
> 
> Yup. As I mentioned in the cover letter this was mostly a PoC that seems to
> work and increases network throughput in a real world scenario.
> 
> If this general line of thinking (NT copies on write to a Unix socket) is
> acceptable, I'm happy to refactor the code however you (and others) would
> like to get it to an acceptable state.

My only concern is that in a post-Spectre world the indirect calls are
going to be more expensive than a branch would be. But I'm not really
a micro-optimization expert :)
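
The alternative hinted at here, selecting the copy routine with a branch
rather than through a function pointer, might look like the userspace sketch
below. It illustrates the idea only and is not a proposed kernel change; all
names are hypothetical, and nt_copy_placeholder uses memcpy where a real
version would use nontemporal stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static unsigned long nt_calls;	/* counts calls to the placeholder below */

/* Ordinary cached copy. */
static size_t cached_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return n;
}

/* Placeholder: a real version would use nontemporal stores. */
static size_t nt_copy_placeholder(void *dst, const void *src, size_t n)
{
	nt_calls++;
	memcpy(dst, src, n);
	return n;
}

/* Both callees are named at the call site, so the calls are direct. */
static size_t copy_select(void *dst, const void *src, size_t n, bool nocache)
{
	assert(dst != NULL && src != NULL);
	if (nocache)
		return nt_copy_placeholder(dst, src, n);
	return cached_copy(dst, src, n);
}
```

With this shape the flag check is an ordinary conditional branch, and no
indirect call is needed on the copy path.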


* Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
  2022-05-12 19:46     ` Jakub Kicinski
@ 2022-05-12 22:53       ` Joe Damato
  2022-05-12 23:12         ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Joe Damato @ 2022-05-12 22:53 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, linux-kernel, x86

On Thu, May 12, 2022 at 12:46:08PM -0700, Jakub Kicinski wrote:
> On Wed, 11 May 2022 18:01:54 -0700 Joe Damato wrote:
> > > Is there a practical use case?  
> > 
> > Yes; for us there seems to be - especially with AMD Zen2. I'll try to
> > describe such a setup and my synthetic HTTP benchmark results.
> > 
> > Imagine a program, call it storageD, which is responsible for storing and
> > retrieving data from a data store. Other programs can request data from
> > storageD via communicating with it on a Unix socket.
> > 
> > One such program that could request data via the Unix socket is an HTTP
> > daemon. For some client connections that the HTTP daemon receives, the
> > daemon may determine that responses can be sent in plain text.
> > 
> > In this case, the HTTP daemon can use splice to move data from the unix
> > socket connection with storageD directly to the client TCP socket via a
> > pipe. splice saves CPU cycles and avoids incurring any memory access
> > latency since the data itself is not accessed.
> > 
> > Because we'll use splice (instead of accessing the data and potentially
> > affecting the CPU cache) it is advantageous for storageD to use NT copies
> > when it writes to the Unix socket to avoid evicting hot data from the CPU
> > cache. After all, once the data is copied into the kernel on the unix
> > socket write path, it won't be touched again; only spliced.
> > 
> > In my synthetic HTTP benchmarks for this setup, we've been able to increase
> > network throughput of the HTTP daemon by roughly 30% while reducing
> > the system time of storageD. We're still collecting data on production
> > workloads.
> > 
> > The motivation, IMHO, is very similar to the motivation for
> > NETIF_F_NOCACHE_COPY, as far as I understand.
> > 
> > In some cases, when an application writes to a network socket the data
> > written to the socket won't be accessed again once it is copied into the
> > kernel. In these cases, NETIF_F_NOCACHE_COPY can improve performance and
> > helps to preserve the CPU cache and avoid evicting hot data.
> > 
> > We get a sizable benefit from this option, too, in situations where we
> > can't use splice and have to call write to transmit data to client
> > connections. We want to get the same benefit of NETIF_F_NOCACHE_COPY, but
> > when writing to Unix sockets as well.
> > 
> > Let me know if that makes it more clear.
> 
> Makes sense, thanks for the explainer.
> 
> > > The patches look like a lot of extra indirect calls.  
> > 
> > Yup. As I mentioned in the cover letter this was mostly a PoC that seems to
> > work and increases network throughput in a real world scenario.
> > 
> > If this general line of thinking (NT copies on write to a Unix socket) is
> > acceptable, I'm happy to refactor the code however you (and others) would
> > like to get it to an acceptable state.
> 
> My only concern is that in a post-Spectre world the indirect calls are
> going to be more expensive than a branch would be. But I'm not really
> a micro-optimization expert :)

Makes sense; neither am I, FWIW :)

For whatever reason, on AMD Zen2 it seems that using non-temporal
instructions when copying data sizes above the L2 size is a huge
performance win (compared to the kernel's normal temporal copy code) even
if that size fits in L3.

This is why both NETIF_F_NOCACHE_COPY and MSG_NTCOPY from this series seem
to have such a large, measurable impact in the contrived benchmark I
included in the cover letter and also in synthetic HTTP workloads.

I'll plan on including numbers from the benchmark program on a few other
CPUs I have access to in the cover letter for any follow-up RFCs or
revisions.

As a data point, there has been similar-ish work done in glibc [1] to
determine when non-temporal copies should be used on Zen2 based on the size
of the copy. I'm certainly not a micro-arch expert by any stretch, but the
glibc work plus the benchmark results I've measured seem to suggest that
NT-copies can be very helpful on Zen2.
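
For illustration only, the size-threshold idea can be sketched in userspace
C. This is not the glibc or kernel implementation: the function names, the
512 KiB threshold (chosen here as an assumed per-core L2 size), and the
SSE2-based copy loop are all assumptions made for the example. On builds
without SSE2 the sketch falls back to memcpy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical threshold: an assumed per-core L2 size. Tune by measurement. */
#define NT_THRESHOLD ((size_t)512 * 1024)

/* Copy n bytes using non-temporal (streaming) stores where available. */
static void copy_nt(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: ordinary stores until dst reaches a 16-byte boundary. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Body: 16-byte streaming stores that bypass the cache hierarchy. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16; s += 16; n -= 16;
    }

    /* Tail, then a fence because streaming stores are weakly ordered. */
    memcpy(d, s, n);
    _mm_sfence();
#else
    memcpy(dst, src, n);  /* no SSE2: plain copy as a fallback */
#endif
}

/* Pick the copy strategy by size. Returns 1 if the NT path was chosen. */
static int copy_by_size(void *dst, const void *src, size_t n)
{
    if (n >= NT_THRESHOLD) {
        copy_nt(dst, src, n);
        return 1;
    }
    memcpy(dst, src, n);
    return 0;
}
```

A caller would use copy_by_size() wherever it currently calls memcpy(); the
right threshold is CPU-specific and should come from benchmarking rather
than from the constant above.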

Two questions for you:

 1. Do you have any strong opinions on the sendmsg flag vs a socket option?

 2. If I can think of a way to avoid the indirect calls, do you think this
    series is ready for a v1? I'm not sure if there's anything major that
    needs to be addressed aside from the indirect calls.

I'll include some documentation and cosmetic cleanup in the v1, as well.

Thanks,
Joe

[1]: https://sourceware.org/pipermail/libc-alpha/2020-October/118895.html


* Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
  2022-05-12 22:53       ` Joe Damato
@ 2022-05-12 23:12         ` Jakub Kicinski
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Kicinski @ 2022-05-12 23:12 UTC (permalink / raw)
  To: Joe Damato; +Cc: netdev, davem, linux-kernel, x86

On Thu, 12 May 2022 15:53:05 -0700 Joe Damato wrote:
>  1. Do you have any strong opinions on the sendmsg flag vs a socket option?

It sounded like you want to mix NT and non-NT writes on a single socket,
hence the flag was a requirement. Otherwise a socket option is better,
because we can have many more of those than there are bits for flags.

>  2. If I can think of a way to avoid the indirect calls, do you think this
>     series is ready for a v1? I'm not sure if there's anything major that
>     needs to be addressed aside from the indirect calls.

Nothing comes to mind, seems pretty straightforward to me.


* Re: [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path
  2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
                   ` (6 preceding siblings ...)
  2022-05-11 23:25 ` [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Jakub Kicinski
@ 2022-05-31  6:04 ` Christoph Hellwig
  7 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2022-05-31  6:04 UTC (permalink / raw)
  To: Joe Damato; +Cc: netdev, davem, kuba, linux-kernel, x86

From the iov_iter point of view: please follow the way the inatomic
nocache helpers are implemented instead of adding costly function
pointers.
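
As a rough userspace illustration of the two dispatch styles being
contrasted in this thread (not the kernel's iov_iter code; every name below
is hypothetical and both copy routines are plain memcpy stand-ins):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for a cached and a nocache copy routine (both hypothetical). */
static size_t copy_cached(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
    return n;
}

static size_t copy_nocache(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);  /* a real version would use non-temporal stores */
    return n;
}

/* Style A: the caller passes a function pointer; the call is indirect. */
typedef size_t (*copy_fn)(void *dst, const void *src, size_t n);

static size_t copy_indirect(copy_fn fn, void *dst, const void *src, size_t n)
{
    return fn(dst, src, n);
}

/* Style B: the caller passes a flag; both calls are direct, behind a branch. */
static size_t copy_branch(bool nocache, void *dst, const void *src, size_t n)
{
    return nocache ? copy_nocache(dst, src, n) : copy_cached(dst, src, n);
}
```

Style B keeps every call target visible to the compiler, so it can be
inlined and does not need an indirect-branch mitigation at the call site;
style A trades that for flexibility in which routine is plugged in.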



Thread overview: 13+ messages
2022-05-11  3:54 [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Joe Damato
2022-05-11  3:54 ` [RFC,net-next,x86 1/6] arch, x86, uaccess: Add nontemporal copy functions Joe Damato
2022-05-11  3:54 ` [RFC,net-next 2/6] iov_iter: Allow custom copyin function Joe Damato
2022-05-11  3:54 ` [RFC,net-next 3/6] iov_iter: Add a nocache copy iov iterator Joe Damato
2022-05-11  3:54 ` [RFC,net-next 4/6] net: Add a struct for managing copy functions Joe Damato
2022-05-11  3:54 ` [RFC,net-next 5/6] net: Add a way to copy skbs without affect cache Joe Damato
2022-05-11  3:54 ` [RFC,net-next 6/6] net: unix: Add MSG_NTCOPY Joe Damato
2022-05-11 23:25 ` [RFC,net-next,x86 0/6] Nontemporal copies in unix socket write path Jakub Kicinski
2022-05-12  1:01   ` Joe Damato
2022-05-12 19:46     ` Jakub Kicinski
2022-05-12 22:53       ` Joe Damato
2022-05-12 23:12         ` Jakub Kicinski
2022-05-31  6:04 ` Christoph Hellwig
