* [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
@ 2017-02-22 16:38 Willem de Bruijn
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

RFCv2:

I have received a few requests for status and rebased code of this
feature. We have been running this code internally, discovering and
fixing various bugs. With net-next closed, now seems like a good time
to share an updated patchset with fixes. The rebase from RFCv1/v4.2
was mostly straightforward: mainly iov_iter changes. Full changelog:

  RFC -> RFCv2:
    - review comment: do not loop skb with zerocopy frags onto rx:
          add skb_orphan_frags_rx to orphan even refcounted frags
	  call this in __netif_receive_skb_core, deliver_skb and tun:
	  the same as 1080e512d44d ("net: orphan frags on receive")
    - fix: hold an explicit sk reference on each notification skb.
          previously relied on the reference (or wmem) held by the
	  data skb that would trigger notification, but this breaks
	  on skb_orphan.
    - fix: when aborting a send, do not inc the zerocopy counter
          this caused gaps in the notification chain
    - fix: in packet with SOCK_DGRAM, pull ll headers before calling
          zerocopy_sg_from_iter
    - fix: if sock_zerocopy_realloc does not allow coalescing,
          do not fail, just allocate a new ubuf
    - fix: in tcp, check return value of second allocation attempt
    - chg: allocate notification skbs from optmem
          to avoid affecting tcp write queue accounting (TSQ)
    - chg: limit #locked pages (ulimit) per user instead of per process
    - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
    - chg: rebased to davem-net-next on top of v4.10-rc7
    - add: limit notification coalescing
          sharing ubufs limits overhead, but delays notification until
	  the last packet is released, possibly unbounded. Add a cap. 
    - tests: add snd_zerocopy_lo pf_packet test
    - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

The change to allocate notification skbuffs from optmem requires
ensuring that net.core.optmem_max is at least a few hundred KB. To
experiment, run

  sysctl -w net.core.optmem_max=1048576

The snd_zerocopy_lo benchmarks reported in the individual patches were
rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
replaced with skb_orphan_frags to allow looping to local sockets. The
netperf results below are also rerun with v2.

Under application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.


Overview (from original RFC):

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the second
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO) and
notification coalescing to handle corking.

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
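
For illustration only, a minimal caller could look roughly like the
sketch below. It is not part of this series: the MSG_ZEROCOPY and
SO_EE_ORIGIN_ZEROCOPY values are hardcoded from the patches, error
handling is omitted, and the busy-wait on the error queue stands in
for the poll()/epoll() loop a real application would use.

  #include <errno.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  #ifndef MSG_ZEROCOPY
  #define MSG_ZEROCOPY            0x4000000  /* from this series */
  #endif
  #ifndef SO_EE_ORIGIN_ZEROCOPY
  #define SO_EE_ORIGIN_ZEROCOPY   5          /* from this series */
  #endif

  static int send_zerocopy(int fd, const void *buf, size_t len)
  {
          struct sock_extended_err *serr;
          struct msghdr msg = {0};
          struct cmsghdr *cm;
          char control[128];

          if (send(fd, buf, len, MSG_ZEROCOPY) != (ssize_t)len)
                  return -1;

          /* the completion arrives on the error queue; such reads never block */
          msg.msg_control = control;
          msg.msg_controllen = sizeof(control);
          while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1) {
                  if (errno != EAGAIN)
                          return -1;
          }

          for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                  serr = (struct sock_extended_err *)CMSG_DATA(cm);
                  if (serr->ee_errno == 0 &&
                      serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                          return 0;  /* sends [ee_info, ee_data] completed:
                                      * buf may be modified again */
          }
          return -1;
  }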

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu 3, RPS and RFS
are disabled and the kernel is booted with idle=halt.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

        --process cycles--      ----cpu cycles----
           std      zc   %      std         zc   %
4K      27,609  11,217  41      49,217  39,175  79
16K     21,370   3,823  18      43,540  29,213  67
64K     20,557   2,312  11      42,189  26,910  64
256K    21,110   2,134  10      43,006  27,104  63
1M      20,987   1,610   8      42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313                                                   
 79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
  0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb

zc:
Samples: 1K of event 'cycles', Event count (approx.): 1439509124                                                     
 30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
 14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
  8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
  4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
  3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node


* Safety

The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bounded by the locked-memory ulimit (RLIMIT_MEMLOCK).
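
A process without CAP_IPC_LOCK that exceeds this bound will see its
zerocopy sends fail (-ENOBUFS in the TCP and UDP patches). A minimal
sketch for raising the limit beforehand, assuming the caller has
CAP_SYS_RESOURCE or a sufficiently high hard limit:

  #include <sys/resource.h>

  /* illustrative only: raise RLIMIT_MEMLOCK so that MSG_ZEROCOPY page
   * pinning is not rejected by the per-user bound added in this series
   */
  static int raise_memlock(rlim_t bytes)
  {
          struct rlimit rl = { .rlim_cur = bytes, .rlim_max = bytes };

          return setrlimit(RLIMIT_MEMLOCK, &rl);
  }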

While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.

Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx call.

Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.


* Limitations / Known Issues

- PF_INET6 is not yet supported.
- TCP does not build max GSO packets, especially for
     small send buffers (< 4 KB)

Willem de Bruijn (12):
  sock: allocate skbs from optmem
  sock: skb_copy_ubufs support for compound pages
  sock: add generic socket zerocopy
  sock: enable sendmsg zerocopy
  sock: sendmsg zerocopy notification coalescing
  sock: sendmsg zerocopy ulimit
  sock: sendmsg zerocopy limit bytes per notification
  tcp: enable sendmsg zerocopy
  udp: enable sendmsg zerocopy
  raw: enable sendmsg zerocopy with IP_HDRINCL
  packet: enable sendmsg zerocopy
  test: add sendmsg zerocopy tests

 drivers/net/tun.c                             |   2 +-
 drivers/vhost/net.c                           |   1 +
 include/linux/sched.h                         |   2 +-
 include/linux/skbuff.h                        |  94 +++-
 include/linux/socket.h                        |   1 +
 include/net/sock.h                            |   4 +
 include/uapi/linux/errqueue.h                 |   1 +
 net/core/datagram.c                           |  35 +-
 net/core/dev.c                                |   4 +-
 net/core/skbuff.c                             | 327 ++++++++++++--
 net/core/sock.c                               |  29 ++
 net/ipv4/ip_output.c                          |  34 +-
 net/ipv4/raw.c                                |  27 +-
 net/ipv4/tcp.c                                |  37 +-
 net/packet/af_packet.c                        |  52 ++-
 tools/testing/selftests/net/.gitignore        |   2 +
 tools/testing/selftests/net/Makefile          |   1 +
 tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
 19 files changed, 1536 insertions(+), 67 deletions(-)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 01/12] sock: allocate skbs from optmem
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add sock_omalloc and sock_ofree to allocate control skbs, for
instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.
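
A rough sketch of the intended call pattern, as used later in this
series for the zerocopy notification skbs:

  /* sketch: a small control skb charged to sk_omem_alloc, not wmem */
  static struct sk_buff *alloc_notify_skb(struct sock *sk)
  {
          struct sk_buff *skb;

          skb = sock_omalloc(sk, 0, GFP_KERNEL);
          if (!skb)
                  return NULL;

          /* fill skb->cb / SKB_EXT_ERR(skb) and eventually queue onto
           * sk->sk_error_queue; sock_ofree() uncharges sk_omem_alloc
           * when the skb is freed
           */
          return skb;
  }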

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 27 +++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9ccefa5c5487..c1a8b2cbc75e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1531,6 +1531,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index e7d74940e863..57a7da46ac52 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1772,6 +1772,33 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 }
 EXPORT_SYMBOL(sock_wmalloc);
 
+static void sock_ofree(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	atomic_sub(skb->truesize, &sk->sk_omem_alloc);
+}
+
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority)
+{
+	struct sk_buff *skb;
+
+	/* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
+	if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
+	    sysctl_optmem_max)
+		return NULL;
+
+	skb = alloc_skb(size, priority);
+	if (!skb)
+		return NULL;
+
+	atomic_add(skb->truesize, &sk->sk_omem_alloc);
+	skb->sk = sk;
+	skb->destructor = sock_ofree;
+	return skb;
+}
+
 /*
  * Allocate a memory block from the socket's option memory buffer.
  */
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
and UDP zerocopy sendmsg, such fragments may appear.

These skbuffs can have both kernel and zerocopy fragments, e.g., when
corking. Avoid unnecessary copying of fragments that have no userspace
reference.

It is not safe to modify skb frags when the skbuff is shared. This
should not happen. Fail loudly if we find an unexpected edge case.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3557958e9bf..67e4216fca01 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
  *	If this function is called from an interrupt gfp_mask() must be
  *	%GFP_ATOMIC.
  *
+ *	skb_shinfo(skb) can only be safely modified when not accessed
+ *	concurrently. Fail if the skb is shared or cloned.
+ *
  *	Returns 0 on success or a negative error code on failure
  *	to allocate kernel memory to copy to.
  */
@@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
+		unsigned int order = 0;
+		gfp_t mask = gfp_mask;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-		page = alloc_page(gfp_mask);
+		page = skb_frag_page(f);
+		if (page_count(page) == 1) {
+			skb_frag_ref(skb, i);
+			goto copy_done;
+		}
+
+		if (f->size > PAGE_SIZE) {
+			order = get_order(f->size);
+			mask |= __GFP_COMP;
+		}
+
+		page = alloc_pages(mask, order);
 		if (!page) {
 			while (head) {
 				struct page *next = (struct page *)page_private(head);
@@ -971,6 +992,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		memcpy(page_address(page),
 		       vaddr + f->page_offset, skb_frag_size(f));
 		kunmap_atomic(vaddr);
+copy_done:
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 03/12] sock: add generic socket zerocopy
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.
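
A protocol that adopts this interface (as the TCP and UDP patches
later in this series do) roughly follows the pattern below. This is a
sketch only: locking, error unwinding and the notification coalescing
added in a later patch are elided.

  /* sketch: expected use from a protocol sendmsg handler */
  static int zc_add_data(struct sock *sk, struct sk_buff *skb,
                         struct msghdr *msg, size_t size, int copy)
  {
          struct ubuf_info *uarg;
          int copied;

          uarg = sock_zerocopy_alloc(sk, size);
          if (!uarg)
                  return -ENOBUFS;
          sock_zerocopy_get(uarg);        /* the caller's reference */

          copied = skb_zerocopy_add_frags_iter(sk, skb, &msg->msg_iter,
                                               copy, uarg);
          /* on success the skb frags now reference user pages and the
           * skb took its own reference via skb_zcopy_set()
           */

          sock_zerocopy_put(uarg);        /* drop the caller's reference;
                                           * the final put, once the data
                                           * skb is released, runs
                                           * sock_zerocopy_callback() and
                                           * queues the notification
                                           */
          return copied;
  }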

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |  46 ++++++++++++++++
 include/linux/socket.h        |   1 +
 include/net/sock.h            |   2 +
 include/uapi/linux/errqueue.h |   1 +
 net/core/datagram.c           |  35 ++++++++----
 net/core/skbuff.c             | 120 ++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   2 +
 7 files changed, 196 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 69ccd2636911..c99538b258c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -390,6 +390,7 @@ enum {
 	SKBTX_SCHED_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
 #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)
@@ -406,8 +407,27 @@ struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
 	unsigned long desc;
+	atomic_t refcnt;
 };
 
+#define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+	atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size);
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+				struct iov_iter *iter, int len,
+				struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1230,6 +1250,32 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+	return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+	if (uarg) {
+		sock_zerocopy_get(uarg);
+		skb_shinfo(skb)->destructor_arg = uarg;
+		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+static inline void skb_zcopy_clear(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+
 /**
  *	skb_queue_empty - check if a queue is empty
  *	@list: queue head
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 082027457825..c2d6ec354bee 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 
+#define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/include/net/sock.h b/include/net/sock.h
index c1a8b2cbc75e..74ad7d7c5eed 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,6 +288,7 @@ struct sock_common {
   *	@sk_stamp: time stamp of last packet received
   *	@sk_tsflags: SO_TIMESTAMPING socket options
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
   *	@sk_socket: Identd and reporting IO signals
   *	@sk_user_data: RPC layer private data
   *	@sk_frag: cached page frag
@@ -455,6 +456,7 @@ struct sock {
 	u16			sk_tsflags;
 	u8			sk_shutdown;
 	u32			sk_tskey;
+	atomic_t		sk_zckey;
 	struct socket		*sk_socket;
 	void			*sk_user_data;
 #ifdef CONFIG_SECURITY
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1f444a..0f15a77c9e39 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -18,6 +18,7 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_ZEROCOPY	5
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ea633342ab0d..79db53c6bcf4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -564,17 +564,12 @@ EXPORT_SYMBOL(skb_copy_datagram_from_iter);
  *
  *	Returns 0, -EFAULT or -EMSGSIZE.
  */
-int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length)
 {
-	int len = iov_iter_count(from);
-	int copy = min_t(int, skb_headlen(skb), len);
-	int frag = 0;
-
-	/* copy up to skb headlen */
-	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
-		return -EFAULT;
+	int frag = skb_shinfo(skb)->nr_frags;
 
-	while (iov_iter_count(from)) {
+	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		size_t start;
 		ssize_t copied;
@@ -584,18 +579,24 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, ~0U,
+		copied = iov_iter_get_pages(from, pages, length,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
 
 		iov_iter_advance(from, copied);
+		length -= copied;
 
 		truesize = PAGE_ALIGN(copied + start);
 		skb->data_len += copied;
 		skb->len += copied;
 		skb->truesize += truesize;
-		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		if (sk && sk->sk_type == SOCK_STREAM) {
+			sk->sk_wmem_queued += truesize;
+			sk_mem_charge(sk, truesize);
+		} else {
+			atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		}
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			skb_fill_page_desc(skb, frag++, pages[n], start, size);
@@ -606,6 +607,18 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(__zerocopy_sg_from_iter);
+
+int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+{
+	int copy = min_t(int, skb_headlen(skb), iov_iter_count(from));
+
+	/* copy up to skb headlen */
+	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
+		return -EFAULT;
+
+	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+}
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 67e4216fca01..d566f85a7690 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -932,6 +932,126 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+/* must only be called from process context */
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
+{
+	struct sk_buff *skb;
+	struct ubuf_info *uarg;
+
+	skb = sock_omalloc(sk, 0, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
+	uarg = (void *)skb->cb;
+
+	uarg->callback = sock_zerocopy_callback;
+	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	atomic_set(&uarg->refcnt, 0);
+	sock_hold(sk);
+
+	return uarg;
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_alloc);
+
+static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
+{
+	return container_of((void *)uarg, struct sk_buff, cb);
+}
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
+{
+	struct sock_exterr_skb *serr;
+	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sock *sk = skb->sk;
+	u16 id = uarg->desc;
+
+	serr = SKB_EXT_ERR(skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
+	serr->ee.ee_data = id;
+
+	skb_queue_tail(&sk->sk_error_queue, skb);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_error_report(sk);
+
+	sock_put(sk);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
+
+void sock_zerocopy_put(struct ubuf_info *uarg)
+{
+	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		if (uarg->callback)
+			uarg->callback(uarg, true);
+		else
+			consume_skb(skb_from_uarg(uarg));
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size)
+{
+	struct ubuf_info *uarg;
+
+	uarg = sock_zerocopy_alloc(skb->sk, size);
+	if (!uarg)
+		return false;
+
+	skb_zcopy_set(skb, uarg);
+	return true;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_alloc);
+
+extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+				   struct iov_iter *from, size_t length);
+
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+				struct iov_iter *iter, int len,
+				struct ubuf_info *uarg)
+{
+	struct ubuf_info *orig_uarg = skb_zcopy(skb);
+	struct iov_iter orig_iter = *iter;
+	int ret, orig_len = skb->len;
+
+	if (orig_uarg && orig_uarg != uarg)
+		return -EEXIST;
+
+	ret = __zerocopy_sg_from_iter(sk, skb, iter, len);
+	if (ret && (ret != -EMSGSIZE || skb->len == orig_len)) {
+		*iter = orig_iter;
+		___pskb_trim(skb, orig_len);
+		return ret;
+	}
+
+	if (!orig_uarg)
+		skb_zcopy_set(skb, uarg);
+
+	return skb->len - orig_len;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_add_frags_iter);
+
+/* unused only until next patch in the series; will remove attribute */
+static int __attribute__((unused))
+	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+			      gfp_t gfp_mask)
+{
+	if (skb_zcopy(orig)) {
+		if (skb_zcopy(nskb)) {
+			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
+			BUG_ON(!gfp_mask);
+			if (skb_uarg(nskb) == skb_uarg(orig))
+				return 0;
+			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
+				return -EIO;
+		}
+		skb_zcopy_set(nskb, skb_uarg(orig));
+	}
+	return 0;
+}
+
 /**
  *	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
  *	@skb: the skb to modify
diff --git a/net/core/sock.c b/net/core/sock.c
index 57a7da46ac52..8f8203565ac4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1526,6 +1526,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		atomic_set(&newsk->sk_drops, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
+		atomic_set(&newsk->sk_zckey, 0);
 
 		sock_reset_flag(newsk, SOCK_DONE);
 		skb_queue_head_init(&newsk->sk_error_queue);
@@ -2524,6 +2525,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_sndtimeo		=	MAX_SCHEDULE_TIMEOUT;
 
 	sk->sk_stamp = ktime_set(-1L, 0);
+	atomic_set(&sk->sk_zckey, 0);
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id		=	0;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 04/12] sock: enable sendmsg zerocopy
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/tun.c      |  2 +-
 drivers/vhost/net.c    |  1 +
 include/linux/skbuff.h | 16 ++++++++++++++--
 net/core/dev.c         |  4 ++--
 net/core/skbuff.c      | 52 +++++++++++++++++++++-----------------------------
 5 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 30863e378925..b80c7fdcb05b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -880,7 +880,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	    sk_filter(tfile->socket.sk, skb))
 		goto drop;
 
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		goto drop;
 
 	skb_tx_timestamp(skb);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2fe35354f20e..f7ff72ed892f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -454,6 +454,7 @@ static void handle_tx(struct vhost_net *net)
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
+			atomic_set(&ubuf->refcnt, 1);
 			msg.msg_control = ubuf;
 			msg.msg_controllen = sizeof(ubuf);
 			ubufs = nvq->ubufs;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c99538b258c9..c7b42272b409 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2448,7 +2448,7 @@ static inline void skb_orphan(struct sk_buff *skb)
 }
 
 /**
- *	skb_orphan_frags - orphan the frags contained in a buffer
+ *	skb_orphan_frags - make a local copy of non-refcounted user frags
  *	@skb: buffer to orphan frags from
  *	@gfp_mask: allocation mask for replacement pages
  *
@@ -2458,7 +2458,17 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+	if (likely(!skb_zcopy(skb)))
+		return 0;
+	if (skb_uarg(skb)->callback == sock_zerocopy_callback)
+		return 0;
+	return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
+static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	if (likely(!skb_zcopy(skb)))
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2890,6 +2900,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
+	if (skb_zcopy(skb))
+		return false;
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9..7879225818da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1801,7 +1801,7 @@ static inline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
 {
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		return -ENOMEM;
 	atomic_inc(&skb->users);
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
@@ -4173,7 +4173,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	}
 
 	if (pt_prev) {
-		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 			goto drop;
 		else
 			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d566f85a7690..fcbdc91b2d24 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -592,21 +592,10 @@ static void skb_release_data(struct sk_buff *skb)
 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i]);
 
-	/*
-	 * If skb buf is from userspace, we need to notify the caller
-	 * the lower device DMA has done;
-	 */
-	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = shinfo->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, true);
-	}
-
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
+	skb_zcopy_clear(skb);
 	skb_free_head(skb);
 }
 
@@ -725,14 +714,7 @@ EXPORT_SYMBOL(kfree_skb_list);
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = skb_shinfo(skb)->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, false);
-		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
-	}
+	skb_zcopy_clear(skb);
 }
 EXPORT_SYMBOL(skb_tx_error);
 
@@ -1033,9 +1015,7 @@ int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_add_frags_iter);
 
-/* unused only until next patch in the series; will remove attribute */
-static int __attribute__((unused))
-	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
 	if (skb_zcopy(orig)) {
@@ -1044,6 +1024,8 @@ static int __attribute__((unused))
 			BUG_ON(!gfp_mask);
 			if (skb_uarg(nskb) == skb_uarg(orig))
 				return 0;
+			/* nskb is always new, writable, so copy ubufs is ok */
+			BUG_ON(skb_shared(nskb) || skb_cloned(nskb));
 			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
 				return -EIO;
 		}
@@ -1075,12 +1057,13 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
-	if (skb_shared(skb) || skb_cloned(skb)) {
+	if (skb_shared(skb)) {
 		WARN_ON_ONCE(1);
 		return -EINVAL;
 	}
+	if (skb_unclone(skb, gfp_mask))
+		return -EINVAL;
 
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
@@ -1121,8 +1104,6 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
 
-	uarg->callback(uarg, false);
-
 	/* skb frags point to kernel buffers */
 	for (i = num_frags - 1; i >= 0; i--) {
 		__skb_fill_page_desc(skb, i, head, 0,
@@ -1130,7 +1111,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		head = (struct page *)page_private(head);
 	}
 
-	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	skb_zcopy_clear(skb);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(skb_copy_ubufs);
@@ -1291,11 +1272,13 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 	if (skb_shinfo(skb)->nr_frags) {
 		int i;
 
-		if (skb_orphan_frags(skb, gfp_mask)) {
+		if (skb_orphan_frags(skb, gfp_mask) ||
+		    skb_zerocopy_clone(n, skb, gfp_mask)) {
 			kfree_skb(n);
 			n = NULL;
 			goto out;
 		}
+
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
 			skb_frag_ref(skb, i);
@@ -1368,9 +1351,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 * be since all we did is relocate the values
 	 */
 	if (skb_cloned(skb)) {
-		/* copy this zero copy skb frags */
 		if (skb_orphan_frags(skb, gfp_mask))
 			goto nofrags;
+		if (skb_zcopy(skb))
+			atomic_inc(&skb_uarg(skb)->refcnt);
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			skb_frag_ref(skb, i);
 
@@ -2466,6 +2450,7 @@ skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		skb_tx_error(from);
 		return -ENOMEM;
 	}
+	skb_zerocopy_clone(to, from, GFP_ATOMIC);
 
 	for (i = 0; i < skb_shinfo(from)->nr_frags; i++) {
 		if (!len)
@@ -2762,6 +2747,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
 	int pos = skb_headlen(skb);
 
 	skb_shinfo(skb1)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
+	skb_zerocopy_clone(skb1, skb, 0);
 	if (len < pos)	/* Split line is inside header. */
 		skb_split_inside_header(skb, skb1, len, pos);
 	else		/* Second chunk has no header, nothing to copy. */
@@ -2805,6 +2791,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	if (skb_headlen(skb))
 		return 0;
+	if (skb_zcopy(tgt) || skb_zcopy(skb))
+		return 0;
 
 	todo = shiftlen;
 	from = 0;
@@ -3368,6 +3356,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags &
 			SKBTX_SHARED_FRAG;
+		if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
+			goto err;
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
@@ -4446,6 +4436,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 	if (skb_has_frag_list(to) || skb_has_frag_list(from))
 		return false;
+	if (skb_zcopy(to) || skb_zcopy(from))
+		return false;
 
 	if (skb_headlen(from) != 0) {
 		struct page *page;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 05/12] sock: sendmsg zerocopy notification coalescing
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause sendmsg() calls to append to a single
sk_buff and ubuf_info. Modify the notification path to return an
inclusive range of notifications [N..N+m].

Add skb_zerocopy_realloc() to reuse ubuf_info across sendmsg() calls
and modify the notification path to return a range.

For the case of reliable ordered transmission (TCP), only the upper
value of the range needs to be read, as the lower value is guaranteed
to be one above the last read notification.

Additionally, coalesce notifications in this common case: if an
skb_uarg [1, 1] is queued while [0, 0] is already on the queue,
just modify the head of the queue to read [0, 1].
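
On the receive side, the notification reports the completed range in
the ee_info (lower) and ee_data (upper) fields of the extended error.
The counter is an unsigned 32-bit value that can wrap, so a reader can
compute the number of completed sends roughly as in this illustrative
sketch:

  #include <linux/errqueue.h>
  #include <linux/types.h>

  /* illustrative: sendmsg calls covered by one zerocopy notification */
  static __u32 zerocopy_notify_count(const struct sock_extended_err *serr)
  {
          __u32 lo = serr->ee_info;
          __u32 hi = serr->ee_data;

          return hi - lo + 1;     /* unsigned arithmetic handles wraparound */
  }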

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 21 +++++++++++-
 net/core/skbuff.c      | 92 +++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7b42272b409..eedac9fd3f0f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -406,13 +406,21 @@ enum {
 struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
-	unsigned long desc;
+	union {
+		unsigned long desc;
+		struct {
+			u32 id;
+			u16 len;
+		};
+	};
 	atomic_t refcnt;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg);
 
 static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 {
@@ -420,6 +428,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1276,6 +1285,16 @@ static inline void skb_zcopy_clear(struct sk_buff *skb)
 	}
 }
 
+static inline void skb_zcopy_abort(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put_abort(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+
 /**
  *	skb_queue_empty - check if a queue is empty
  *	@list: queue head
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fcbdc91b2d24..7a1d6e7703a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -928,7 +928,8 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg = (void *)skb->cb;
 
 	uarg->callback = sock_zerocopy_callback;
-	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+	uarg->len = 1;
 	atomic_set(&uarg->refcnt, 0);
 	sock_hold(sk);
 
@@ -941,24 +942,94 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
 	return container_of((void *)uarg, struct sk_buff, cb);
 }
 
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg)
+{
+	if (uarg) {
+		u32 next;
+
+		/* realloc only when socket is locked (TCP, UDP cork),
+		 * so uarg->len and sk_zckey access is serialized
+		 */
+		BUG_ON(!sock_owned_by_user(sk));
+
+		if (unlikely(uarg->len == USHRT_MAX - 1))
+			return NULL;
+
+		next = (u32)atomic_read(&sk->sk_zckey);
+		if ((u32)(uarg->id + uarg->len) == next) {
+			uarg->len++;
+			atomic_set(&sk->sk_zckey, ++next);
+			return uarg;
+		}
+	}
+
+	return sock_zerocopy_alloc(sk, size);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
+
+static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	s64 sum_len;
+	u32 old_lo, old_hi;
+
+	old_lo = serr->ee.ee_info;
+	old_hi = serr->ee.ee_data;
+	sum_len = old_hi - old_lo + 1 + len;
+	if (old_hi < old_lo)
+		sum_len += (1ULL << 32);
+
+	if (sum_len >= (1ULL << 32))
+		return false;
+
+	if (lo != old_hi + 1)
+		return false;
+
+	serr->ee.ee_data += len;
+	return true;
+}
+
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 {
 	struct sock_exterr_skb *serr;
-	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sk_buff *head, *skb = skb_from_uarg(uarg);
 	struct sock *sk = skb->sk;
-	u16 id = uarg->desc;
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	unsigned long flags;
+	u32 lo, hi;
+	u16 len;
+
+	/* if !len, there was only 1 call, and it was aborted
+	 * so do not queue a completion notification
+	 */
+	if (!uarg->len)
+		goto free;
+
+	len = uarg->len;
+	lo = uarg->id;
+	hi = uarg->id + len - 1;
 
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = 0;
 	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
-	serr->ee.ee_data = id;
+	serr->ee.ee_data = hi;
+	serr->ee.ee_info = lo;
 
-	skb_queue_tail(&sk->sk_error_queue, skb);
+	spin_lock_irqsave(&q->lock, flags);
+	head = skb_peek(q);
+	if (!head || !skb_zerocopy_notify_extend(head, lo, len)) {
+		__skb_queue_tail(q, skb);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_error_report(sk);
 
+free:
+	consume_skb(skb);
 	sock_put(sk);
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
@@ -974,6 +1045,17 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
+/* only called when sendmsg returns with error; no notification for this call */
+void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+{
+	if (uarg) {
+		uarg->len--;
+		atomic_dec(&skb_from_uarg(uarg)->sk->sk_zckey);
+		sock_zerocopy_put(uarg);
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
+
 bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size)
 {
 	struct ubuf_info *uarg;
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 06/12] sock: sendmsg zerocopy ulimit
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6b0 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/sched.h  |  2 +-
 include/linux/skbuff.h |  5 +++++
 net/core/skbuff.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9ec61f7..943714f8e91a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -905,7 +905,7 @@ struct user_struct {
 	struct hlist_node uidhash_node;
 	kuid_t uid;
 
-#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_NET)
 	atomic_long_t locked_vm;
 #endif
 };
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eedac9fd3f0f..a38308b10d76 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -414,6 +414,11 @@ struct ubuf_info {
 		};
 	};
 	atomic_t refcnt;
+
+	struct mmpin {
+		struct user_struct *user;
+		int num_pg;
+	} mmp;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7a1d6e7703a6..b86e196d6dec 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -914,6 +914,44 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+{
+	unsigned long max_pg, num_pg, new_pg, old_pg;
+	struct user_struct *user;
+
+	if (capable(CAP_IPC_LOCK) || !size)
+		return 0;
+
+	num_pg = (size >> PAGE_SHIFT) + 2;	/* worst case */
+	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	user = mmp->user ? : current_user();
+
+	do {
+		old_pg = atomic_long_read(&user->locked_vm);
+		new_pg = old_pg + num_pg;
+		if (new_pg > max_pg)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
+		 old_pg);
+
+	if (!mmp->user) {
+		mmp->user = get_uid(user);
+		mmp->num_pg = num_pg;
+	} else {
+		mmp->num_pg += num_pg;
+	}
+
+	return 0;
+}
+
+static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+{
+	if (mmp->user) {
+		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+		free_uid(mmp->user);
+	}
+}
+
 /* must only be called from process context */
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
@@ -926,6 +964,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 
 	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
 	uarg = (void *)skb->cb;
+	uarg->mmp.user = NULL;
+
+	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+		kfree_skb(skb);
+		return NULL;
+	}
 
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
@@ -958,6 +1002,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg->id + uarg->len) == next) {
+			if (mm_account_pinned_pages(&uarg->mmp, size))
+				return NULL;
 			uarg->len++;
 			atomic_set(&sk->sk_zckey, ++next);
 			return uarg;
@@ -1037,6 +1083,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
 	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		mm_unaccount_pinned_pages(&uarg->mmp);
+
 		if (uarg->callback)
 			uarg->callback(uarg, true);
 		else
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 07/12] sock: sendmsg zerocopy limit bytes per notification
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Zerocopy can coalesce notifications of up to 65535 send calls.
Excessive coalescing increases notification latency and process
working set size.

Experiments showed trains of 75 syscalls holding around 8 MB of data
per notification. On servers with many slower clients, this causes
many GB of user data waiting for acknowledgment and many seconds of
latency between send and notification reception.

Introduce a notification byte limit.

Implementation notes:
- Due to space constraints in struct ubuf_info, the internal
  calculation is approximate, in kilobytes, and capped to 64 MB. With
  the 512 KB limit used below, for example, a train of 64 KB sends
  starts a new notification chain after eight calls.

- The field is accessed only on initial allocation of ubuf_info, when
  the struct is private, or under the tcp lock.

- When breaking a chain, we create a new notification structure uarg.
  A chain can be broken in the middle of a large sendmsg. Each skbuff
  can only point to a single uarg, so skb_zerocopy_add_frags_iter will
  fail after breaking a chain. The (next) TCP patch is changed in v2
  to detect failure (EEXIST) and jump to new_segment to create a new
  skbuff that can point to the new uarg. As a result, packetization of
  the bytestream may differ from a send without zerocopy.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a38308b10d76..6ad1724ceb60 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -411,6 +411,7 @@ struct ubuf_info {
 		struct {
 			u32 id;
 			u16 len;
+			u16 kbytelen;
 		};
 	};
 	atomic_t refcnt;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b86e196d6dec..6a07a20a91ed 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -974,6 +974,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
 	uarg->len = 1;
+	uarg->kbytelen = min_t(size_t, DIV_ROUND_UP(size, 1024u), USHRT_MAX);
 	atomic_set(&uarg->refcnt, 0);
 	sock_hold(sk);
 
@@ -990,6 +991,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg)
 {
 	if (uarg) {
+		const size_t limit_kb = 512;	/* consider a sysctl */
+		size_t kbytelen;
 		u32 next;
 
 		/* realloc only when socket is locked (TCP, UDP cork),
@@ -997,8 +1000,13 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 		 */
 		BUG_ON(!sock_owned_by_user(sk));
 
+		kbytelen = uarg->kbytelen + DIV_ROUND_UP(size, 1024u);
+		if (unlikely(kbytelen > limit_kb))
+			goto new_alloc;
+		uarg->kbytelen = kbytelen;
+
 		if (unlikely(uarg->len == USHRT_MAX - 1))
-			return NULL;
+			goto new_alloc;
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg->id + uarg->len) == next) {
@@ -1010,6 +1018,7 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 		}
 	}
 
+new_alloc:
 	return sock_zerocopy_alloc(sk, size);
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable MSG_ZEROCOPY support in the TCP stack. Data sent to a remote
host is transmitted without an intermediate copy. TSO and GSO are
supported.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  loopback test snd_zerocopy_lo -t -z produced:

  without zerocopy (-t):
    rx=102852 (6418 MB) tx=102852 txc=0
    rx=213216 (13305 MB) tx=213216 txc=0
    rx=325266 (20298 MB) tx=325266 txc=0
    rx=437082 (27275 MB) tx=437082 txc=0

  with zerocopy (-t -z):
    rx=238446 (14880 MB) tx=238446 txc=238434
    rx=500076 (31207 MB) tx=500076 txc=500060
    rx=763728 (47660 MB) tx=763728 txc=763706
    rx=1028184 (64163 MB) tx=1028184 txc=1028156

  This test opens a pair of local sockets; on one it calls sendmsg with
  64KB and optionally MSG_ZEROCOPY, and on the other it reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index da385ae997a3..4884f4ff14d2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1051,13 +1051,17 @@ static int linear_payload_sz(bool first_skb)
 	return 0;
 }
 
-static int select_size(const struct sock *sk, bool sg, bool first_skb)
+static int select_size(const struct sock *sk, bool sg, bool first_skb,
+		       bool zerocopy)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	int tmp = tp->mss_cache;
 
 	if (sg) {
 		if (sk_can_gso(sk)) {
+			if (zerocopy)
+				return 0;
+
 			tmp = linear_payload_sz(first_skb);
 		} else {
 			int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
@@ -1121,6 +1125,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	struct sockcm_cookie sockc;
+	struct ubuf_info *uarg = NULL;
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool process_backlog = false;
@@ -1190,6 +1195,21 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (sg && (flags & MSG_ZEROCOPY) && size && !uarg) {
+		skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+		if (!uarg) {
+			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
+				goto out_err;
+			uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			if (!uarg) {
+				err = -ENOBUFS;
+				goto out_err;
+			}
+		}
+		sock_zerocopy_get(uarg);
+	}
+
 	while (msg_data_left(msg)) {
 		int copy = 0;
 		int max = size_goal;
@@ -1217,7 +1237,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			}
 			first_skb = skb_queue_empty(&sk->sk_write_queue);
 			skb = sk_stream_alloc_skb(sk,
-						  select_size(sk, sg, first_skb),
+						  select_size(sk, sg, first_skb, uarg),
 						  sk->sk_allocation,
 						  first_skb);
 			if (!skb)
@@ -1253,7 +1273,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 			if (err)
 				goto do_fault;
-		} else {
+		} else if (!uarg) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1291,6 +1311,15 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
+		} else {
+			err = skb_zerocopy_add_frags_iter(sk, skb,
+							  &msg->msg_iter,
+							  copy, uarg);
+			if (err == -EMSGSIZE || err == -EEXIST)
+				goto new_segment;
+			if (err < 0)
+				goto do_error;
+			copy = err;
 		}
 
 		if (!copied)
@@ -1337,6 +1366,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
 	}
 out_nopush:
+	sock_zerocopy_put(uarg);
 	release_sock(sk);
 	return copied + copied_syn;
 
@@ -1354,6 +1384,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	if (copied + copied_syn)
 		goto out;
 out_err:
+	sock_zerocopy_put_abort(uarg);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 09/12] udp: enable sendmsg zerocopy
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (7 preceding siblings ...)
  2017-02-22 16:38 ` [PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy Willem de Bruijn
@ 2017-02-22 16:38 ` Willem de Bruijn
  2017-02-22 16:38 ` [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL Willem de Bruijn
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add MSG_ZEROCOPY support to inet/dgram. This includes udplite.

Tested:
  loopback test snd_zerocopy_lo -u -z produces

  without zerocopy (-u):
    rx=173940 (10854 MB) tx=173940 txc=0
    rx=367026 (22904 MB) tx=367026 txc=0
    rx=564078 (35201 MB) tx=564078 txc=0
    rx=756588 (47214 MB) tx=756588 txc=0

  with zerocopy (-u -z):
    rx=377994 (23588 MB) tx=377994 txc=377980
    rx=792654 (49465 MB) tx=792654 txc=792632
    rx=1209582 (75483 MB) tx=1209582 txc=1209552
    rx=1628376 (101618 MB) tx=1628376 txc=1628338

  loopback test currently fails with corking, due to
  CHECKSUM_PARTIAL being disabled with UDP_CORK after commit
  d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")

  I will propose allowing it on NETIF_F_LOOPBACK devices.
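
  For reference, the corked runs mix copied and zerocopy sends into a
  single datagram roughly as follows (a sketch based on
  snd_zerocopy_lo; error handling omitted):

    int on = 1, off = 0;

    setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));
    send(fd, buf, len, MSG_DONTWAIT | MSG_ZEROCOPY);  /* zerocopy frags */
    send(fd, buf, len, MSG_DONTWAIT);                 /* copied frags */
    setsockopt(fd, IPPROTO_UDP, UDP_CORK, &off, sizeof(off));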

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  5 +++++
 net/ipv4/ip_output.c   | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6ad1724ceb60..9e7386f3f7a8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -424,6 +424,11 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+#define sock_can_zerocopy(sk, rt, csummode) \
+	((rt->dst.dev->features & NETIF_F_SG) && \
+	 ((sk->sk_type == SOCK_RAW) || \
+	  (sk->sk_type == SOCK_DGRAM && csummode & CHECKSUM_UNNECESSARY)))
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 737ce826d7ec..9e0110d8a429 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -919,7 +919,7 @@ static int __ip_append_data(struct sock *sk,
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
-
+	struct ubuf_info *uarg = NULL;
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -963,9 +963,16 @@ static int __ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length &&
+	    sock_can_zerocopy(sk, rt, skb ? skb->ip_summed : csummode)) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+	}
+
 	cork->length += length;
 	if ((((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
-	    (sk->sk_protocol == IPPROTO_UDP) &&
+	    (sk->sk_protocol == IPPROTO_UDP) && !uarg &&
 	    (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
 	    (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
 		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
@@ -1017,6 +1024,8 @@ static int __ip_append_data(struct sock *sk,
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
+			else if (uarg)
+				alloclen = min_t(int, fraglen, MAX_HEADER);
 			else
 				alloclen = fraglen;
 
@@ -1059,11 +1068,12 @@ static int __ip_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
 			 */
-			data = skb_put(skb, fraglen + exthdrlen);
+			data = skb_put(skb, alloclen);
 			skb_set_network_header(skb, exthdrlen);
 			skb->transport_header = (skb->network_header +
 						 fragheaderlen);
@@ -1079,7 +1089,9 @@ static int __ip_append_data(struct sock *sk,
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = min(datalen,
+				   alloclen - exthdrlen - fragheaderlen);
+			copy -= transhdrlen - fraggap;
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
@@ -1087,7 +1099,7 @@ static int __ip_append_data(struct sock *sk,
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
@@ -1115,6 +1127,17 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
+		} else if (uarg) {
+			struct iov_iter *iter;
+
+			if (sk->sk_type == SOCK_RAW)
+				iter = &((struct msghdr **)from)[0]->msg_iter;
+			else
+				iter = &((struct msghdr *)from)->msg_iter;
+			err = skb_zerocopy_add_frags_iter(sk, skb, iter, copy, uarg);
+			if (err < 0)
+				goto error;
+			copy = err;
 		} else {
 			int i = skb_shinfo(skb)->nr_frags;
 
@@ -1155,6 +1178,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (8 preceding siblings ...)
  2017-02-22 16:38 ` [PATCH RFC v2 09/12] udp: " Willem de Bruijn
@ 2017-02-22 16:38 ` Willem de Bruijn
  2017-02-22 16:39 ` [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy Willem de Bruijn
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:38 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Tested:
  raw loopback test snd_zerocopy_lo -r -z produces:

  without zerocopy (-r):
    rx=97632 (6092 MB) tx=97632 txc=0
    rx=208194 (12992 MB) tx=208194 txc=0
    rx=318714 (19889 MB) tx=318714 txc=0
    rx=429126 (26779 MB) tx=429126 txc=0

  with zerocopy (-r -z):
    rx=326160 (20353 MB) tx=326160 txc=326144
    rx=689244 (43012 MB) tx=689244 txc=689220
    rx=1049352 (65484 MB) tx=1049352 txc=1049320
    rx=1408782 (87914 MB) tx=1408782 txc=1408744

  raw hdrincl loopback test snd_zerocopy_lo -R -z produces:

  without zerocopy (-R):
    rx=167328 (10442 MB) tx=167328 txc=0
    rx=354942 (22150 MB) tx=354942 txc=0
    rx=542400 (33848 MB) tx=542400 txc=0
    rx=716442 (44709 MB) tx=716442 txc=0

  with zerocopy (-R -z):
    rx=340116 (21224 MB) tx=340116 txc=340102
    rx=712746 (44478 MB) tx=712746 txc=712726
    rx=1083732 (67629 MB) tx=1083732 txc=1083704
    rx=1457856 (90976 MB) tx=1457856 txc=1457820
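
  For reference, the hdrincl case passes the IP header and payload in a
  single call roughly as below (a sketch based on snd_zerocopy_lo, with
  fd a connected socket(PF_INET, SOCK_RAW, IPPROTO_RAW)). With
  MSG_ZEROCOPY only the first MAX_HEADER bytes are copied into the skb
  linear area; the remainder is attached as zerocopy frags:

    struct iphdr iph;
    struct iovec iov[2] = {
        { .iov_base = &iph, .iov_len = sizeof(iph) },
        { .iov_base = buf,  .iov_len = payload_len },
    };
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };

    setup_iph(&iph, payload_len);    /* as in snd_zerocopy_lo */
    sendmsg(fd, &msg, MSG_DONTWAIT | MSG_ZEROCOPY);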

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/raw.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8119e1f66e03..d21279b2f69e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,7 +351,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	unsigned int iphlen;
 	int err;
 	struct rtable *rt = *rtp;
-	int hlen, tlen;
+	int hlen, tlen, linear;
 
 	if (length > rt->dst.dev->mtu) {
 		ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
@@ -363,8 +363,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	hlen = LL_RESERVED_SPACE(rt->dst.dev);
 	tlen = rt->dst.dev->needed_tailroom;
+	linear = length;
+
+	if (flags & MSG_ZEROCOPY && length &&
+	    sock_can_zerocopy(sk, rt, CHECKSUM_UNNECESSARY))
+		linear = min_t(int, length, MAX_HEADER);
+
 	skb = sock_alloc_send_skb(sk,
-				  length + hlen + tlen + 15,
+				  linear + hlen + tlen + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (!skb)
 		goto error;
@@ -377,7 +383,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
-	skb_put(skb, length);
+	skb_put(skb, linear);
 
 	skb->ip_summed = CHECKSUM_NONE;
 
@@ -388,7 +394,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb->transport_header = skb->network_header;
 	err = -EFAULT;
-	if (memcpy_from_msg(iph, msg, length))
+	if (memcpy_from_msg(iph, msg, linear))
 		goto error_free;
 
 	iphlen = iph->ihl * 4;
@@ -404,6 +410,17 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	if (iphlen > length)
 		goto error_free;
 
+	if (length != linear) {
+		size_t datalen = length - linear;
+
+		if (!skb_zerocopy_alloc(skb, datalen))
+			goto error_zcopy;
+		err = skb_zerocopy_add_frags_iter(sk, skb, &msg->msg_iter,
+						  datalen, skb_uarg(skb));
+		if (err != datalen)
+			goto error_zcopy;
+	}
+
 	if (iphlen >= sizeof(*iph)) {
 		if (!iph->saddr)
 			iph->saddr = fl4->saddr;
@@ -430,6 +447,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 out:
 	return 0;
 
+error_zcopy:
+	sock_zerocopy_put_abort(skb_zcopy(skb));
 error_free:
 	kfree_skb(skb);
 error:
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (9 preceding siblings ...)
  2017-02-22 16:38 ` [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL Willem de Bruijn
@ 2017-02-22 16:39 ` Willem de Bruijn
  2017-02-22 16:39 ` [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests Willem de Bruijn
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:39 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Support MSG_ZEROCOPY on PF_PACKET transmission.

Tested:
  pf_packet loopback test snd_zerocopy_lo -p -z produces:

  without zerocopy (-p):
    rx=0 (0 MB) tx=221696 txc=0
    rx=0 (0 MB) tx=443880 txc=0
    rx=0 (0 MB) tx=661056 txc=0
    rx=0 (0 MB) tx=877152 txc=0

  with zerocopy (-p -z):
    rx=0 (0 MB) tx=528548 txc=528544
    rx=0 (0 MB) tx=1052364 txc=1052360
    rx=0 (0 MB) tx=1571956 txc=1571952
    rx=0 (0 MB) tx=2094144 txc=2094140

  Packets do not arrive at the Rx socket because they fail the martian
  destination check:

    IPv4: martian destination 127.0.0.1 from 127.0.0.1, dev lo

  I'll need to revise snd_zerocopy_lo to bypass that.
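
  For reference, the test injects at the link layer roughly as below
  (mirroring snd_zerocopy_lo; the Rx side is a separate PF_INET raw
  socket):

    struct sockaddr_ll laddr = {0};

    laddr.sll_family   = AF_PACKET;
    laddr.sll_ifindex  = 1;              /* lo */
    laddr.sll_protocol = htons(ETH_P_IP);
    laddr.sll_halen    = ETH_ALEN;

    msg.msg_name    = &laddr;
    msg.msg_namelen = sizeof(laddr);
    /* iov[0]: struct iphdr, iov[1]: payload */
    sendmsg(fd, &msg, MSG_DONTWAIT | MSG_ZEROCOPY);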

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 52 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2bd0d1949312..af9ecc1edf72 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2754,28 +2754,55 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 				        size_t reserve, size_t len,
-				        size_t linear, int noblock,
+					size_t linear, int flags,
 				        int *err)
 {
 	struct sk_buff *skb;
+	size_t data_len;
 
-	/* Under a page?  Don't bother with paged skb. */
-	if (prepad + len < PAGE_SIZE || !linear)
-		linear = len;
+	if (flags & MSG_ZEROCOPY) {
+		/* Minimize linear, but respect header lower bound */
+		linear = reserve + min(len, max_t(size_t, linear, MAX_HEADER));
+		data_len = 0;
+	} else {
+		/* Under a page? Don't bother with paged skb. */
+		if (prepad + len < PAGE_SIZE || !linear)
+			linear = len;
+		data_len = len - linear;
+	}
 
-	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-				   err, 0);
+	skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
+				   flags & MSG_DONTWAIT, err, 0);
 	if (!skb)
 		return NULL;
 
 	skb_reserve(skb, reserve);
 	skb_put(skb, linear);
-	skb->data_len = len - linear;
-	skb->len += len - linear;
+	skb->data_len = data_len;
+	skb->len += data_len;
 
 	return skb;
 }
 
+static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
+					 struct msghdr *msg,
+					 int offset, size_t size)
+{
+	int ret;
+
+	/* if SOCK_DGRAM, head room was alloc'ed and holds ll-headers */
+	__skb_pull(skb, offset);
+	ret = zerocopy_sg_from_iter(skb, &msg->msg_iter);
+	__skb_push(skb, offset);
+	if (unlikely(ret))
+		return ret == -EMSGSIZE ? ret : -EIO;
+
+	if (!skb_zerocopy_alloc(skb, size))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -2853,7 +2880,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len);
 	linear = max(linear, min_t(int, len, dev->hard_header_len));
 	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear,
-			       msg->msg_flags & MSG_DONTWAIT, &err);
+			       msg->msg_flags, &err);
 	if (skb == NULL)
 		goto out_unlock;
 
@@ -2867,7 +2894,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	/* Returns -EFAULT on error */
-	err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);
+	if (msg->msg_flags & MSG_ZEROCOPY)
+		err = packet_zerocopy_sg_from_iovec(skb, msg, offset, len);
+	else
+		err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter,
+						  len);
 	if (err)
 		goto out_free;
 
@@ -2913,6 +2944,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	return len;
 
 out_free:
+	skb_zcopy_abort(skb);
 	kfree_skb(skb);
 out_unlock:
 	if (dev)
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (10 preceding siblings ...)
  2017-02-22 16:39 ` [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy Willem de Bruijn
@ 2017-02-22 16:39 ` Willem de Bruijn
  2017-02-23 15:45 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY David Miller
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-22 16:39 UTC (permalink / raw)
  To: netdev; +Cc: Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Introduce the tests used to verify MSG_ZEROCOPY behavior:

snd_zerocopy:
  send zerocopy fragments out over the default route.

snd_zerocopy_lo:
  send data between a pair of local sockets and report throughput.

These tests are not suitable for inclusion in tools/testing/selftests
as is, as they do not return a pass/fail verdict. They are included in
this RFC for demonstration only.
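
Example invocations (-z enables MSG_ZEROCOPY; see the individual
patches for the corresponding results):

  ./snd_zerocopy_lo -t          # tcp over loopback
  ./snd_zerocopy_lo -t -z       # tcp over loopback, zerocopy
  ./snd_zerocopy_lo -u -z       # udp
  ./snd_zerocopy_lo -U -z       # udp with corking (currently fails, see udp patch)
  ./snd_zerocopy_lo -r -z       # raw
  ./snd_zerocopy_lo -R -z       # raw with IP_HDRINCL
  ./snd_zerocopy_lo -p -z       # pf_packet

  ./snd_zerocopy -H $host -z    # out the default route to $host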

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/.gitignore        |   2 +
 tools/testing/selftests/net/Makefile          |   1 +
 tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
 4 files changed, 953 insertions(+)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index afe109e5508a..7dfb030f0c9b 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -5,3 +5,5 @@ reuseport_bpf
 reuseport_bpf_cpu
 reuseport_bpf_numa
 reuseport_dualstack
+snd_zerocopy
+snd_zerocopy_lo
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index e24e4c82542e..aa663c791f7a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ NET_PROGS =  socket
 NET_PROGS += psock_fanout psock_tpacket
 NET_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 NET_PROGS += reuseport_dualstack
+NET_PROGS += snd_zerocopy snd_zerocopy_lo
 
 all: $(NET_PROGS)
 reuseport_bpf_numa: LDFLAGS += -lnuma
diff --git a/tools/testing/selftests/net/snd_zerocopy.c b/tools/testing/selftests/net/snd_zerocopy.c
new file mode 100644
index 000000000000..052d0d14e62d
--- /dev/null
+++ b/tools/testing/selftests/net/snd_zerocopy.c
@@ -0,0 +1,354 @@
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <limits.h>
+#include <linux/errqueue.h>
+#include <poll.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#define MSG_ZEROCOPY	0x4000000
+
+#define SK_FUDGE_FACTOR	2		/* allow for overhead in SNDBUF */
+#define BUFLEN		(400 * 1000)	/* max length of send call */
+#define DEST_PORT	9000
+
+uint32_t sent = UINT32_MAX, acked = UINT32_MAX;
+
+int cfg_batch_notify = 10;
+int cfg_num_runs = 16;
+size_t cfg_socksize = 1 << 20;
+int cfg_stress_sec;
+int cfg_verbose;
+bool cfg_zerocopy;
+
+static unsigned long gettime_now_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void do_set_socksize(int fd)
+{
+	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUFFORCE,
+		       &cfg_socksize, sizeof(cfg_socksize)))
+		error(1, 0, "setsockopt sndbufforce");
+
+	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE,
+		       &cfg_socksize, sizeof(cfg_socksize)))
+		error(1, 0, "setsockopt rcvbufforce");
+}
+
+static bool do_read_notification(int fd)
+{
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+	struct msghdr msg = {};
+	char control[100];
+	int64_t hi, lo;
+	int ret;
+
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
+	ret = recvmsg(fd, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
+	if (ret == -1 && errno == EAGAIN)
+		return false;
+	if (ret == -1)
+		error(1, errno, "recvmsg notification");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, errno, "recvmsg notification: truncated");
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!cm || cm->cmsg_level != SOL_IP ||
+	    (cm->cmsg_type != IP_RECVERR && cm->cmsg_type != IPV6_RECVERR))
+		error(1, 0, "cmsg: wrong type");
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr: wrong type");
+
+	hi = serr->ee_data;
+	lo = serr->ee_info;
+	if (lo != (uint32_t) (acked + 1))
+		error(1, 0, "notify: %lu..%lu, expected %u\n",
+		      lo, hi, acked + 1);
+	acked = hi;
+
+	if (cfg_verbose)
+		fprintf(stderr, "completed: %lu..%lu\n", lo, hi);
+
+	return true;
+}
+
+static void do_poll(int fd, int events, int timeout)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.fd = fd;
+	pfd.events = events;
+	pfd.revents = 0;
+
+	ret = poll(&pfd, 1, timeout);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret != 1)
+		error(1, 0, "poll timeout. events=0x%x acked=%u sent=%u",
+		      pfd.events, acked, sent);
+
+	if (cfg_verbose >= 2)
+		fprintf(stderr, "poll ok. events=0x%x revents=0x%x\n",
+			pfd.events, pfd.revents);
+}
+
+static void do_send(int fd, int len, int flags)
+{
+	static char data[BUFLEN];
+	struct msghdr msg = {};
+	struct iovec iov = {};
+	int ret;
+
+	if (len > BUFLEN)
+		error(1, 0, "write out of bounds");
+
+	iov.iov_base = data;
+	iov.iov_len = len;
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+
+	ret = sendmsg(fd, &msg, flags);
+	if (ret == -1)
+		error(1, errno, "sendmsg");
+	if (ret != len)
+		error(1, errno, "sendmsg: %u < %u", ret, len);
+
+	if (cfg_verbose >= 2)
+		fprintf(stderr, "  sent %6u B\n", len);
+
+	if (flags & MSG_ZEROCOPY && len) {
+		sent++;
+		if (cfg_verbose)
+			fprintf(stderr, "    add %u\n", sent);
+		do_read_notification(fd);
+	}
+}
+
+/* wait for all outstanding notifications to arrive */
+static void wait_for_notifications(int fd)
+{
+	unsigned long tstop, tnow;
+
+	if (acked == sent)
+		return;
+
+	tnow = gettime_now_ms();
+	tstop = tnow + 10000;
+	do {
+		do_poll(fd, 0 /* POLLERR is always reported */, tstop - tnow);
+
+		while (do_read_notification(fd)) {}
+		if (acked == sent)
+			return;
+
+		tnow = gettime_now_ms();
+	} while (tnow < tstop);
+
+	error(1, 0, "notify timeout. acked=%u sent=%u", acked, sent);
+}
+
+static void run_test(int fd, int len_cp, int len_zc, int batch)
+{
+	int i;
+
+	fprintf(stderr, "\ncp=%u zc=%u batch=%u\n", len_cp, len_zc, batch);
+
+	if (acked != sent)
+		error(1, 0, "not empty when expected");
+
+	if (batch * BUFLEN * SK_FUDGE_FACTOR > cfg_socksize) {
+		batch = cfg_socksize / BUFLEN / SK_FUDGE_FACTOR;
+		if (!batch)
+			error(1, 0, "cannot batch: increase socksize ('-s')");
+	}
+
+	for (i = 0; i < cfg_num_runs; i++) {
+		if (len_cp) {
+			do_poll(fd, POLLOUT, 1000);
+			do_send(fd, len_cp, 0);
+		}
+
+		do_poll(fd, POLLOUT, 1000);
+		do_send(fd, len_zc, cfg_zerocopy ? MSG_ZEROCOPY : 0);
+
+		if (i % batch == 0)
+			wait_for_notifications(fd);
+	}
+
+	wait_for_notifications(fd);
+}
+
+static void run_single(int fd, int len, int batch)
+{
+	run_test(fd, 0, len, batch);
+}
+
+/* combine zerocopy fragments with regular fragments */
+static void run_mix_zerocopy(int fd, int len_cp, int len_zc)
+{
+	run_test(fd, len_cp, len_zc, 1);
+}
+
+static void run_tests(int fd)
+{
+	/* test basic use */
+	run_single(fd, 4096, 1);
+	run_single(fd, 1500, 1);
+	run_single(fd, 1472, 1);
+	run_single(fd, 32000, 1);
+	run_single(fd, 65000, 1);
+	run_single(fd, BUFLEN, 1);
+
+	/* test notification on copybreak: data fits in skb head, no frags */
+	run_single(fd, 1, 1);
+
+	/* test coalescing */
+	run_single(fd, 32000, 4);
+	run_single(fd, 3000, 10);
+	run_single(fd, 100, 100);
+
+	run_mix_zerocopy(fd, 2000, 2000);
+	run_mix_zerocopy(fd, 100, 100);
+	run_mix_zerocopy(fd, 100, 1500);	/* fits coalesce in skb head */
+	run_mix_zerocopy(fd, 100, BUFLEN - 100);
+	run_mix_zerocopy(fd, 2000, 2000);
+
+	run_mix_zerocopy(fd, 1000, 12000);
+	run_mix_zerocopy(fd, 12000, 1000);
+	run_mix_zerocopy(fd, 12000, 12000);
+	run_mix_zerocopy(fd, 16000, 16000);
+
+	/* test more realistic async notifications */
+	run_single(fd, 1472, cfg_batch_notify);
+	run_single(fd, 1, cfg_batch_notify);
+	run_single(fd, BUFLEN, cfg_batch_notify);
+}
+
+static void run_stress_test(int fd, int runtime_sec)
+{
+	const int max_batch = 32;
+	unsigned long tstop, i = 0;
+	int len, len_cp, batch;
+
+	cfg_socksize = BUFLEN * max_batch * SK_FUDGE_FACTOR;
+	do_set_socksize(fd);
+
+	tstop = gettime_now_ms() + (runtime_sec * 1000);
+	do {
+		len = random() % BUFLEN;
+
+		/* create some skbs with only zerocopy frags */
+		if (len && ((i % 200) < 100))
+			len_cp = random() % BUFLEN;
+		else
+			len_cp = 0;
+
+		batch = random() % max_batch;
+
+		fprintf(stderr, "stress: cnt=%lu len_cp=%u len=%u batch=%u\n",
+			i, len_cp, len, batch);
+		run_test(fd, len_cp, len, batch);
+
+		i++;
+	} while (gettime_now_ms() < tstop);
+}
+
+static void parse_opts(int argc, char **argv, struct in_addr *addr)
+{
+	int c;
+
+	addr->s_addr = 0;
+
+	while ((c = getopt(argc, argv, "b:H:n:s:S:vV:z")) != -1) {
+		switch (c) {
+		case 'b':
+			cfg_batch_notify = strtol(optarg, NULL, 0);
+			break;
+		case 'H':
+			if (inet_pton(AF_INET, optarg, addr) != 1)
+				error(1, 0, "inet_pton: could not parse host");
+			break;
+		case 'n':
+			cfg_num_runs = strtol(optarg, NULL, 0);
+			break;
+		case 's':
+			cfg_socksize = strtol(optarg, NULL, 0);
+			break;
+		case 'S':
+			cfg_stress_sec = strtol(optarg, NULL, 0);
+			break;
+		case 'v':
+			cfg_verbose = 1;
+			break;
+		case 'V':
+			cfg_verbose = strtol(optarg, NULL, 0);
+			break;
+		case 'z':
+			cfg_zerocopy = true;
+			break;
+		}
+	}
+
+	if (addr->s_addr == 0)
+		error(1, 0, "host ('-H') argument required");
+
+	if (cfg_verbose) {
+		fprintf(stderr, "batch_notify:  %u\n", cfg_batch_notify);
+		fprintf(stderr, "num_runs:      %u\n", cfg_num_runs);
+		fprintf(stderr, "socksize:      %lu\n", cfg_socksize);
+		fprintf(stderr, "stress:        %u\n", cfg_stress_sec);
+		fprintf(stderr, "zerocopy:      %s\n", cfg_zerocopy ? "ON" : "OFF");
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct sockaddr_in addr = {};
+	int fd;
+
+	parse_opts(argc, argv, &addr.sin_addr);
+
+	fd = socket(PF_INET, SOCK_STREAM, 0);
+	if (fd == -1)
+		error(1, errno, "socket");
+
+	do_set_socksize(fd);
+
+	addr.sin_family = AF_INET;
+	addr.sin_port = htons(DEST_PORT);
+	if (connect(fd, (void *) &addr, sizeof(addr)))
+		error(1, errno, "connect");
+
+	if (cfg_num_runs)
+		run_tests(fd);
+
+	if (cfg_stress_sec)
+		run_stress_test(fd, cfg_stress_sec);
+
+	if (close(fd))
+		error(1, errno, "close");
+
+	fprintf(stderr, "OK. All tests passed\n");
+	return 0;
+}
diff --git a/tools/testing/selftests/net/snd_zerocopy_lo.c b/tools/testing/selftests/net/snd_zerocopy_lo.c
new file mode 100644
index 000000000000..309b016a4fd5
--- /dev/null
+++ b/tools/testing/selftests/net/snd_zerocopy_lo.c
@@ -0,0 +1,596 @@
+/* evaluate MSG_ZEROCOPY over the loopback interface */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <limits.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/socket.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#define MSG_ZEROCOPY	0x4000000
+
+#define NUM_LOOPS	4	/* MUST BE > 1 for corking to work */
+#define TXC_FUDGE	100
+
+static int  cfg_len_ms		= 4200;
+static int  cfg_report_len_ms	= 1000;
+static int  cfg_payload_len	= ((1 << 16) - 100);
+static bool cfg_test_packet;
+static bool cfg_test_raw;
+static bool cfg_test_raw_hdrincl;
+static bool cfg_test_tcp;
+static bool cfg_test_udp;
+static bool cfg_test_udp_cork;
+static bool cfg_verbose;
+static bool cfg_zerocopy;
+
+static bool flag_cork;
+
+static uint64_t tstop, treport;
+
+static unsigned long gettimeofday_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static uint16_t get_ip_csum(const uint16_t *start, int num_words)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_words; i++)
+		sum += start[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xFFFF) + (sum >> 16);
+
+	return ~sum;
+}
+
+static void timer_start(int timeout_ms)
+{
+	uint64_t tstart;
+
+	tstart = gettimeofday_ms();
+	treport = tstart + cfg_report_len_ms;
+	tstop = tstart + timeout_ms;
+}
+
+static bool timer_report(void)
+{
+	uint64_t tstart;
+
+	tstart = gettimeofday_ms();
+	if (tstart < treport)
+		return false;
+
+	treport = tstart + cfg_report_len_ms;
+	return true;
+}
+
+static bool timer_stop(void)
+{
+	return gettimeofday_ms() > tstop;
+}
+
+static int getnumcpus(void)
+{
+	int num = sysconf(_SC_NPROCESSORS_ONLN);
+
+	if (num < 1)
+		error(1, 0, "get num cpus\n");
+	return num;
+}
+
+static int setcpu(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask)) {
+		fprintf(stderr, "setaffinity %d\n", cpu);
+		return 1;
+	}
+
+	return 0;
+}
+
+static void test_mtu_is_max(int fd)
+{
+	struct ifreq ifr = {
+		.ifr_name = "lo",
+	};
+
+	if (ioctl(fd, SIOCGIFMTU, &ifr))
+		error(1, errno, "ioctl get mtu");
+
+	if (ifr.ifr_mtu != 1 << 16)
+		error(1, 0, "mtu=%u expected=2^16\n", ifr.ifr_mtu);
+}
+
+static void do_poll(int fd, int dir)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.events = dir;
+	pfd.revents = 0;
+	pfd.fd = fd;
+
+	ret = poll(&pfd, 1, 10);
+	if (ret == -1)
+		error(1, errno, "poll");
+	if (ret == 0)
+		error(1, 0, "poll: EAGAIN");
+}
+
+static bool do_write_once(int fd, struct msghdr *msg, int total_len, bool zcopy)
+{
+	int ret, flags;
+
+	flags = MSG_DONTWAIT;
+	if (zcopy)
+		flags |= MSG_ZEROCOPY;
+
+	ret = sendmsg(fd, msg, flags);
+	if (ret == -1 && (errno == EAGAIN || errno == ENOBUFS))
+		return false;
+
+	if (ret == -1)
+		error(1, errno, "send");
+	if (ret != total_len)
+		error(1, 0, "send: ret=%u\n", ret);
+
+	return true;
+}
+
+static void do_print_data_mismatch(char *tx, char *rx, int len)
+{
+	int i;
+
+	fprintf(stderr, "tx: ");
+	for (i = 0; i < len; i++)
+		fprintf(stderr, "%hx ", tx[i] & 0xff);
+	fprintf(stderr, "\nrx: ");
+	for (i = 0; i < len; i++)
+		fprintf(stderr, "%hx ", rx[i] & 0xff);
+	fprintf(stderr, "\n");
+}
+
+/* Flush @remaining bytes from the socket, blocking if necessary */
+static void do_flush_tcp(int fd, long remaining)
+{
+	unsigned long tstop;
+	int ret;
+
+	tstop = gettimeofday_ms() + 500;
+	while (remaining > 0 && gettimeofday_ms() < tstop) {
+		ret = recv(fd, NULL, remaining, MSG_TRUNC);
+		if (ret == -1)
+			error(1, errno, "recv (flush)");
+		remaining -= ret;
+		if (!remaining)
+			return;
+		fprintf(stderr, "recv (flush): %dB, %ldB left\n",
+			ret, remaining);
+	}
+
+	error(1, 0, "recv (flush): %ldB at timeout", remaining);
+}
+
+static bool do_read_once(int fd, char *tbuf, int type, bool corked, long *bytes)
+{
+	char rbuf[32], *payload;
+	int ret, len, expected, flags;
+
+	flags = MSG_DONTWAIT;
+	/* MSG_TRUNC differs on SOCK_STREAM: it flushes the buffer */
+	if (type != SOCK_STREAM)
+		flags |= MSG_TRUNC;
+
+	ret = recv(fd, rbuf, sizeof(rbuf), flags);
+	if (ret == -1 && errno == EAGAIN)
+		return false;
+	if (ret == -1)
+		error(1, errno, "recv");
+	if (type == SOCK_RAW)
+		ret -= sizeof(struct iphdr);
+
+	expected = sizeof(rbuf);
+	if (flags & MSG_TRUNC) {
+		expected = cfg_payload_len;
+		if (corked)
+			expected *= NUM_LOOPS;
+		*bytes += expected;
+	} else {
+		*bytes += cfg_payload_len;
+	}
+	if (ret != expected)
+		error(1, 0, "recv: ret=%u (exp=%u)\n", ret, expected);
+
+	payload = rbuf;
+	len = sizeof(rbuf);
+	if (type == SOCK_RAW) {
+		payload += sizeof(struct iphdr);
+		len -= sizeof(struct iphdr);
+	}
+
+	if (memcmp(payload, tbuf, len)) {
+		do_print_data_mismatch(tbuf, payload, len);
+		error(1, 0, "\nrecv: data mismatch\n");
+	}
+
+	/* Stream sockets are not truncated, so flush explicitly */
+	if (type == SOCK_STREAM)
+		do_flush_tcp(fd, cfg_payload_len - sizeof(rbuf));
+
+	return true;
+}
+
+static void setup_iph(struct iphdr *iph, uint16_t payload_len)
+{
+	memset(iph, 0, sizeof(*iph));
+	iph->version	= 4;
+	iph->tos	= 0;
+	iph->ihl	= 5;
+	iph->ttl	= 8;
+	iph->saddr	= htonl(INADDR_LOOPBACK);
+	iph->daddr	= htonl(INADDR_LOOPBACK);
+	iph->protocol	= IPPROTO_EGP;
+	iph->tot_len	= htons(sizeof(*iph) + payload_len);
+	iph->check	= get_ip_csum((void *) iph, iph->ihl << 1);
+	/* No need to calculate checksum: set by kernel */
+}
+
+static void do_cork(int fd, bool enable)
+{
+	int cork = !!enable;
+
+	if (setsockopt(fd, IPPROTO_UDP, UDP_CORK, &cork, sizeof(cork)))
+		error(1, errno, "cork %u", enable);
+}
+
+static int do_read_notification(int fd)
+{
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+	struct msghdr msg = {};
+	char control[100];
+	int64_t hi, lo, range;
+	int ret;
+
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
+	ret = recvmsg(fd, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
+	if (ret == -1 && errno == EAGAIN)
+		return 0;
+
+	if (ret == -1)
+		error(1, errno, "recvmsg notification");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, errno, "recvmsg notification: truncated");
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!cm)
+		error(1, 0, "cmsg: no cmsg");
+	if (!((cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
+	      (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR) ||
+	      (cm->cmsg_level == SOL_PACKET && cm->cmsg_type == PACKET_TX_TIMESTAMP)))
+		error(1, 0, "serr: wrong type");
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr: wrong type");
+
+	hi = serr->ee_data;
+	lo = serr->ee_info;
+	range = hi - lo + 1;
+	if (range < 0)
+		range += UINT32_MAX;
+
+	if (cfg_verbose)
+		fprintf(stderr, "completed: %lu (h=%lu l=%lu)\n",
+			range, hi, lo);
+
+	return (int) range;
+}
+
+static int do_read_notifications(int fd)
+{
+	int ret, len = 0;
+
+	do {
+		ret = do_read_notification(fd);
+		len += ret;
+	} while (ret);
+
+	return len;
+}
+
+static void do_run(int fdt, int fdr, int domain, int type, int protocol)
+{
+	static char tbuf[1 << 16];
+	struct sockaddr_ll laddr;
+	struct msghdr msg;
+	struct iovec iov[2];
+	struct iphdr iph;
+	long numtx = 0, numrx = 0, bytesrx = 0, numtxc = 0, expected_txc = 0;
+	int cpu, i, total_len = 0, type_r = type;
+
+	memset(&msg, 0, sizeof(msg));
+	memset(&iov, 0, sizeof(iov));
+	for (i = 0; i < sizeof(tbuf); i++)
+		tbuf[i] = 'a' + (i % 26);
+
+	i = 0;
+
+	/* for packet sockets, must prepare link layer information */
+	if (domain == PF_PACKET) {
+		memset(&laddr, 0, sizeof(laddr));
+		laddr.sll_family	= AF_PACKET;
+		laddr.sll_ifindex	= 1;	/* lo */
+		laddr.sll_protocol	= htons(ETH_P_IP);
+		laddr.sll_halen		= ETH_ALEN;
+
+		msg.msg_name		= &laddr;
+		msg.msg_namelen		= sizeof(laddr);
+
+		/* with PF_PACKET tx, do not expect ip_hdr on Rx */
+		type_r			= SOCK_DGRAM;
+	}
+
+	if (domain == PF_PACKET || protocol == IPPROTO_RAW) {
+		setup_iph(&iph, cfg_payload_len);
+		iov[i].iov_base = (void *) &iph;
+		iov[i].iov_len = sizeof(iph);
+		total_len += iov[i].iov_len;
+		i++;
+	}
+	iov[i].iov_base = tbuf;
+	iov[i].iov_len = cfg_payload_len;
+	total_len += iov[i].iov_len;
+
+	msg.msg_iovlen = i + 1;
+	msg.msg_iov = iov;
+
+	cpu = getnumcpus() - 1;
+	setcpu(cpu);
+	fprintf(stderr, "cpu: %u\n", cpu);
+
+	do {
+		if (cfg_zerocopy)
+			numtxc += do_read_notifications(fdt);
+
+		if (flag_cork)
+			do_cork(fdt, true);
+
+		for (i = 0; i < NUM_LOOPS; i++) {
+			bool do_zcopy = cfg_zerocopy;
+
+			if (flag_cork && (i & 0x1))
+				do_zcopy = false;
+
+			if (!do_write_once(fdt, &msg, total_len, do_zcopy)) {
+				do_poll(fdt, POLLOUT);
+				break;
+			}
+
+			numtx++;
+			if (do_zcopy)
+				expected_txc++;
+		}
+		if (flag_cork)
+			do_cork(fdt, false);
+
+		while (do_read_once(fdr, tbuf, type_r, flag_cork, &bytesrx))
+			numrx++;
+
+		if (timer_report()) {
+			fprintf(stderr, "rx=%lu (%lu MB) tx=%lu txc=%lu\n",
+				numrx, bytesrx >> 20, numtx, numtxc);
+		}
+	} while (!timer_stop());
+
+	if (cfg_zerocopy)
+		numtxc += do_read_notifications(fdt);
+
+	if (flag_cork)
+		numtx /= NUM_LOOPS;
+
+	if (labs(numtx - numrx) > TXC_FUDGE)
+		error(1, 0, "missing packets: %lu != %lu\n", numrx, numtx);
+	if (cfg_zerocopy && labs(expected_txc - numtxc) > TXC_FUDGE)
+		error(1, 0, "missing completions: rx=%lu expected=%lu\n",
+			    numtxc, expected_txc);
+}
+
+static int do_setup_rx(int domain, int type, int protocol)
+{
+	int fdr;
+
+	if (domain == PF_PACKET) {
+		/* Even when testing PF_PACKET Tx, Rx on PF_INET */
+		domain = PF_INET;
+		type = SOCK_RAW;
+		protocol = IPPROTO_EGP;
+	} else if (protocol == IPPROTO_RAW) {
+		protocol = IPPROTO_EGP;
+	}
+
+	fdr = socket(domain, type, protocol);
+	if (fdr == -1)
+		error(1, errno, "socket r");
+
+	return fdr;
+}
+
+static void do_setup_and_run(int domain, int type, int protocol)
+{
+	struct sockaddr_in addr;
+	socklen_t alen;
+	int fdr, fdt, ret;
+
+	fprintf(stderr, "test socket(%u, %u, %u)\n", domain, type, protocol);
+
+	fdr = do_setup_rx(domain, type, protocol);
+	fdt = socket(domain, type, protocol);
+	if (fdt == -1)
+		error(1, errno, "socket t");
+
+	test_mtu_is_max(fdr);
+
+	if (domain != PF_PACKET) {
+		memset(&addr, 0, sizeof(addr));
+		addr.sin_family = AF_INET;
+		addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+		alen = sizeof(addr);
+
+		if (bind(fdr, (void *) &addr, sizeof(addr)))
+			error(1, errno, "bind");
+		if (type == SOCK_STREAM && listen(fdr, 1))
+			error(1, errno, "listen");
+		if (getsockname(fdr, (void *) &addr, &alen) ||
+		    alen != sizeof(addr))
+			error(1, 0, "getsockname");
+		if (connect(fdt, (void *) &addr, sizeof(addr)))
+			error(1, errno, "connect");
+	}
+
+	if (type == SOCK_STREAM) {
+		int fda = fdr;
+
+		fdr = accept(fda, NULL, NULL);
+		if (fdr == -1)
+			error(1, errno, "accept");
+		if (close(fda))
+			error(1, errno, "close listen sock");
+	}
+
+	ret = 1 << 21;
+	if (setsockopt(fdr, SOL_SOCKET, SO_RCVBUF, &ret, sizeof(ret)))
+		error(1, errno, "socklen r");
+	if (setsockopt(fdt, SOL_SOCKET, SO_SNDBUF, &ret, sizeof(ret)))
+		error(1, errno, "socklen t");
+
+	timer_start(cfg_len_ms);
+	do_run(fdt, fdr, domain, type, protocol);
+
+	if (close(fdt))
+		error(1, errno, "close t");
+	if (close(fdr))
+		error(1, errno, "close r");
+
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	const char on[] = "ON", off[] = "OFF";
+	const int max_payload = IP_MAXPACKET - sizeof(struct iphdr);
+	int c;
+
+	while ((c = getopt(argc, argv, "l:prRs:tuUvz")) != -1) {
+		switch (c) {
+		case 'l':
+			cfg_len_ms = strtoul(optarg, NULL, 10) * 1000;
+			break;
+		case 'p':
+			cfg_test_packet = true;
+			break;
+		case 'r':
+			cfg_test_raw = true;
+			break;
+		case 'R':
+			cfg_test_raw_hdrincl = true;
+			break;
+		case 's':
+			cfg_payload_len = strtoul(optarg, NULL, 0);
+			break;
+		case 't':
+			cfg_test_tcp = true;
+			break;
+		case 'u':
+			cfg_test_udp = true;
+			break;
+		case 'U':
+			cfg_test_udp_cork = true;
+			break;
+		case 'v':
+			cfg_verbose = true;
+			break;
+		case 'z':
+			cfg_zerocopy = true;
+			break;
+		}
+	}
+
+	if (cfg_payload_len > max_payload)
+		error(1, 0, "-s: payload too long");
+	if (cfg_payload_len >= (max_payload - sizeof(struct tcphdr) - 10))
+		fprintf(stderr, "warn: len may exceed limit\n");
+
+	if (cfg_verbose) {
+		fprintf(stderr, "time:     %u ms\n"
+				"size:     %u B\n"
+				"zerocopy: %s\n",
+			cfg_len_ms,
+			cfg_payload_len,
+			cfg_zerocopy ? on : off);
+	}
+}
+
+int main(int argc, char **argv)
+{
+	parse_opts(argc, argv);
+
+	if (cfg_test_packet)
+		do_setup_and_run(PF_PACKET, SOCK_DGRAM, 0);
+	if (cfg_test_udp)
+		do_setup_and_run(PF_INET, SOCK_DGRAM, 0);
+	if (cfg_test_udp_cork) {
+		int saved_payload_len = cfg_payload_len;
+
+		cfg_payload_len /= NUM_LOOPS;
+
+		flag_cork = true;
+		do_setup_and_run(PF_INET, SOCK_DGRAM, 0);
+		flag_cork = false;
+
+		cfg_payload_len = saved_payload_len;
+	}
+	if (cfg_test_raw)
+		do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_EGP);
+	if (cfg_test_raw_hdrincl)
+		do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_RAW);
+	if (cfg_test_tcp)
+		do_setup_and_run(PF_INET, SOCK_STREAM, 0);
+
+	fprintf(stderr, "OK. All tests passed\n");
+	return 0;
+}
-- 
2.11.0.483.g087da7b7c-goog

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
  2017-02-22 16:38 ` [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
@ 2017-02-22 20:33   ` Eric Dumazet
  2017-02-23  1:51     ` Willem de Bruijn
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-22 20:33 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn

On Wed, 2017-02-22 at 11:38 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Refine skb_copy_ubufs to support compound pages. With upcoming TCP
> and UDP zerocopy sendmsg, such fragments may appear.
> 
> These skbuffs can have both kernel and zerocopy fragments, e.g., when
> corking. Avoid unnecessary copying of fragments that have no userspace
> reference.
> 
> It is not safe to modify skb frags when the skbuff is shared. This
> should not happen. Fail loudly if we find an unexpected edge case.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>
> ---
>  net/core/skbuff.c | 24 +++++++++++++++++++++++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f3557958e9bf..67e4216fca01 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   *	If this function is called from an interrupt gfp_mask() must be
>   *	%GFP_ATOMIC.
>   *
> + *	skb_shinfo(skb) can only be safely modified when not accessed
> + *	concurrently. Fail if the skb is shared or cloned.
> + *
>   *	Returns 0 on success or a negative error code on failure
>   *	to allocate kernel memory to copy to.
>   */
> @@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  	struct page *page, *head = NULL;
>  	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
>  
> +	if (skb_shared(skb) || skb_cloned(skb)) {
> +		WARN_ON_ONCE(1);
> +		return -EINVAL;
> +	}
> +
>  	for (i = 0; i < num_frags; i++) {
>  		u8 *vaddr;
> +		unsigned int order = 0;
> +		gfp_t mask = gfp_mask;
>  		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>  
> -		page = alloc_page(gfp_mask);
> +		page = skb_frag_page(f);
> +		if (page_count(page) == 1) {
> +			skb_frag_ref(skb, i);

This could be : get_page(page);

> +			goto copy_done;
> +		}
> +
> +		if (f->size > PAGE_SIZE) {
> +			order = get_order(f->size);
> +			mask |= __GFP_COMP;

Note that this would probably fail under memory pressure.

We could instead try to explode the few segments into order-0 only
pages.

Hopefully this case should not be frequent.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
  2017-02-22 20:33   ` Eric Dumazet
@ 2017-02-23  1:51     ` Willem de Bruijn
  0 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-23  1:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Network Development, Willem de Bruijn

>>
>> -             page = alloc_page(gfp_mask);
>> +             page = skb_frag_page(f);
>> +             if (page_count(page) == 1) {
>> +                     skb_frag_ref(skb, i);
>
> This could be : get_page(page);

Ah, indeed. Thanks.

>
>> +                     goto copy_done;
>> +             }
>> +
>> +             if (f->size > PAGE_SIZE) {
>> +                     order = get_order(f->size);
>> +                     mask |= __GFP_COMP;
>
> Note that this would probably fail under memory pressure.
>
> We could instead try to explode the few segments into order-0 only
> pages.

Good point. I'll revise to use only order-0 here.
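
A rough sketch of that order-0-only fallback (hypothetical, not
necessarily the final revision; names follow the existing
skb_copy_ubufs loop):

	u32 f_off = f->page_offset, f_size = f->size, done = 0;

	while (done < f_size) {
		u32 chunk = min_t(u32, f_size - done, PAGE_SIZE);

		page = alloc_page(gfp_mask);
		if (!page)
			return -ENOMEM;	/* after releasing pages copied so far */

		vaddr = kmap_atomic(skb_frag_page(f));
		memcpy(page_address(page), vaddr + f_off + done, chunk);
		kunmap_atomic(vaddr);

		/* link page into the replacement frag list here */
		done += chunk;
	}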

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (11 preceding siblings ...)
  2017-02-22 16:39 ` [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests Willem de Bruijn
@ 2017-02-23 15:45 ` David Miller
  2017-02-24 23:03 ` Alexei Starovoitov
  2017-02-27 18:57 ` Michael Kerrisk
  14 siblings, 0 replies; 37+ messages in thread
From: David Miller @ 2017-02-23 15:45 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 22 Feb 2017 11:38:49 -0500

> From: Willem de Bruijn <willemb@google.com>
> 
> RFCv2:
> 
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:

I've been over this series once or twice and generally speaking it looks
fine to me.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (12 preceding siblings ...)
  2017-02-23 15:45 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY David Miller
@ 2017-02-24 23:03 ` Alexei Starovoitov
  2017-02-25  0:25   ` Willem de Bruijn
  2017-02-27 18:57 ` Michael Kerrisk
  14 siblings, 1 reply; 37+ messages in thread
From: Alexei Starovoitov @ 2017-02-24 23:03 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn

On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
> 
> * Limitations / Known Issues
> 
> - PF_INET6 is not yet supported.

we struggled so far to make it work in our setups which are ipv6 only.
Looking at patches it seems the code should just work.
What particularly is missing?

Great stuff. Looking forward to net-next reopening :)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-24 23:03 ` Alexei Starovoitov
@ 2017-02-25  0:25   ` Willem de Bruijn
  0 siblings, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-25  0:25 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Network Development, Willem de Bruijn

On Fri, Feb 24, 2017 at 6:03 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Feb 22, 2017 at 11:38:49AM -0500, Willem de Bruijn wrote:
>>
>> * Limitations / Known Issues
>>
>> - PF_INET6 is not yet supported.
>
> we struggled so far to make it work in our setups which are ipv6 only.
> Looking at patches it seems the code should just work.
> What particularly is missing ?
>
> Great stuff. Looking forward to net-next reopening :)

Thanks for taking the feature for a spin!

The udp and raw paths require separate ipv6 patches. TCP should indeed
just work. I just ran a slightly modified snd_zerocopy_lo with good
results as well as a hacked netperf to another host.

I should have had ipv6 from the start, of course. Will add it before
resubmitting when net-next opens. For now, quick hack to
snd_zerocopy_lo.c:

diff --git a/tools/testing/selftests/net/snd_zerocopy_lo.c
b/tools/testing/selftests/net/snd_zerocopy_lo.c
index 309b016a4fd5..38a165e2af64 100644
--- a/tools/testing/selftests/net/snd_zerocopy_lo.c
+++ b/tools/testing/selftests/net/snd_zerocopy_lo.c
@@ -453,7 +453,7 @@ static int do_setup_rx(int domain, int type, int protocol)

 static void do_setup_and_run(int domain, int type, int protocol)
 {
-       struct sockaddr_in addr;
+       struct sockaddr_in6 addr;
        socklen_t alen;
        int fdr, fdt, ret;

@@ -468,8 +468,8 @@ static void do_setup_and_run(int domain, int type,
int protocol)

        if (domain != PF_PACKET) {
                memset(&addr, 0, sizeof(addr));
-               addr.sin_family = AF_INET;
-               addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
+               addr.sin6_family = AF_INET6;
+               addr.sin6_addr = in6addr_loopback;
                alen = sizeof(addr);

                if (bind(fdr, (void *) &addr, sizeof(addr)))
@@ -589,7 +589,7 @@ int main(int argc, char **argv)
        if (cfg_test_raw_hdrincl)
                do_setup_and_run(PF_INET, SOCK_RAW, IPPROTO_RAW);
        if (cfg_test_tcp)
-               do_setup_and_run(PF_INET, SOCK_STREAM, 0);
+               do_setup_and_run(PF_INET6, SOCK_STREAM, 0);


Loopback zerocopy is disabled in RFCv2, so using snd_zerocopy_lo to
feature requires this hack in skb_orphan_frags_rx:

 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-       if (likely(!skb_zcopy(skb)))
-               return 0;
-       return skb_copy_ubufs(skb, gfp_mask);
+       return skb_orphan_frags(skb, gfp_mask);
 }

With this change, I see

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=106364 (6637 MB) tx=106364 txc=0
rx=209736 (13088 MB) tx=209736 txc=0
rx=314524 (19627 MB) tx=314524 txc=0
rx=419424 (26174 MB) tx=419424 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t -z
test socket(10, 1, 0)
cpu: 23
rx=239792 (14964 MB) tx=239792 txc=239786
rx=477376 (29790 MB) tx=477376 txc=477370
rx=715016 (44620 MB) tx=715016 txc=715010
rx=952820 (59460 MB) tx=952820 txc=952814
OK. All tests passed

In comparison, the same without the change

$ ./snd_zerocopy_lo_ipv6 -t
test socket(10, 1, 0)
cpu: 23
rx=109908 (6858 MB) tx=109908 txc=0
rx=217100 (13548 MB) tx=217100 txc=0
rx=326584 (20380 MB) tx=326584 txc=0
rx=429568 (26807 MB) tx=429568 txc=0
OK. All tests passed

$ ./snd_zerocopy_lo_ipv6 -t  -z
test socket(10, 1, 0)
cpu: 23
rx=87636 (5468 MB) tx=87636 txc=87630
rx=174328 (10878 MB) tx=174328 txc=174322
rx=260360 (16247 MB) tx=260360 txc=260354
rx=346512 (21623 MB) tx=346512 txc=346506

Here the sk_buff hits the deep copy in skb_copy_ubufs called from
__netif_receive_skb_core, which actually degrades performance versus
copying as part of the sendmsg() syscall.

The netperf change is to add MSG_ZEROCOPY to send() in send_tcp_stream
and to add a recvmsg(send_socket, &msg, MSG_ERRQUEUE) call to the same
function, preferably called only once every N iterations. This does
not take any additional explicit references on the send_ring element
while data is in flight, so is really a hack, but ring contents should
be static throughout the test. I did not modify the omni tests, so
this requires building with --no-omni.
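
Roughly, as a hypothetical sketch (not the actual netperf change;
ring_ptr, sends and notify_every are stand-ins for the send_ring
element's buffer, an iteration counter and the chosen interval):

	send(send_socket, ring_ptr, send_size, MSG_ZEROCOPY);

	if ((++sends % notify_every) == 0) {
		struct msghdr msg = {0};
		char control[100];

		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);
		recvmsg(send_socket, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
	}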

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (13 preceding siblings ...)
  2017-02-24 23:03 ` Alexei Starovoitov
@ 2017-02-27 18:57 ` Michael Kerrisk
  2017-02-28 19:46   ` Andy Lutomirski
  14 siblings, 1 reply; 37+ messages in thread
From: Michael Kerrisk @ 2017-02-27 18:57 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, Willem de Bruijn, Linux API

[CC += linux-api@vger.kernel.org]

Hi Willem

This is a change to the kernel-user-space API. Please CC
linux-api@vger.kernel.org on any future iterations of this patch.

Thanks,

Michael



On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:
>
>   RFC -> RFCv2:
>     - review comment: do not loop skb with zerocopy frags onto rx:
>           add skb_orphan_frags_rx to orphan even refcounted frags
>           call this in __netif_receive_skb_core, deliver_skb and tun:
>           the same as 1080e512d44d ("net: orphan frags on receive")
>     - fix: hold an explicit sk reference on each notification skb.
>           previously relied on the reference (or wmem) held by the
>           data skb that would trigger notification, but this breaks
>           on skb_orphan.
>     - fix: when aborting a send, do not inc the zerocopy counter
>           this caused gaps in the notification chain
>     - fix: in packet with SOCK_DGRAM, pull ll headers before calling
>           zerocopy_sg_from_iter
>     - fix: if sock_zerocopy_realloc does not allow coalescing,
>           do not fail, just allocate a new ubuf
>     - fix: in tcp, check return value of second allocation attempt
>     - chg: allocate notification skbs from optmem
>           to avoid affecting tcp write queue accounting (TSQ)
>     - chg: limit #locked pages (ulimit) per user instead of per process
>     - chg: grow notification ids from 16 to 32 bit
>       - pass range [lo, hi] through 32 bit fields ee_info and ee_data
>     - chg: rebased to davem-net-next on top of v4.10-rc7
>     - add: limit notification coalescing
>           sharing ubufs limits overhead, but delays notification until
>           the last packet is released, possibly unbounded. Add a cap.
>     - tests: add snd_zerocopy_lo pf_packet test
>     - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
>
> The change to allocate notification skbuffs from optmem requires
> ensuring that net.core.optmem is at least a few 100KB. To
> experiment, run
>
>   sysctl -w net.core.optmem_max=1048576
>
> The snd_zerocopy_lo benchmarks reported in the individual patches were
> rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
> replaced with skb_orphan_frags to allow looping to local sockets. The
> netperf results below are also rerun with v2.
>
> In application load, copy avoidance shows a roughly 5% systemwide
> reduction in cycles when streaming large flows and a 4-8% reduction in
> wall clock time on early tensorflow test workloads.
>
>
> Overview (from original RFC):
>
> Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
> Implement the feature for TCP, UDP, RAW and packet sockets. This is
> a generalization of a previous packet socket RFC patch
>
>   http://patchwork.ozlabs.org/patch/413184/
>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>
> The kernel already implements such copy avoidance with vmsplice plus
> splice and with ubuf_info for tun and virtio. Extend the second
> with features required by TCP and others: reference counting to
> support cloning (retransmit queue) and shared fragments (GSO) and
> notification coalescing to handle corking.
>
> Notifications are queued onto the socket error queue as a range
> [N, N+m], where N is a per-socket counter incremented on each
> successful zerocopy send call.
>
> * Performance
>
> The below table shows cycles reported by perf for a netperf process
> sending a single 10 Gbps TCP_STREAM. The first three columns show
> Mcycles spent in the netperf process context. The second three columns
> show time spent systemwide (-a -C A,B) on the two cpus that run the
> process and interrupt handler. Reported is the median of at least 3
> runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
> Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
> are disabled and the kernel is booted with idle=halt.
>
> NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
>
> perf stat -e cycles $NETPERF
> perf stat -C 2,3 -a -e cycles $NETPERF
>
>         --process cycles--      ----cpu cycles----
>            std      zc   %      std         zc   %
> 4K      27,609  11,217  41      49,217  39,175  79
> 16K     21,370   3,823  18      43,540  29,213  67
> 64K     20,557   2,312  11      42,189  26,910  64
> 256K    21,110   2,134  10      43,006  27,104  63
> 1M      20,987   1,610   8      42,759  25,931  61
>
> Perf record indicates the main source of these differences. Process
> cycles only at 1M writes (perf record; perf report -n):
>
> std:
> Samples: 42K of event 'cycles', Event count (approx.): 21258597313
>  79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
>   3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
>   1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
>   0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
>   0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb
>
> zc:
> Samples: 1K of event 'cycles', Event count (approx.): 1439509124
>  30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
>  14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
>   8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
>   4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
>   3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node
>
>
> * Safety
>
> The number of pages that can be pinned on behalf of a user with
> MSG_ZEROCOPY is bound by the locked memory ulimit.
>
> While the kernel holds process memory pinned, a process cannot safely
> reuse those pages for other purposes. Packets looped onto the receive
> stack and queued to a socket can be held indefinitely. Avoid unbounded
> notification latency by restricting user pages to egress paths only.
> skb_orphan_frags_rx() will create a private copy of pages even for
> refcounted packets when these are looped, as did skb_orphan_frags for
> the original tun zerocopy implementation.
>
> Pages are not remapped read-only. Processes can modify packet contents
> while packets are in flight in the kernel path. Bytes on which kernel
> control flow depends (headers) are copied to avoid TOCTTOU attacks.
> Datapath integrity does not otherwise depend on payload, with three
> exceptions: checksums, optional sk_filter/tc u32/.. and device +
> driver logic. The effect of wrong checksums is limited to the
> misbehaving process. TC filters that access contents may have to be
> excluded by adding an skb_orphan_frags_rx.
>
> Processes can also safely avoid OOM conditions by bounding the number
> of bytes passed with MSG_ZEROCOPY and by removing shared pages after
> transmission from their own memory map.
>
>
> * Limitations / Known Issues
>
> - PF_INET6 is not yet supported.
> - TCP does not build max GSO packets, especially for
>      small send buffers (< 4 KB)
>
> Willem de Bruijn (12):
>   sock: allocate skbs from optmem
>   sock: skb_copy_ubufs support for compound pages
>   sock: add generic socket zerocopy
>   sock: enable sendmsg zerocopy
>   sock: sendmsg zerocopy notification coalescing
>   sock: sendmsg zerocopy ulimit
>   sock: sendmsg zerocopy limit bytes per notification
>   tcp: enable sendmsg zerocopy
>   udp: enable sendmsg zerocopy
>   raw: enable sendmsg zerocopy with IP_HDRINCL
>   packet: enable sendmsg zerocopy
>   test: add sendmsg zerocopy tests
>
>  drivers/net/tun.c                             |   2 +-
>  drivers/vhost/net.c                           |   1 +
>  include/linux/sched.h                         |   2 +-
>  include/linux/skbuff.h                        |  94 +++-
>  include/linux/socket.h                        |   1 +
>  include/net/sock.h                            |   4 +
>  include/uapi/linux/errqueue.h                 |   1 +
>  net/core/datagram.c                           |  35 +-
>  net/core/dev.c                                |   4 +-
>  net/core/skbuff.c                             | 327 ++++++++++++--
>  net/core/sock.c                               |  29 ++
>  net/ipv4/ip_output.c                          |  34 +-
>  net/ipv4/raw.c                                |  27 +-
>  net/ipv4/tcp.c                                |  37 +-
>  net/packet/af_packet.c                        |  52 ++-
>  tools/testing/selftests/net/.gitignore        |   2 +
>  tools/testing/selftests/net/Makefile          |   1 +
>  tools/testing/selftests/net/snd_zerocopy.c    | 354 +++++++++++++++
>  tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++++++++++++++++++++++++++
>  19 files changed, 1536 insertions(+), 67 deletions(-)
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
>  create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c
>
> --
> 2.11.0.483.g087da7b7c-goog
>
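
To illustrate the flow described in the quoted overview, a minimal userspace
sketch follows: send with MSG_ZEROCOPY, then reap the completion range from
the socket error queue. The sketch assumes the interface as it was eventually
merged upstream, so the SO_ZEROCOPY socket option, the SO_EE_ORIGIN_ZEROCOPY
origin code and the [ee_info, ee_data] range encoding are assumptions relative
to this RFC and may differ from the patches posted here.

  /*
   * Sketch only: zerocopy send plus completion reaping, using the
   * MSG_ZEROCOPY interface as eventually merged upstream.
   */
  #include <errno.h>
  #include <stdio.h>
  #include <sys/socket.h>
  #include <linux/errqueue.h>

  static void reap_completions(int fd)
  {
          char control[128];
          struct msghdr msg = { .msg_control = control,
                                .msg_controllen = sizeof(control) };
          struct cmsghdr *cm;
          struct sock_extended_err *serr;

          if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
                  return;                         /* nothing queued yet */

          for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                  serr = (struct sock_extended_err *)CMSG_DATA(cm);
                  if (serr->ee_errno != 0 ||
                      serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                          continue;
                  /* sends [ee_info, ee_data] completed: buffers reusable */
                  printf("zerocopy completed: %u..%u\n",
                         serr->ee_info, serr->ee_data);
          }
  }

  int send_zerocopy(int fd, const void *buf, size_t len)
  {
          int one = 1;

          /* opt in once per socket (a post-RFC addition upstream) */
          setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

          /*
           * The pages backing buf are pinned (charged against the locked
           * memory ulimit) and must not be reused until the matching
           * notification arrives on the error queue.
           */
          if (send(fd, buf, len, MSG_ZEROCOPY) == -1)
                  return -errno;

          reap_completions(fd);
          return 0;
  }

A real sender would poll the error queue and recycle a buffer only once its
send id falls inside an acknowledged [lo, hi] range.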



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-27 18:57 ` Michael Kerrisk
@ 2017-02-28 19:46   ` Andy Lutomirski
  2017-02-28 20:43     ` Willem de Bruijn
  2017-03-01  3:25     ` David Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 19:46 UTC (permalink / raw)
  To: Michael Kerrisk; +Cc: Willem de Bruijn, netdev, Willem de Bruijn, Linux API

On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
<mtk.manpages@gmail.com> wrote:
> [CC += linux-api@vger.kernel.org]
>
> Hi Willem
>

>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>> creates skbuff fragments directly from these pages. On tx completion,
>> it notifies the socket owner that it is safe to modify memory by
>> queuing a completion notification onto the socket error queue.

What happens if the user writes to the pages while it's not safe?

How about if you're writing to an interface or a route that has crypto
involved and a malicious user can make the data change in the middle
of a crypto operation, thus perhaps leaking the entire key?  (I
wouldn't be at all surprised if a lot of provably secure AEAD
constructions are entirely compromised if an attacker can get the
ciphertext and tag computed from a message that changed during the
computation.)

I can see this working if you have a special type of skb that
indicates that the data might be concurrently written and have all the
normal skb APIs (including, especially, anything that clones it) make
a copy first.

--Andy


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 19:46   ` Andy Lutomirski
@ 2017-02-28 20:43     ` Willem de Bruijn
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01  3:25     ` David Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-28 20:43 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
> <mtk.manpages@gmail.com> wrote:
>> [CC += linux-api@vger.kernel.org]
>>
>> Hi Willem
>>
>
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
>
> What happens if the user writes to the pages while it's not safe?
>
> How about if you're writing to an interface or a route that has crypto
> involved and a malicious user can make the data change in the middle
> of a crypto operation, thus perhaps leaking the entire key?  (I
> wouldn't be at all surprised if a lot of provably secure AEAD
> constructions are entirely compromised if an attacker can get the
> ciphertext and tag computed from a message that changed during the
> computation.

Operations that read or write payload, such as this crypto example,
but also ebpf in tc or iptables, for instance, demand a deep copy using
skb_copy_ubufs before the operation.

This blacklist approach requires caution, but these paths should be
few and countable. It is not possible to predict at the socket layer
whether a packet will encounter any such operation, so white-listing
a subset of end-to-end paths is not practical.

> I can see this working if you have a special type of skb that
> indicates that the data might be concurrently written and have all the
> normal skb APIs (including, especially, anything that clones it) make
> a copy first.

Support for cloned skbs is required for TCP, both at tcp_transmit_skb
and segmentation offload. Patch 4 especially adds reference counting
of shared pages across clones and other sk_buff operations like
pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
on clones in specific datapaths like the above.
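
To make the skb_copy_ubufs point concrete, here is a rough kernel-side sketch
of a payload-touching path. This is illustrative only: example_touch_payload
is a hypothetical helper, and the SKBTX_DEV_ZEROCOPY check simply mirrors what
skb_orphan_frags-style wrappers do; the real call sites and error handling in
the patches differ.

  #include <linux/skbuff.h>

  /* Hypothetical path that is about to read or rewrite packet payload. */
  static int example_touch_payload(struct sk_buff *skb)
  {
          /* frags still backed by pinned user pages? */
          if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
                  /* give the skb private copies of all frags */
                  int err = skb_copy_ubufs(skb, GFP_ATOMIC);

                  if (err)
                          return err;     /* caller drops or defers */
          }

          /* payload can no longer change underneath us */
          return 0;
  }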


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 21:06         ` Andy Lutomirski
  2017-03-01  3:28           ` David Miller
  2017-02-28 21:09         ` Andy Lutomirski
  1 sibling, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 21:06 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>> <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> [CC += linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org]
>>>
>>> Hi Willem
>>>
>>
>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>> it notifies the socket owner that it is safe to modify memory by
>>>> queuing a completion notification onto the socket error queue.
>>
>> What happens if the user writes to the pages while it's not safe?
>>
>> How about if you're writing to an interface or a route that has crypto
>> involved and a malicious user can make the data change in the middle
>> of a crypto operation, thus perhaps leaking the entire key?  (I
>> wouldn't be at all surprised if a lot of provably secure AEAD
>> constructions are entirely compromised if an attacker can get the
>> ciphertext and tag computed from a message that changed during the
>> computation.
>
> Operations that read or write payload, such as this crypto example,
> but also ebpf in tc or iptables, for instance, demand a deep copy using
> skb_copy_ubufs before the operation.
>
> This blacklist approach requires caution, but these paths should be
> few and countable. It is not possible to predict at the socket layer
> whether a packet will encounter any such operation, so white-listing
> a subset of end-to-end paths is not practical.

How about hardware that malfunctions if the packet changes out from
under it?  A whitelist seems quite a bit safer.

--Andy


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-02-28 21:06         ` Andy Lutomirski
@ 2017-02-28 21:09         ` Andy Lutomirski
  2017-02-28 21:28           ` Willem de Bruijn
  2017-02-28 21:47           ` Eric Dumazet
  1 sibling, 2 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 21:09 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
<willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> I can see this working if you have a special type of skb that
>> indicates that the data might be concurrently written and have all the
>> normal skb APIs (including, especially, anything that clones it) make
>> a copy first.
>
> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
> and segmentation offload. Patch 4 especially adds reference counting
> of shared pages across clones and other sk_buff operations like
> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
> on clones in specific datapaths like the above.

Does this mean that a user program that does a zerocopy send can cause
a retransmitted segment to contain different data than the original
segment?  If so, is that okay?


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:09         ` Andy Lutomirski
@ 2017-02-28 21:28           ` Willem de Bruijn
  2017-02-28 21:47           ` Eric Dumazet
  1 sibling, 0 replies; 37+ messages in thread
From: Willem de Bruijn @ 2017-02-28 21:28 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Michael Kerrisk, netdev, Willem de Bruijn, Linux API

>>> I can see this working if you have a special type of skb that
>>> indicates that the data might be concurrently written and have all the
>>> normal skb APIs (including, especially, anything that clones it) make
>>> a copy first.
>>
>> Support for cloned skbs is required for TCP, both at tcp_transmit_skb
>> and segmentation offload. Patch 4 especially adds reference counting
>> of shared pages across clones and other sk_buff operations like
>> pskb_expand_head. This still allows for deep copy (skb_copy_ubufs)
>> on clones in specific datapaths like the above.
>
> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment? If so, is that okay?

That is possible, indeed. The bytestream at the receiver is then
likely undefined, though the integrity of the TCP receive stack should
not be affected. A valid question is whether the stack should protect
against such users. The pattern is reminiscent of evasion attacks. In
the limited case, privileged users can already generate this data
pattern, of course.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:09         ` Andy Lutomirski
  2017-02-28 21:28           ` Willem de Bruijn
@ 2017-02-28 21:47           ` Eric Dumazet
       [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-28 21:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:

> Does this mean that a user program that does a zerocopy send can cause
> a retransmitted segment to contain different data than the original
> segment?  If so, is that okay?

Same remark applies to sendfile() already, or other zero copy modes
(vmsplice() + splice() )


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2017-02-28 22:25               ` Andy Lutomirski
       [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 22:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>
>> Does this mean that a user program that does a zerocopy send can cause
>> a retransmitted segment to contain different data than the original
>> segment?  If so, is that okay?
>
> Same remark applies to sendfile() already

True.

>, or other zero copy modes
> (vmsplice() + splice() )

I hate vmsplice().  I thought I remembered it being essentially
disabled at some point due to security problems.

>


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-02-28 22:40                   ` Eric Dumazet
  2017-02-28 22:52                     ` Andy Lutomirski
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-28 22:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
> >
> >> Does this mean that a user program that does a zerocopy send can cause
> >> a retransmitted segment to contain different data than the original
> >> segment?  If so, is that okay?
> >
> > Same remark applies to sendfile() already
> 
> True.
> 
> >, or other zero copy modes
> > (vmsplice() + splice() )
> 
> I hate vmsplice().  I thought I remembered it being essentially
> disabled at some point due to security problems.

Right, zero copy is hard ;)

vmsplice() is not disabled in current kernels, unless I missed
something.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 22:40                   ` Eric Dumazet
@ 2017-02-28 22:52                     ` Andy Lutomirski
  2017-02-28 23:22                       ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Andy Lutomirski @ 2017-02-28 22:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 2:40 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2017-02-28 at 14:25 -0800, Andy Lutomirski wrote:
>> On Tue, Feb 28, 2017 at 1:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2017-02-28 at 13:09 -0800, Andy Lutomirski wrote:
>> >
>> >> Does this mean that a user program that does a zerocopy send can cause
>> >> a retransmitted segment to contain different data than the original
>> >> segment?  If so, is that okay?
>> >
>> > Same remark applies to sendfile() already
>>
>> True.
>>
>> >, or other zero copy modes
>> > (vmsplice() + splice() )
>>
>> I hate vmsplice().  I thought I remembered it being essentially
>> disabled at some point due to security problems.
>
> Right, zero copy is hard ;)
>
> vmsplice() is not disabled in current kernels, unless I missed
> something.
>

I think you're right.  That being said, from the man page:

The user pages are a gift to the kernel.  The application  may  not
modify this memory ever, otherwise the page cache and on-disk data may
differ.

This is just not okay IMO.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 22:52                     ` Andy Lutomirski
@ 2017-02-28 23:22                       ` Eric Dumazet
       [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2017-02-28 23:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Willem de Bruijn, Michael Kerrisk, netdev, Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:

> The user pages are a gift to the kernel.  The application  may  not
> modify this memory ever, otherwise the page cache and on-disk data may
> differ.
> 
> This is just not okay IMO.

TCP works just fine in this case.

TX checksum will be computed by the NIC after/while data is copied.

If the application really does change the data, that will not cause any
problems other than user-side consistency.

This is why we require a copy (for all buffers that came from zero-copy)
if network stack hits a device that can not offload TX checksum.

Even pwrite() does not guarantee consistency if multiple threads are
using it on overlapping regions.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2017-03-01  0:28                           ` Tom Herbert
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Tom Herbert @ 2017-03-01  0:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andy Lutomirski, Willem de Bruijn, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>
>> The user pages are a gift to the kernel.  The application  may  not
>> modify this memory ever, otherwise the page cache and on-disk data may
>> differ.
>>
>> This is just not okay IMO.
>
> TCP works just fine in this case.
>
> TX checksum will be computed by the NIC after/while data is copied.
>
> If really the application changes the data, that will not cause any
> problems, other than user side consistency.
>
> This is why we require a copy (for all buffers that came from zero-copy)
> if network stack hits a device that can not offload TX checksum.
>
> Even pwrite() does not guarantee consistency if multiple threads are
> using it on overlapping regions.
>
The Mellanox team working on TLS offload pointed out to us that if
data is changed for a retransmit then it becomes trivial for someone
snooping to break the encryption. Sounds pretty scary and it would be
a shame if we couldn't use zero-copy in that use case :-( Hopefully we
can find a solution...

Tom

>
>


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-03-01  0:37                               ` Eric Dumazet
  2017-03-01  0:58                               ` Willem de Bruijn
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2017-03-01  0:37 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Andy Lutomirski, Willem de Bruijn, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, 2017-02-28 at 16:28 -0800, Tom Herbert wrote:

> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...

Right, this is why offloading encryption over TCP is also hard ;)


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
       [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-03-01  0:37                               ` Eric Dumazet
@ 2017-03-01  0:58                               ` Willem de Bruijn
  2017-03-01  1:50                                 ` Tom Herbert
  1 sibling, 1 reply; 37+ messages in thread
From: Willem de Bruijn @ 2017-03-01  0:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Andy Lutomirski, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert <tom-BjP2VixgY4xUbtYUoyoikg@public.gmane.org> wrote:
> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>
>>> The user pages are a gift to the kernel.  The application  may  not
>>> modify this memory ever, otherwise the page cache and on-disk data may
>>> differ.
>>>
>>> This is just not okay IMO.
>>
>> TCP works just fine in this case.
>>
>> TX checksum will be computed by the NIC after/while data is copied.
>>
>> If really the application changes the data, that will not cause any
>> problems, other than user side consistency.
>>
>> This is why we require a copy (for all buffers that came from zero-copy)
>> if network stack hits a device that can not offload TX checksum.
>>
>> Even pwrite() does not guarantee consistency if multiple threads are
>> using it on overlapping regions.
>>
> The Mellanox team working on TLS offload pointed out to us that if
> data is changed for a retransmit then it becomes trivial for someone
> snooping to break the encryption. Sounds pretty scary and it would be
> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
> can find a solution...
>

This requires collusion by the process initiating the zerocopy send
to help the entity snooping the link. That could be an attack on admin
configured tunnels, but user-directed encryption offload like AF_TLS
can still use zerocopy.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  0:58                               ` Willem de Bruijn
@ 2017-03-01  1:50                                 ` Tom Herbert
  0 siblings, 0 replies; 37+ messages in thread
From: Tom Herbert @ 2017-03-01  1:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Eric Dumazet, Andy Lutomirski, Michael Kerrisk, netdev,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 4:58 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Tue, Feb 28, 2017 at 7:28 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Feb 28, 2017 at 3:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Tue, 2017-02-28 at 14:52 -0800, Andy Lutomirski wrote:
>>>
>>>> The user pages are a gift to the kernel.  The application  may  not
>>>> modify this memory ever, otherwise the page cache and on-disk data may
>>>> differ.
>>>>
>>>> This is just not okay IMO.
>>>
>>> TCP works just fine in this case.
>>>
>>> TX checksum will be computed by the NIC after/while data is copied.
>>>
>>> If really the application changes the data, that will not cause any
>>> problems, other than user side consistency.
>>>
>>> This is why we require a copy (for all buffers that came from zero-copy)
>>> if network stack hits a device that can not offload TX checksum.
>>>
>>> Even pwrite() does not guarantee consistency if multiple threads are
>>> using it on overlapping regions.
>>>
>> The Mellanox team working on TLS offload pointed out to us that if
>> data is changed for a retransmit then it becomes trivial for someone
>> snooping to break the encryption. Sounds pretty scary and it would be
>> a shame if we couldn't use zero-copy in that use case :-( Hopefully we
>> can find a solution...
>>
>
> This requires collusion by the process initiating the zerocopy send
> to help the entity snooping the link. That could be an attack on admin
> configured tunnels, but user-directed encryption offload like AF_TLS
> can still use zerocopy.

Yes, but we can't trust the user to always understand or correctly
implement the semantic nuances when security is involved. If we can't
provide a robust API then the only recourse is to not allow zero copy
in that case. We could suggest COW to solve all problems, but I think
we know where the conversation will go ;-)

Tom


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 19:46   ` Andy Lutomirski
  2017-02-28 20:43     ` Willem de Bruijn
@ 2017-03-01  3:25     ` David Miller
  1 sibling, 0 replies; 37+ messages in thread
From: David Miller @ 2017-03-01  3:25 UTC (permalink / raw)
  To: luto; +Cc: mtk.manpages, willemdebruijn.kernel, netdev, willemb, linux-api

From: Andy Lutomirski <luto@amacapital.net>
Date: Tue, 28 Feb 2017 11:46:23 -0800

> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
> <mtk.manpages@gmail.com> wrote:
>> [CC += linux-api@vger.kernel.org]
>>
>> Hi Willem
>>
> 
>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>> creates skbuff fragments directly from these pages. On tx completion,
>>> it notifies the socket owner that it is safe to modify memory by
>>> queuing a completion notification onto the socket error queue.
> 
> What happens if the user writes to the pages while it's not safe?

Just want to mention that this ability to write to data behind a
network send's back is not a new thing added by MSG_ZEROCOPY.

All of this is already possible with sendfile().  The pages can be
written to completely asynchronously to the data being pulled out of
the page cache into the transmit path.

The crypto case is interesting, but that is a separate discussion
about an existing problem rather than something specifically new to
the MSG_ZEROCOPY changes.

Thanks.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-02-28 21:06         ` Andy Lutomirski
@ 2017-03-01  3:28           ` David Miller
  2017-03-01  3:43             ` Eric Dumazet
  2017-03-02 19:26             ` Andy Lutomirski
  0 siblings, 2 replies; 37+ messages in thread
From: David Miller @ 2017-03-01  3:28 UTC (permalink / raw)
  To: luto; +Cc: willemdebruijn.kernel, mtk.manpages, netdev, willemb, linux-api

From: Andy Lutomirski <luto@amacapital.net>
Date: Tue, 28 Feb 2017 13:06:49 -0800

> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>> <mtk.manpages@gmail.com> wrote:
>>>> [CC += linux-api@vger.kernel.org]
>>>>
>>>> Hi Willem
>>>>
>>>
>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>> queuing a completion notification onto the socket error queue.
>>>
>>> What happens if the user writes to the pages while it's not safe?
>>>
>>> How about if you're writing to an interface or a route that has crypto
>>> involved and a malicious user can make the data change in the middle
>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>> constructions are entirely compromised if an attacker can get the
>>> ciphertext and tag computed from a message that changed during the
>>> computation.
>>
>> Operations that read or write payload, such as this crypto example,
>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>> skb_copy_ubufs before the operation.
>>
>> This blacklist approach requires caution, but these paths should be
>> few and countable. It is not possible to predict at the socket layer
>> whether a packet will encounter any such operation, so white-listing
>> a subset of end-to-end paths is not practical.
> 
> How about hardware that malfunctions if the packet changes out from
> under it?  A whitelist seems quite a bit safer.

These devices are already choking, because as I stated this can already
be done via sendfile().

Networking-card-wise this isn't an issue: chips bring the entire packet
into their FIFO, compute checksums on the fly mid-stream, and then write
the 16-bit checksum field before starting to write the packet onto the
wire.

I think this is completely a non-issue, and we thought about this right
from the start when sendfile() support was added nearly two decades ago.
If network cards from back then didn't crap out in this situation I
think the ones out there now are probably ok.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  3:28           ` David Miller
@ 2017-03-01  3:43             ` Eric Dumazet
  2017-03-02 19:26             ` Andy Lutomirski
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2017-03-01  3:43 UTC (permalink / raw)
  To: David Miller
  Cc: luto, willemdebruijn.kernel, mtk.manpages, netdev, willemb, linux-api

On Tue, 2017-02-28 at 22:28 -0500, David Miller wrote:

> These device are already choking, because as I stated this can already
> be done via sendfile().
> 
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
> 
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Well, we had to fix one issue with the GSO fallback about 4 years ago.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cef401de7be8c4e155c6746bfccf721a4fa5fab9

So extra scrutiny would be nice.

Zero copy is incredibly hard to get right.


* Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
  2017-03-01  3:28           ` David Miller
  2017-03-01  3:43             ` Eric Dumazet
@ 2017-03-02 19:26             ` Andy Lutomirski
  1 sibling, 0 replies; 37+ messages in thread
From: Andy Lutomirski @ 2017-03-02 19:26 UTC (permalink / raw)
  To: David Miller
  Cc: Willem de Bruijn, Michael Kerrisk, Network Development,
	Willem de Bruijn, Linux API

On Tue, Feb 28, 2017 at 7:28 PM, David Miller <davem@davemloft.net> wrote:
> From: Andy Lutomirski <luto@amacapital.net>
> Date: Tue, 28 Feb 2017 13:06:49 -0800
>
>> On Tue, Feb 28, 2017 at 12:43 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>> On Tue, Feb 28, 2017 at 2:46 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> On Mon, Feb 27, 2017 at 10:57 AM, Michael Kerrisk
>>>> <mtk.manpages@gmail.com> wrote:
>>>>> [CC += linux-api@vger.kernel.org]
>>>>>
>>>>> Hi Willem
>>>>>
>>>>
>>>>>> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
>>>>>> creates skbuff fragments directly from these pages. On tx completion,
>>>>>> it notifies the socket owner that it is safe to modify memory by
>>>>>> queuing a completion notification onto the socket error queue.
>>>>
>>>> What happens if the user writes to the pages while it's not safe?
>>>>
>>>> How about if you're writing to an interface or a route that has crypto
>>>> involved and a malicious user can make the data change in the middle
>>>> of a crypto operation, thus perhaps leaking the entire key?  (I
>>>> wouldn't be at all surprised if a lot of provably secure AEAD
>>>> constructions are entirely compromised if an attacker can get the
>>>> ciphertext and tag computed from a message that changed during the
>>>> computation.
>>>
>>> Operations that read or write payload, such as this crypto example,
>>> but also ebpf in tc or iptables, for instance, demand a deep copy using
>>> skb_copy_ubufs before the operation.
>>>
>>> This blacklist approach requires caution, but these paths should be
>>> few and countable. It is not possible to predict at the socket layer
>>> whether a packet will encounter any such operation, so white-listing
>>> a subset of end-to-end paths is not practical.
>>
>> How about hardware that malfunctions if the packet changes out from
>> under it?  A whitelist seems quite a bit safer.
>
> These device are already choking, because as I stated this can already
> be done via sendfile().
>
> Networking card wise this isn't an issue, chips bring the entire packet
> into their FIFO, compute checksums on the fly mid-stream, and then write
> the 16-bit checksum field before starting to write the packet onto the
> wire.
>
> I think this is completely a non-issue, and we thought about this right
> from the start when sendfile() support was added nearly two decades ago.
> If network cards from back then didn't crap out in this situation I
> think the ones out there now are probably ok.

Fair enough.



Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-22 16:38 [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 01/12] sock: allocate skbs from optmem Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
2017-02-22 20:33   ` Eric Dumazet
2017-02-23  1:51     ` Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 03/12] sock: add generic socket zerocopy Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 04/12] sock: enable sendmsg zerocopy Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 05/12] sock: sendmsg zerocopy notification coalescing Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 06/12] sock: sendmsg zerocopy ulimit Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 07/12] sock: sendmsg zerocopy limit bytes per notification Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 09/12] udp: " Willem de Bruijn
2017-02-22 16:38 ` [PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL Willem de Bruijn
2017-02-22 16:39 ` [PATCH RFC v2 11/12] packet: enable sendmsg zerocopy Willem de Bruijn
2017-02-22 16:39 ` [PATCH RFC v2 12/12] test: add sendmsg zerocopy tests Willem de Bruijn
2017-02-23 15:45 ` [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY David Miller
2017-02-24 23:03 ` Alexei Starovoitov
2017-02-25  0:25   ` Willem de Bruijn
2017-02-27 18:57 ` Michael Kerrisk
2017-02-28 19:46   ` Andy Lutomirski
2017-02-28 20:43     ` Willem de Bruijn
     [not found]       ` <CAF=yD-K_0zO3pMeXf-UKGTsD4sNOdyN9KJkUb5MnCO_J5pisrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 21:06         ` Andy Lutomirski
2017-03-01  3:28           ` David Miller
2017-03-01  3:43             ` Eric Dumazet
2017-03-02 19:26             ` Andy Lutomirski
2017-02-28 21:09         ` Andy Lutomirski
2017-02-28 21:28           ` Willem de Bruijn
2017-02-28 21:47           ` Eric Dumazet
     [not found]             ` <1488318476.9415.270.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2017-02-28 22:25               ` Andy Lutomirski
     [not found]                 ` <CALCETrVQj1AEsLEGGkWW1zApGz6_x2rDmE0wz4ft+O5h07f_Ug-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-28 22:40                   ` Eric Dumazet
2017-02-28 22:52                     ` Andy Lutomirski
2017-02-28 23:22                       ` Eric Dumazet
     [not found]                         ` <1488324131.9415.278.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2017-03-01  0:28                           ` Tom Herbert
     [not found]                             ` <CALx6S357ssnbEu7CMrczEjiX25QYBJh3WG=w8KuAoxGQS4aKLA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-03-01  0:37                               ` Eric Dumazet
2017-03-01  0:58                               ` Willem de Bruijn
2017-03-01  1:50                                 ` Tom Herbert
2017-03-01  3:25     ` David Miller
