linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY
@ 2017-06-21 21:18 Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 01/13] sock: allocate skbs from optmem Willem de Bruijn
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Willem de Bruijn

From: Willem de Bruijn <willemb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Introduce zerocopy socket send flag MSG_ZEROCOPY. This extends the
shared page support (SKBTX_SHARED_FRAG) from sendpage to sendmsg.
Implement the feature for TCP, UDP, RAW and packet sockets.

On a send call with MSG_ZEROCOPY, the kernel pins user pages and
links these directly into the skbuff frags[] array.

Each send call with MSG_ZEROCOPY that transmits data will eventually
queue a completion notification on the error queue: a per-socket u32
incremented on each such call. A request may have to revert to copy
to succeed, for instance when a device cannot support scatter-gather
IO. In that case a flag is passed along to notify that the operation
succeeded without zerocopy optimization.

The implementation extends the existing zerocopy infra for tuntap,
vhost and xen with features needed for TCP, notably reference
counting to handle cloning on retransmit and GSO.

For more details, see also the netdev 2.1 paper and presentation at
https://netdevconf.org/2.1/session.html?debruijn

Changelog:

  v2 -> v3:
    - fix rebase conflict: SO_ZEROCOPY 59 -> 60

  v1 -> v2:
    - fix (kbuild-bot): do not remove uarg until patch 5
    - fix (kbuild-bot): move zerocopy_sg_from_iter doc with function
    - fix: remove unused extern in header file

  RFCv2 -> v1:
    - patch 2
        - review comment: in skb_copy_ubufs, always allocate order-0
            page, also when replacing compound source pages.
    - patch 3
        - fix: always queue completion notification on MSG_ZEROCOPY,
	    also if revert to copy.
	- fix: on syscall abort, correctly revert notification state
	- minor: skip queue notification on SOCK_DEAD
	- minor: replace BUG_ON with WARN_ON in recoverable error
    - patch 4
        - new: add socket option SOCK_ZEROCOPY.
	    only honor MSG_ZEROCOPY if set, ignore for legacy apps.
    - patch 5
        - fix: clear zerocopy state on skb_linearize
    - patch 6
        - fix: only coalesce if prev errqueue elem is zerocopy
	- minor: try coalescing with list tail instead of head
        - minor: merge bytelen limit patch
    - patch 7
        - new: signal when data had to be copied
    - patch 8 (tcp)
        - optimize: avoid setting PSH bit when exceeding max frags.
	    that limits GRO on the client. do not goto new_segment.
	- fix: fail on MSG_ZEROCOPY | MSG_FASTOPEN
	- minor: do not wait for memory: does not work for optmem
	- minor: simplify alloc
    - patch 9 (udp)
        - new: add PF_INET6
        - fix: attach zerocopy notification even if revert to copy
	- minor: simplify alloc size arithmetic
    - patch 10 (raw hdrinc)
        - new: add PF_INET6
    - patch 11 (pf_packet)
        - minor: simplify slightly
    - patch 12
        - new msg_zerocopy regression test: use veth pair to test
	    all protocols: ipv4/ipv6/packet, tcp/udp/raw, cork
	    all relevant ethtool settings: rx off, sg off
	    all relevant packet lengths: 0, <MAX_HEADER, max size

  RFC -> RFCv2:
    - review comment: do not loop skb with zerocopy frags onto rx:
          add skb_orphan_frags_rx to orphan even refcounted frags
	  call this in __netif_receive_skb_core, deliver_skb and tun:
	  the same as 1080e512d44d ("net: orphan frags on receive")
    - fix: hold an explicit sk reference on each notification skb.
          previously relied on the reference (or wmem) held by the
	  data skb that would trigger notification, but this breaks
	  on skb_orphan.
    - fix: when aborting a send, do not inc the zerocopy counter
          this caused gaps in the notification chain
    - fix: in packet with SOCK_DGRAM, pull ll headers before calling
          zerocopy_sg_from_iter
    - fix: if sock_zerocopy_realloc does not allow coalescing,
          do not fail, just allocate a new ubuf
    - fix: in tcp, check return value of second allocation attempt
    - chg: allocate notification skbs from optmem
          to avoid affecting tcp write queue accounting (TSQ)
    - chg: limit #locked pages (ulimit) per user instead of per process
    - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
    - chg: rebased to davem-net-next on top of v4.10-rc7
    - add: limit notification coalescing
          sharing ubufs limits overhead, but delays notification until
	  the last packet is released, possibly unbounded. Add a cap. 
    - tests: add snd_zerocopy_lo pf_packet test
    - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

Limitations / Known Issues:
    - TCP may build slightly smaller than max TSO packets due to
      exceeding MAX_SKB_FRAGS frags when zerocopy pages are unaligned.
    - All SKBTX_SHARED_FRAG may require additional __skb_linearize or
      skb_copy_ubufs calls in u32, skb_find_text, similar to
      skb_checksum_help.
    - UDP and RAW may be of limited use without loopback due to
      MTU being too small to see copy avoidance gains.

Notification skbuffs are allocated from optmem. For sockets that
cannot effectively coalesce notifications, the optmem max may need
to be increased to avoid hitting -ENOBUFS:

  sysctl -w net.core.optmem_max=1048576

In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.

For the single-machine veth tests to succeed, loopback support has to
be temporarily enabled by making skb_orphan_frags_rx map to
skb_orphan_frags.

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
are disabled and the kernel is booted with idle=halt.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

        --process cycles--      ----cpu cycles----
           std      zc   %      std         zc   %
4K      27,609  11,217  41      49,217  39,175  79
16K     21,370   3,823  18      43,540  29,213  67
64K     20,557   2,312  11      42,189  26,910  64
256K    21,110   2,134  10      43,006  27,104  63
1M      20,987   1,610   8      42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313                                                   
 79.41%         33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%          1396  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  1.66%           694  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.79%           325  netperf  [kernel.kallsyms]  [k] tcp_ack
  0.43%           188  netperf  [kernel.kallsyms]  [k] __alloc_skb

zc:
Samples: 1K of event 'cycles', Event count (approx.): 1439509124                                                     
 30.36%           584  netperf.zerocop  [kernel.kallsyms]  [k] gup_pte_range
 14.63%           284  netperf.zerocop  [kernel.kallsyms]  [k] __zerocopy_sg_from_iter
  8.03%           159  netperf.zerocop  [kernel.kallsyms]  [k] skb_zerocopy_add_frags_iter
  4.84%            96  netperf.zerocop  [kernel.kallsyms]  [k] __alloc_skb
  3.10%            60  netperf.zerocop  [kernel.kallsyms]  [k] kmem_cache_alloc_node


* Safety

The number of pages that can be pinned on behalf of a user with
MSG_ZEROCOPY is bound by the locked memory ulimit.

While the kernel holds process memory pinned, a process cannot safely
reuse those pages for other purposes. Packets looped onto the receive
stack and queued to a socket can be held indefinitely. Avoid unbounded
notification latency by restricting user pages to egress paths only.
skb_orphan_frags_rx() will create a private copy of pages even for
refcounted packets when these are looped, as did skb_orphan_frags for
the original tun zerocopy implementation.

Pages are not remapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.
Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. TC filters that access contents may have to be
excluded by adding an skb_orphan_frags_rx. 

Processes can also safely avoid OOM conditions by bounding the number
of bytes passed with MSG_ZEROCOPY and by removing shared pages after
transmission from their own memory map.


Willem de Bruijn (13):
  sock: allocate skbs from optmem
  sock: skb_copy_ubufs support for compound pages
  sock: add MSG_ZEROCOPY
  sock: add SOCK_ZEROCOPY sockopt
  sock: enable MSG_ZEROCOPY
  sock: MSG_ZEROCOPY notification coalescing
  sock: add ee_code SO_EE_CODE_ZEROCOPY_COPIED
  sock: ulimit on MSG_ZEROCOPY pages
  tcp: enable MSG_ZEROCOPY
  udp: enable MSG_ZEROCOPY
  raw: enable MSG_ZEROCOPY with IP_HDRINCL
  packet: enable MSG_ZEROCOPY
  test: add msg_zerocopy test

 arch/alpha/include/uapi/asm/socket.h        |   2 +
 arch/frv/include/uapi/asm/socket.h          |   2 +
 arch/ia64/include/uapi/asm/socket.h         |   2 +
 arch/m32r/include/uapi/asm/socket.h         |   2 +
 arch/mips/include/uapi/asm/socket.h         |   2 +
 arch/mn10300/include/uapi/asm/socket.h      |   2 +
 arch/parisc/include/uapi/asm/socket.h       |   2 +
 arch/s390/include/uapi/asm/socket.h         |   2 +
 arch/sparc/include/uapi/asm/socket.h        |   2 +
 arch/xtensa/include/uapi/asm/socket.h       |   2 +
 drivers/net/tun.c                           |   2 +-
 drivers/vhost/net.c                         |   1 +
 include/linux/sched/user.h                  |   3 +-
 include/linux/skbuff.h                      | 106 ++++-
 include/linux/socket.h                      |   1 +
 include/net/sock.h                          |   4 +
 include/uapi/asm-generic/socket.h           |   2 +
 include/uapi/linux/errqueue.h               |   3 +
 net/core/datagram.c                         |  55 ++-
 net/core/dev.c                              |   4 +-
 net/core/skbuff.c                           | 390 ++++++++++++++--
 net/core/sock.c                             |  43 ++
 net/ipv4/ip_output.c                        |  37 +-
 net/ipv4/raw.c                              |  23 +-
 net/ipv4/tcp.c                              |  32 +-
 net/ipv6/ip6_output.c                       |  40 +-
 net/ipv6/raw.c                              |  20 +-
 net/packet/af_packet.c                      |  36 +-
 tools/testing/selftests/net/.gitignore      |   1 +
 tools/testing/selftests/net/Makefile        |   2 +-
 tools/testing/selftests/net/msg_zerocopy.c  | 697 ++++++++++++++++++++++++++++
 tools/testing/selftests/net/msg_zerocopy.sh | 113 +++++
 32 files changed, 1535 insertions(+), 100 deletions(-)
 create mode 100644 tools/testing/selftests/net/msg_zerocopy.c
 create mode 100755 tools/testing/selftests/net/msg_zerocopy.sh

-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 01/13] sock: allocate skbs from optmem
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
       [not found] ` <20170621211816.53837-1-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add sock_omalloc and sock_ofree to be able to allocate control skbs,
for instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 27 +++++++++++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 00d09140e354..101247d63914 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1527,6 +1527,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6f4b090241c1..34b5dcd93340 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1923,6 +1923,33 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 }
 EXPORT_SYMBOL(sock_wmalloc);
 
+static void sock_ofree(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	atomic_sub(skb->truesize, &sk->sk_omem_alloc);
+}
+
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority)
+{
+	struct sk_buff *skb;
+
+	/* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
+	if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
+	    sysctl_optmem_max)
+		return NULL;
+
+	skb = alloc_skb(size, priority);
+	if (!skb)
+		return NULL;
+
+	atomic_add(skb->truesize, &sk->sk_omem_alloc);
+	skb->sk = sk;
+	skb->destructor = sock_ofree;
+	return skb;
+}
+
 /*
  * Allocate a memory block from the socket's option memory buffer.
  */
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
       [not found] ` <20170621211816.53837-1-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-06-21 21:18   ` Willem de Bruijn
       [not found]     ` <20170621211816.53837-3-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-06-21 21:18   ` [PATCH net-next v3 13/13] test: add msg_zerocopy test Willem de Bruijn
  1 sibling, 1 reply; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Willem de Bruijn

From: Willem de Bruijn <willemb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
and UDP zerocopy sendmsg, such fragments may appear.

The existing code replaces each page one for one. Splitting each
compound page into an independent number of regular pages can result
in exceeding limit MAX_SKB_FRAGS.

Instead, fill all destination pages but the last to PAGE_SIZE.
Split the existing alloc + copy loop into separate stages. Compute
the bytelength and allocate the minimum number of pages needed to
hold this. Revise the copy loop to fill each destination page.

It is not safe to modify skb frags when the skbuff is shared. No
existing codepath should hit this case.

Eventually, this fragile function can perhaps be replaced with calls
to skb_linearize -- when converted to not always require GFP_ATOMIC.

Signed-off-by: Willem de Bruijn <willemb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 include/linux/skbuff.h |  9 +++++++--
 net/core/skbuff.c      | 49 ++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a17e235639ae..81c17bd41661 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1783,13 +1783,18 @@ static inline unsigned int skb_headlen(const struct sk_buff *skb)
 	return skb->len - skb->data_len;
 }
 
-static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int __skb_pagelen(const struct sk_buff *skb)
 {
 	unsigned int i, len = 0;
 
 	for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
 		len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
-	return len + skb_headlen(skb);
+	return len;
+}
+
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
+{
+	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
 /**
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f75897a33fa4..96db26594192 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -958,15 +958,20 @@ EXPORT_SYMBOL_GPL(skb_morph);
  */
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
+	int i, new_frags;
+	u32 d_off;
 
-	for (i = 0; i < num_frags; i++) {
-		u8 *vaddr;
-		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+	if (!num_frags)
+		return 0;
 
+	if (skb_shared(skb) || skb_unclone(skb, gfp_mask))
+		return -EINVAL;
+
+	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < new_frags; i++) {
 		page = alloc_page(gfp_mask);
 		if (!page) {
 			while (head) {
@@ -976,14 +981,35 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 			}
 			return -ENOMEM;
 		}
-		vaddr = kmap_atomic(skb_frag_page(f));
-		memcpy(page_address(page),
-		       vaddr + f->page_offset, skb_frag_size(f));
-		kunmap_atomic(vaddr);
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
 
+	page = head;
+	d_off = 0;
+	for (i = 0; i < num_frags; i++) {
+		u8 *vaddr;
+		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
+		u32 f_off, f_size, copy;
+
+		f_off = f->page_offset;
+		f_size = f->size;
+
+		vaddr = kmap_atomic(skb_frag_page(f));
+		while (f_size) {
+			if (d_off == PAGE_SIZE) {
+				d_off = 0;
+				page = (struct page *)page_private(page);
+			}
+			copy = min_t(u32, PAGE_SIZE - d_off, f_size);
+			memcpy(page_address(page) + d_off, vaddr + f_off, copy);
+			f_size -= copy;
+			d_off += copy;
+			f_off += copy;
+		}
+		kunmap_atomic(vaddr);
+	}
+
 	/* skb frags release userspace buffers */
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
@@ -991,11 +1017,12 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	uarg->callback(uarg, false);
 
 	/* skb frags point to kernel buffers */
-	for (i = num_frags - 1; i >= 0; i--) {
-		__skb_fill_page_desc(skb, i, head, 0,
-				     skb_shinfo(skb)->frags[i].size);
+	for (i = 0; i < new_frags - 1; i++) {
+		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
 		head = (struct page *)page_private(head);
 	}
+	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
+	skb_shinfo(skb)->nr_frags = new_frags;
 
 	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
 	return 0;
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 03/13] sock: add MSG_ZEROCOPY
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 01/13] sock: allocate skbs from optmem Willem de Bruijn
       [not found] ` <20170621211816.53837-1-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-06-21 21:18 ` Willem de Bruijn
       [not found]   ` <20170621211816.53837-4-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-06-21 21:18 ` [PATCH net-next v3 04/13] sock: add SOCK_ZEROCOPY sockopt Willem de Bruijn
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |  61 ++++++++++++++++
 include/linux/socket.h        |   1 +
 include/net/sock.h            |   2 +
 include/uapi/linux/errqueue.h |   1 +
 net/core/datagram.c           |  55 +++++++++------
 net/core/skbuff.c             | 157 ++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   2 +
 7 files changed, 258 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 81c17bd41661..76fdc81270c0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -393,6 +393,7 @@ enum {
 	SKBTX_SCHED_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP)
 #define SKBTX_ANY_TSTAMP	(SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)
@@ -409,8 +410,30 @@ struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
 	unsigned long desc;
+	atomic_t refcnt;
 };
 
+#define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+	atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+int skb_zerocopy_iter(struct sock *sk, struct sk_buff *skb, struct msghdr *msg,
+		      int len);
+int skb_zerocopy_iter_alloc(struct sk_buff *skb, struct msghdr *msg, int len);
+int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
+			     struct msghdr *msg, int len,
+			     struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1201,6 +1224,44 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+	return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+	if (skb && uarg && !skb_zcopy(skb)) {
+		sock_zerocopy_get(uarg);
+		skb_shinfo(skb)->destructor_arg = uarg;
+		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+/* Release a reference on a zerocopy structure */
+static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+/* Abort a zerocopy operation and revert zckey on error in send syscall */
+static inline void skb_zcopy_abort(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put_abort(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
 /**
  *	skb_queue_empty - check if a queue is empty
  *	@list: queue head
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8b13db5163cc..8ad963cdc88c 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
 #define MSG_BATCH	0x40000 /* sendmmsg(): more messages coming */
 #define MSG_EOF         MSG_FIN
 
+#define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/include/net/sock.h b/include/net/sock.h
index 101247d63914..d1a473409826 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -290,6 +290,7 @@ struct sock_common {
   *	@sk_stamp: time stamp of last packet received
   *	@sk_tsflags: SO_TIMESTAMPING socket options
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
   *	@sk_socket: Identd and reporting IO signals
   *	@sk_user_data: RPC layer private data
   *	@sk_frag: cached page frag
@@ -458,6 +459,7 @@ struct sock {
 	u16			sk_tsflags;
 	u8			sk_shutdown;
 	u32			sk_tskey;
+	atomic_t		sk_zckey;
 	struct socket		*sk_socket;
 	void			*sk_user_data;
 #ifdef CONFIG_SECURITY
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1f444a..0f15a77c9e39 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -18,6 +18,7 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_ZEROCOPY	5
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e5311a7c70da..2a1a52971630 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -573,27 +573,12 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 
-/**
- *	zerocopy_sg_from_iter - Build a zerocopy datagram from an iov_iter
- *	@skb: buffer to copy
- *	@from: the source to copy from
- *
- *	The function will first copy up to headlen, and then pin the userspace
- *	pages and build frags through them.
- *
- *	Returns 0, -EFAULT or -EMSGSIZE.
- */
-int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length)
 {
-	int len = iov_iter_count(from);
-	int copy = min_t(int, skb_headlen(skb), len);
-	int frag = 0;
+	int frag = skb_shinfo(skb)->nr_frags;
 
-	/* copy up to skb headlen */
-	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
-		return -EFAULT;
-
-	while (iov_iter_count(from)) {
+	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		size_t start;
 		ssize_t copied;
@@ -603,18 +588,24 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, ~0U,
+		copied = iov_iter_get_pages(from, pages, length,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
 
 		iov_iter_advance(from, copied);
+		length -= copied;
 
 		truesize = PAGE_ALIGN(copied + start);
 		skb->data_len += copied;
 		skb->len += copied;
 		skb->truesize += truesize;
-		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		if (sk && sk->sk_type == SOCK_STREAM) {
+			sk->sk_wmem_queued += truesize;
+			sk_mem_charge(sk, truesize);
+		} else {
+			atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		}
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			skb_fill_page_desc(skb, frag++, pages[n], start, size);
@@ -625,6 +616,28 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(__zerocopy_sg_from_iter);
+
+/**
+ *	zerocopy_sg_from_iter - Build a zerocopy datagram from an iov_iter
+ *	@skb: buffer to copy
+ *	@from: the source to copy from
+ *
+ *	The function will first copy up to headlen, and then pin the userspace
+ *	pages and build frags through them.
+ *
+ *	Returns 0, -EFAULT or -EMSGSIZE.
+ */
+int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+{
+	int copy = min_t(int, skb_headlen(skb), iov_iter_count(from));
+
+	/* copy up to skb headlen */
+	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
+		return -EFAULT;
+
+	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+}
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 96db26594192..31e31185c557 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -941,6 +941,163 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
+{
+	struct sk_buff *skb;
+	struct ubuf_info *uarg;
+
+	WARN_ON_ONCE(!in_task());
+
+	skb = sock_omalloc(sk, 0, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
+	uarg = (void *)skb->cb;
+
+	uarg->callback = sock_zerocopy_callback;
+	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	atomic_set(&uarg->refcnt, 0);
+	sock_hold(sk);
+
+	return uarg;
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_alloc);
+
+static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
+{
+	return container_of((void *)uarg, struct sk_buff, cb);
+}
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
+{
+	struct sock_exterr_skb *serr;
+	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sock *sk = skb->sk;
+	u16 id = uarg->desc;
+
+	if (sock_flag(sk, SOCK_DEAD))
+		goto release;
+
+	serr = SKB_EXT_ERR(skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
+	serr->ee.ee_data = id;
+
+	skb_queue_tail(&sk->sk_error_queue, skb);
+	skb = NULL;
+
+	sk->sk_error_report(sk);
+
+release:
+	consume_skb(skb);
+	sock_put(sk);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
+
+void sock_zerocopy_put(struct ubuf_info *uarg)
+{
+	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		if (uarg->callback)
+			uarg->callback(uarg, true);
+		else
+			consume_skb(skb_from_uarg(uarg));
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put);
+
+void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+{
+	if (uarg) {
+		struct sock *sk = skb_from_uarg(uarg)->sk;
+
+		atomic_dec(&sk->sk_zckey);
+
+		/* sock_zerocopy_put expects a ref. Most sockets take one per
+		 * skb, which is zero on abort. tcp_sendmsg holds one extra, to
+		 * avoid an skb send inside the main loop triggering uarg free.
+		 */
+		if (sk->sk_type != SOCK_STREAM)
+			atomic_inc(&uarg->refcnt);
+
+		sock_zerocopy_put(uarg);
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
+
+extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+				   struct iov_iter *from, size_t length);
+
+int skb_zerocopy_iter(struct sock *sk, struct sk_buff *skb, struct msghdr *msg,
+		      int len)
+{
+	return __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter);
+
+int skb_zerocopy_iter_alloc(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+	struct ubuf_info *uarg;
+	int err;
+
+	uarg = sock_zerocopy_alloc(skb->sk, len);
+	if (!uarg)
+		return -ENOBUFS;
+
+	err = skb_zerocopy_iter(skb->sk, skb, msg, len);
+	if (err) {
+		sock_zerocopy_put_abort(uarg);
+		return err;
+	}
+
+	skb_zcopy_set(skb, uarg);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_alloc);
+
+int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
+			     struct msghdr *msg, int len,
+			     struct ubuf_info *uarg)
+{
+	struct iov_iter orig_iter = msg->msg_iter;
+	int err, orig_len = skb->len;
+
+	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
+		/* Streams do not free skb on error. Reset to prev state. */
+		msg->msg_iter = orig_iter;
+		___pskb_trim(skb, orig_len);
+		return err;
+	}
+
+	skb_zcopy_set(skb, uarg);
+	return skb->len - orig_len;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
+
+/* unused only until next patch in the series; will remove attribute */
+static int __attribute__((unused))
+	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+			      gfp_t gfp_mask)
+{
+	if (skb_zcopy(orig)) {
+		if (skb_zcopy(nskb)) {
+			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
+			if (!gfp_mask) {
+				WARN_ON_ONCE(1);
+				return -ENOMEM;
+			}
+			if (skb_uarg(nskb) == skb_uarg(orig))
+				return 0;
+			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
+				return -EIO;
+		}
+		skb_zcopy_set(nskb, skb_uarg(orig));
+	}
+	return 0;
+}
+
 /**
  *	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
  *	@skb: the skb to modify
diff --git a/net/core/sock.c b/net/core/sock.c
index 34b5dcd93340..2822d30f2ad2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1670,6 +1670,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		atomic_set(&newsk->sk_drops, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
+		atomic_set(&newsk->sk_zckey, 0);
 
 		sock_reset_flag(newsk, SOCK_DONE);
 
@@ -2700,6 +2701,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_sndtimeo		=	MAX_SCHEDULE_TIMEOUT;
 
 	sk->sk_stamp = SK_DEFAULT_STAMP;
+	atomic_set(&sk->sk_zckey, 0);
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id		=	0;
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 04/13] sock: add SOCK_ZEROCOPY sockopt
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (2 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 03/13] sock: add MSG_ZEROCOPY Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 05/13] sock: enable MSG_ZEROCOPY Willem de Bruijn
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The send call ignores unknown flags. Legacy applications may already
unwittingly pass MSG_ZEROCOPY. Continue to ignore this flag unless a
socket opts in to zerocopy.

Introduce socket option SO_ZEROCOPY to enable MSG_ZEROCOPY processing.
Processes can also query this socket option to detect kernel support
for the feature. Older kernels will return ENOPROTOOPT.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 arch/alpha/include/uapi/asm/socket.h   |  2 ++
 arch/frv/include/uapi/asm/socket.h     |  2 ++
 arch/ia64/include/uapi/asm/socket.h    |  2 ++
 arch/m32r/include/uapi/asm/socket.h    |  2 ++
 arch/mips/include/uapi/asm/socket.h    |  2 ++
 arch/mn10300/include/uapi/asm/socket.h |  2 ++
 arch/parisc/include/uapi/asm/socket.h  |  2 ++
 arch/s390/include/uapi/asm/socket.h    |  2 ++
 arch/sparc/include/uapi/asm/socket.h   |  2 ++
 arch/xtensa/include/uapi/asm/socket.h  |  2 ++
 include/uapi/asm-generic/socket.h      |  2 ++
 net/core/skbuff.c                      |  3 +++
 net/core/sock.c                        | 14 ++++++++++++++
 13 files changed, 39 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 7b285dd4fe05..c6133a045352 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -109,4 +109,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index f1e3b20dce9f..9abf02d6855a 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -102,5 +102,7 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 5dd5c5d0d642..002eb85a6941 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -111,4 +111,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index f8f7b47e247f..e268e51a38d1 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 882823bec153..6c755bc07975 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -120,4 +120,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index c710db354ff2..ac82a3f26dbf 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -102,4 +102,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index a0d4dc9f4eb2..3b2bf7ae703b 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -101,4 +101,6 @@
 
 #define SO_PEERGROUPS		0x4034
 
+#define SO_ZEROCOPY		0x4035
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 52a63f4175cb..a56916c83565 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -108,4 +108,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 186fd8199f54..b2f5c50d0947 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -98,6 +98,8 @@
 
 #define SO_PEERGROUPS		0x003d
 
+#define SO_ZEROCOPY		0x003e
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 3eed2761c149..220059999e74 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -113,4 +113,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 9861be8da65e..e47c9e436221 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -104,4 +104,6 @@
 
 #define SO_PEERGROUPS		59
 
+#define SO_ZEROCOPY		60
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 31e31185c557..34fa1f2b7ff2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -948,6 +948,9 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 
 	WARN_ON_ONCE(!in_task());
 
+	if (!sock_flag(sk, SOCK_ZEROCOPY))
+		return NULL;
+
 	skb = sock_omalloc(sk, 0, GFP_KERNEL);
 	if (!skb)
 		return NULL;
diff --git a/net/core/sock.c b/net/core/sock.c
index 2822d30f2ad2..14f317d9a2bb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1055,6 +1055,16 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		if (val == 1)
 			dst_negative_advice(sk);
 		break;
+
+	case SO_ZEROCOPY:
+		if (sk->sk_state != TCP_CLOSE)
+			ret = -EBUSY;
+		else if (val < 0 || val > 1)
+			ret = -EINVAL;
+		else
+			sock_valbool_flag(sk, SOCK_ZEROCOPY, valbool);
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -1383,6 +1393,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		v.val64 = sock_gen_cookie(sk);
 		break;
 
+	case SO_ZEROCOPY:
+		v.val = sock_flag(sk, SOCK_ZEROCOPY);
+		break;
+
 	default:
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 7).
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 05/13] sock: enable MSG_ZEROCOPY
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (3 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 04/13] sock: add SOCK_ZEROCOPY sockopt Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 06/13] sock: MSG_ZEROCOPY notification coalescing Willem de Bruijn
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/net/tun.c      |  2 +-
 drivers/vhost/net.c    |  1 +
 include/linux/skbuff.h | 14 +++++++++++++-
 net/core/dev.c         |  4 ++--
 net/core/skbuff.c      | 48 +++++++++++++++++++-----------------------------
 5 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ae49f4b99b67..9b1225cf555f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -892,7 +892,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	    sk_filter(tfile->socket.sk, skb))
 		goto drop;
 
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		goto drop;
 
 	skb_tx_timestamp(skb);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e3d7ea1288c6..614a577a0a66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -533,6 +533,7 @@ static void handle_tx(struct vhost_net *net)
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
+			atomic_set(&ubuf->refcnt, 1);
 			msg.msg_control = ubuf;
 			msg.msg_controllen = sizeof(ubuf);
 			ubufs = nvq->ubufs;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 76fdc81270c0..c20248b26b53 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2495,7 +2495,17 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+	if (likely(!skb_zcopy(skb)))
+		return 0;
+	if (skb_uarg(skb)->callback == sock_zerocopy_callback)
+		return 0;
+	return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
+static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
+{
+	if (likely(!skb_zcopy(skb)))
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2927,6 +2937,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
+	if (skb_zcopy(skb))
+		return false;
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 5d1830b8d2cf..c0b4c54f3b52 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1860,7 +1860,7 @@ static inline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
 {
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 		return -ENOMEM;
 	atomic_inc(&skb->users);
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
@@ -4292,7 +4292,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
 	}
 
 	if (pt_prev) {
-		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
 			goto drop;
 		else
 			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 34fa1f2b7ff2..89ba5ad024a5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -592,21 +592,10 @@ static void skb_release_data(struct sk_buff *skb)
 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i]);
 
-	/*
-	 * If skb buf is from userspace, we need to notify the caller
-	 * the lower device DMA has done;
-	 */
-	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = shinfo->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, true);
-	}
-
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
+	skb_zcopy_clear(skb, true);
 	skb_free_head(skb);
 }
 
@@ -720,14 +709,7 @@ EXPORT_SYMBOL(kfree_skb_list);
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = skb_shinfo(skb)->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, false);
-		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
-	}
+	skb_zcopy_clear(skb, true);
 }
 EXPORT_SYMBOL(skb_tx_error);
 
@@ -1079,9 +1061,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
 
-/* unused only until next patch in the series; will remove attribute */
-static int __attribute__((unused))
-	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
 	if (skb_zcopy(orig)) {
@@ -1120,7 +1100,6 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 	int i, new_frags;
 	u32 d_off;
 
@@ -1174,8 +1153,6 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
 
-	uarg->callback(uarg, false);
-
 	/* skb frags point to kernel buffers */
 	for (i = 0; i < new_frags - 1; i++) {
 		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
@@ -1184,7 +1161,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
 	skb_shinfo(skb)->nr_frags = new_frags;
 
-	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	skb_zcopy_clear(skb, false);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(skb_copy_ubufs);
@@ -1345,7 +1322,8 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 	if (skb_shinfo(skb)->nr_frags) {
 		int i;
 
-		if (skb_orphan_frags(skb, gfp_mask)) {
+		if (skb_orphan_frags(skb, gfp_mask) ||
+		    skb_zerocopy_clone(n, skb, gfp_mask)) {
 			kfree_skb(n);
 			n = NULL;
 			goto out;
@@ -1422,9 +1400,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 * be since all we did is relocate the values
 	 */
 	if (skb_cloned(skb)) {
-		/* copy this zero copy skb frags */
 		if (skb_orphan_frags(skb, gfp_mask))
 			goto nofrags;
+		if (skb_zcopy(skb))
+			atomic_inc(&skb_uarg(skb)->refcnt);
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			skb_frag_ref(skb, i);
 
@@ -1916,6 +1895,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta)
 	skb->tail     += delta;
 	skb->data_len -= delta;
 
+	if (!skb->data_len)
+		skb_zcopy_clear(skb, false);
+
 	return skb_tail_pointer(skb);
 }
 EXPORT_SYMBOL(__pskb_pull_tail);
@@ -2547,6 +2529,7 @@ skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		skb_tx_error(from);
 		return -ENOMEM;
 	}
+	skb_zerocopy_clone(to, from, GFP_ATOMIC);
 
 	for (i = 0; i < skb_shinfo(from)->nr_frags; i++) {
 		if (!len)
@@ -2844,6 +2827,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
 
 	skb_shinfo(skb1)->tx_flags |= skb_shinfo(skb)->tx_flags &
 				      SKBTX_SHARED_FRAG;
+	skb_zerocopy_clone(skb1, skb, 0);
 	if (len < pos)	/* Split line is inside header. */
 		skb_split_inside_header(skb, skb1, len, pos);
 	else		/* Second chunk has no header, nothing to copy. */
@@ -2887,6 +2871,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	if (skb_headlen(skb))
 		return 0;
+	if (skb_zcopy(tgt) || skb_zcopy(skb))
+		return 0;
 
 	todo = shiftlen;
 	from = 0;
@@ -3460,6 +3446,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		skb_shinfo(nskb)->tx_flags |= skb_shinfo(head_skb)->tx_flags &
 					      SKBTX_SHARED_FRAG;
+		if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
+			goto err;
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
@@ -4583,6 +4571,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 	if (skb_has_frag_list(to) || skb_has_frag_list(from))
 		return false;
+	if (skb_zcopy(to) || skb_zcopy(from))
+		return false;
 
 	if (skb_headlen(from) != 0) {
 		struct page *page;
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 06/13] sock: MSG_ZEROCOPY notification coalescing
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (4 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 05/13] sock: enable MSG_ZEROCOPY Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
       [not found]   ` <20170621211816.53837-7-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-06-21 21:18 ` [PATCH net-next v3 07/13] sock: add ee_code SO_EE_CODE_ZEROCOPY_COPIED Willem de Bruijn
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause send() calls to append new data to an
existing sk_buff and, thus, ubuf_info. In that case the notification
must hold a range. odify ubuf_info to store a inclusive range [N..N+m]
and add skb_zerocopy_realloc() to optionally extend an existing range.

Also coalesce notifications in this common case: if a notification
[1, 1] is about to be queued while [0, 0] is the queue tail, just modify
the head of the queue to read [0, 1].

Coalescing is limited to a few TSO frames worth of data to bound
notification latency.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 15 ++++++--
 net/core/skbuff.c      | 98 ++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c20248b26b53..554017a3aa3b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -408,14 +408,25 @@ enum {
  */
 struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
-	void *ctx;
-	unsigned long desc;
+	union {
+		struct {
+			unsigned long desc;
+			void *ctx;
+		};
+		struct {
+			u32 id;
+			u16 len;
+			u32 bytelen;
+		};
+	};
 	atomic_t refcnt;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg);
 
 static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 89ba5ad024a5..41fedc4e651f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -941,7 +941,9 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg = (void *)skb->cb;
 
 	uarg->callback = sock_zerocopy_callback;
-	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+	uarg->len = 1;
+	uarg->bytelen = size;
 	atomic_set(&uarg->refcnt, 0);
 	sock_hold(sk);
 
@@ -954,24 +956,98 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg)
 	return container_of((void *)uarg, struct sk_buff, cb);
 }
 
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg)
+{
+	if (uarg) {
+		const u32 byte_limit = 1 << 19;		/* limit to a few TSO */
+		u32 bytelen, next;
+
+		/* realloc only when socket is locked (TCP, UDP cork),
+		 * so uarg->len and sk_zckey access is serialized
+		 */
+		if (!sock_owned_by_user(sk)) {
+			WARN_ON_ONCE(1);
+			return NULL;
+		}
+
+		bytelen = uarg->bytelen + size;
+		if (uarg->len == USHRT_MAX - 1 || bytelen > byte_limit) {
+			/* TCP can create new skb to attach new uarg */
+			if (sk->sk_type == SOCK_STREAM)
+				goto new_alloc;
+			return NULL;
+		}
+
+		next = (u32)atomic_read(&sk->sk_zckey);
+		if ((u32)(uarg->id + uarg->len) == next) {
+			uarg->len++;
+			uarg->bytelen = bytelen;
+			atomic_set(&sk->sk_zckey, ++next);
+			return uarg;
+		}
+	}
+
+new_alloc:
+	return sock_zerocopy_alloc(sk, size);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
+
+static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	u64 sum_len;
+	u32 old_lo, old_hi;
+
+	old_lo = serr->ee.ee_info;
+	old_hi = serr->ee.ee_data;
+	sum_len = old_hi - old_lo + 1ULL + len;
+
+	if (sum_len >= (1ULL << 32))
+		return false;
+
+	if (lo != old_hi + 1)
+		return false;
+
+	serr->ee.ee_data += len;
+	return true;
+}
+
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 {
 	struct sock_exterr_skb *serr;
-	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sk_buff *tail, *skb = skb_from_uarg(uarg);
 	struct sock *sk = skb->sk;
-	u16 id = uarg->desc;
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	unsigned long flags;
+	u32 lo, hi;
+	u16 len;
 
-	if (sock_flag(sk, SOCK_DEAD))
+	/* if !len, there was only 1 call, and it was aborted
+	 * so do not queue a completion notification
+	 */
+	if (!uarg->len || sock_flag(sk, SOCK_DEAD))
 		goto release;
 
+	len = uarg->len;
+	lo = uarg->id;
+	hi = uarg->id + len - 1;
+
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = 0;
 	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
-	serr->ee.ee_data = id;
+	serr->ee.ee_data = hi;
+	serr->ee.ee_info = lo;
 
-	skb_queue_tail(&sk->sk_error_queue, skb);
-	skb = NULL;
+	spin_lock_irqsave(&q->lock, flags);
+	tail = skb_peek_tail(q);
+	if (!tail || SKB_EXT_ERR(tail)->ee.ee_origin != SO_EE_ORIGIN_ZEROCOPY ||
+	    !skb_zerocopy_notify_extend(tail, lo, len)) {
+		__skb_queue_tail(q, skb);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
 
 	sk->sk_error_report(sk);
 
@@ -998,6 +1074,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
 		struct sock *sk = skb_from_uarg(uarg)->sk;
 
 		atomic_dec(&sk->sk_zckey);
+		uarg->len--;
 
 		/* sock_zerocopy_put expects a ref. Most sockets take one per
 		 * skb, which is zero on abort. tcp_sendmsg holds one extra, to
@@ -1045,9 +1122,16 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg)
 {
+	struct ubuf_info *orig_uarg = skb_zcopy(skb);
 	struct iov_iter orig_iter = msg->msg_iter;
 	int err, orig_len = skb->len;
 
+	/* An skb can only point to one uarg. This edge case happens when
+	 * TCP appends to an skb, but zerocopy_realloc triggered a new alloc.
+	 */
+	if (orig_uarg && uarg != orig_uarg)
+		return -EEXIST;
+
 	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
 	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
 		/* Streams do not free skb on error. Reset to prev state. */
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 07/13] sock: add ee_code SO_EE_CODE_ZEROCOPY_COPIED
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (5 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 06/13] sock: MSG_ZEROCOPY notification coalescing Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 08/13] sock: ulimit on MSG_ZEROCOPY pages Willem de Bruijn
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

A send request with MSG_ZEROCOPY falls back to copying when a request
cannot be completed in zerocopy mode.

Notify processes when the call reverted to copying through a new bit
in the completion notification, so that they can optionally decline
passing the flag on subsequent calls on the same path.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        | 2 ++
 include/uapi/linux/errqueue.h | 2 ++
 net/core/skbuff.c             | 5 ++++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 554017a3aa3b..3b8c006b8592 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -416,6 +416,7 @@ struct ubuf_info {
 		struct {
 			u32 id;
 			u16 len;
+			u16 zerocopy:1;
 			u32 bytelen;
 		};
 	};
@@ -1257,6 +1258,7 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy)
 	struct ubuf_info *uarg = skb_zcopy(skb);
 
 	if (uarg) {
+		uarg->zerocopy = uarg->zerocopy && zerocopy;
 		sock_zerocopy_put(uarg);
 		skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
 	}
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 0f15a77c9e39..78fdf52d6b2f 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -23,6 +23,8 @@ struct sock_extended_err {
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
 
+#define SO_EE_CODE_ZEROCOPY_COPIED	1
+
 /**
  *	struct scm_timestamping - timestamps exposed through cmsg
  *
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 41fedc4e651f..35a7a7b1192f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -943,6 +943,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
 	uarg->len = 1;
+	uarg->zerocopy = 1;
 	uarg->bytelen = size;
 	atomic_set(&uarg->refcnt, 0);
 	sock_hold(sk);
@@ -1039,6 +1040,8 @@ void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
 	serr->ee.ee_data = hi;
 	serr->ee.ee_info = lo;
+	if (!success)
+		serr->ee.ee_code |= SO_EE_CODE_ZEROCOPY_COPIED;
 
 	spin_lock_irqsave(&q->lock, flags);
 	tail = skb_peek_tail(q);
@@ -1061,7 +1064,7 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 {
 	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
 		if (uarg->callback)
-			uarg->callback(uarg, true);
+			uarg->callback(uarg, uarg->zerocopy);
 		else
 			consume_skb(skb_from_uarg(uarg));
 	}
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 08/13] sock: ulimit on MSG_ZEROCOPY pages
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (6 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 07/13] sock: add ee_code SO_EE_CODE_ZEROCOPY_COPIED Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 09/13] tcp: enable MSG_ZEROCOPY Willem de Bruijn
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages commit 789f90fcf6b0 ("perf_counter: per user mlock gift")

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/sched/user.h |  3 ++-
 include/linux/skbuff.h     |  5 +++++
 net/core/skbuff.c          | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 5d5415e129d4..3c07e4135127 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -36,7 +36,8 @@ struct user_struct {
 	struct hlist_node uidhash_node;
 	kuid_t uid;
 
-#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
+    defined(CONFIG_NET)
 	atomic_long_t locked_vm;
 #endif
 };
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3b8c006b8592..cf7fe9f04afb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -421,6 +421,11 @@ struct ubuf_info {
 		};
 	};
 	atomic_t refcnt;
+
+	struct mmpin {
+		struct user_struct *user;
+		unsigned int num_pg;
+	} mmp;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 35a7a7b1192f..0beaf961f79c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -923,6 +923,44 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+{
+	unsigned long max_pg, num_pg, new_pg, old_pg;
+	struct user_struct *user;
+
+	if (capable(CAP_IPC_LOCK) || !size)
+		return 0;
+
+	num_pg = (size >> PAGE_SHIFT) + 2;	/* worst case */
+	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	user = mmp->user ? : current_user();
+
+	do {
+		old_pg = atomic_long_read(&user->locked_vm);
+		new_pg = old_pg + num_pg;
+		if (new_pg > max_pg)
+			return -ENOBUFS;
+	} while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
+		 old_pg);
+
+	if (!mmp->user) {
+		mmp->user = get_uid(user);
+		mmp->num_pg = num_pg;
+	} else {
+		mmp->num_pg += num_pg;
+	}
+
+	return 0;
+}
+
+static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+{
+	if (mmp->user) {
+		atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+		free_uid(mmp->user);
+	}
+}
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
 	struct sk_buff *skb;
@@ -939,6 +977,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 
 	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
 	uarg = (void *)skb->cb;
+	uarg->mmp.user = NULL;
+
+	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+		kfree_skb(skb);
+		return NULL;
+	}
 
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
@@ -982,6 +1026,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 
 		next = (u32)atomic_read(&sk->sk_zckey);
 		if ((u32)(uarg->id + uarg->len) == next) {
+			if (mm_account_pinned_pages(&uarg->mmp, size))
+				return NULL;
 			uarg->len++;
 			uarg->bytelen = bytelen;
 			atomic_set(&sk->sk_zckey, ++next);
@@ -1063,6 +1109,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
 	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		mm_unaccount_pinned_pages(&uarg->mmp);
+
 		if (uarg->callback)
 			uarg->callback(uarg, uarg->zerocopy);
 		else
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 09/13] tcp: enable MSG_ZEROCOPY
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (7 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 08/13] sock: ulimit on MSG_ZEROCOPY pages Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 10/13] udp: " Willem de Bruijn
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable support for MSG_ZEROCOPY to the TCP stack. TSO and GSO are
both supported. Only data sent to remote destinations is sent without
copying. Packets looped onto a local destination have their payload
copied to avoid unbounded latency.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  msg_zerocopy.sh 4 tcp:

  without zerocopy
    tx=121792 (7600 MB) txc=0 zc=n
    rx=60458 (7600 MB)

  with zerocopy
    tx=286257 (17863 MB) txc=286257 zc=y
    rx=140022 (17863 MB)

  This test opens a pair of sockets over veth, one one calls send with
  64KB and optionally MSG_ZEROCOPY and on the other reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 058f509ca98e..02a91ac8d71b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1149,6 +1149,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	struct sockcm_cookie sockc;
+	struct ubuf_info *uarg = NULL;
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool process_backlog = false;
@@ -1158,6 +1159,26 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	lock_sock(sk);
 
 	flags = msg->msg_flags;
+
+	if (flags & MSG_ZEROCOPY && size) {
+		if (sk->sk_state != TCP_ESTABLISHED) {
+			err = -EINVAL;
+			goto out_err;
+		}
+
+		skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+		if (!uarg) {
+			err = -ENOBUFS;
+			goto out_err;
+		}
+
+		/* skb may be freed in main loop, keep extra ref on uarg */
+		sock_zerocopy_get(uarg);
+		if (!(sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG))
+			uarg->zerocopy = 0;
+	}
+
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect)) {
 		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size);
 		if (err == -EINPROGRESS && copied_syn > 0)
@@ -1281,7 +1302,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 			if (err)
 				goto do_fault;
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1319,6 +1340,13 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
+		} else {
+			err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
+			if (err == -EMSGSIZE || err == -EEXIST)
+				goto new_segment;
+			if (err < 0)
+				goto do_error;
+			copy = err;
 		}
 
 		if (!copied)
@@ -1365,6 +1393,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
 	}
 out_nopush:
+	sock_zerocopy_put(uarg);
 	release_sock(sk);
 	return copied + copied_syn;
 
@@ -1382,6 +1411,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	if (copied + copied_syn)
 		goto out;
 out_err:
+	sock_zerocopy_put_abort(uarg);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 &&
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 10/13] udp: enable MSG_ZEROCOPY
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (8 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 09/13] tcp: enable MSG_ZEROCOPY Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 11/13] raw: enable MSG_ZEROCOPY with IP_HDRINCL Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 12/13] packet: enable MSG_ZEROCOPY Willem de Bruijn
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add MSG_ZEROCOPY support to INET(6). This includes UDP, but also
RAW sockets that do not take the raw_send_hdrinc() path.

Zerocopy is only effective when payload is not touched at all. Limit
it to paths that support both checksum offload and scatter-gather.

When a caller passes MSG_ZEROCOPY to send and it returns a positive
result, the caller must always receive a completion notification.
Therefore, attach the structure even when zerocopy is not possible.
Also in edge cases, such as corking with mixed zc/non-zc calls.

Tested:
  msg_zerocopy.sh 4 udp:

  without zerocopy
    tx=146127 (9118 MB) txc=0 zc=n
    rx=146127 (9118 MB)

  with zerocopy
    tx=335789 (20954 MB) txc=335789 zc=y
    rx=335789 (20954 MB)

  msg_zerocopy.sh 4 raw:

  without zerocopy
    tx=106461 (6643 MB) txc=0 zc=n
    rx=106461 (6643 MB)

  with zerocopy
    tx=296082 (18476 MB) txc=296082 zc=y
    rx=296082 (18476 MB)

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c     |  4 ++++
 net/ipv4/ip_output.c  | 37 ++++++++++++++++++++++++++++++-------
 net/ipv6/ip6_output.c | 40 +++++++++++++++++++++++++++++++++-------
 3 files changed, 67 insertions(+), 14 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0beaf961f79c..7d4c12316df6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1145,6 +1145,10 @@ extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 int skb_zerocopy_iter(struct sock *sk, struct sk_buff *skb, struct msghdr *msg,
 		      int len)
 {
+	/* raw has extra indirection in raw_frag_vec */
+	if (sk->sk_type == SOCK_RAW && sk->sk_family != PF_PACKET)
+		msg = *(struct msghdr **)msg;
+
 	return __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7a3fd25e8913..3ff425f7ded6 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -919,7 +919,7 @@ static int __ip_append_data(struct sock *sk,
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
-
+	struct ubuf_info *uarg = NULL;
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -963,9 +963,21 @@ static int __ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+
+		if (!(rt->dst.dev->features & NETIF_F_SG) ||
+		    (sk->sk_type == SOCK_DGRAM && csummode == CHECKSUM_NONE)) {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	cork->length += length;
 	if ((((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
-	    (sk->sk_protocol == IPPROTO_UDP) &&
+	    (sk->sk_protocol == IPPROTO_UDP) && !uarg &&
 	    (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(&rt->dst) &&
 	    (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
 		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
@@ -997,6 +1009,7 @@ static int __ip_append_data(struct sock *sk,
 			unsigned int fraglen;
 			unsigned int fraggap;
 			unsigned int alloclen;
+			unsigned int zcopylen = 0;
 			struct sk_buff *skb_prev;
 alloc_new_skb:
 			skb_prev = skb;
@@ -1017,8 +1030,12 @@ static int __ip_append_data(struct sock *sk,
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
-			else
+			else if (!uarg || !uarg->zerocopy)
 				alloclen = fraglen;
+			else {
+				alloclen = min_t(int, fraglen, MAX_HEADER);
+				zcopylen = fraglen - alloclen;
+			}
 
 			alloclen += exthdrlen;
 
@@ -1059,11 +1076,12 @@ static int __ip_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
 			 */
-			data = skb_put(skb, fraglen + exthdrlen);
+			data = skb_put(skb, fraglen + exthdrlen - zcopylen);
 			skb_set_network_header(skb, exthdrlen);
 			skb->transport_header = (skb->network_header +
 						 fragheaderlen);
@@ -1079,7 +1097,7 @@ static int __ip_append_data(struct sock *sk,
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = datalen - transhdrlen - fraggap - zcopylen;
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
@@ -1087,7 +1105,7 @@ static int __ip_append_data(struct sock *sk,
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
@@ -1115,7 +1133,7 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1145,6 +1163,10 @@ static int __ip_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			atomic_add(copy, &sk->sk_wmem_alloc);
+		} else {
+			err = skb_zerocopy_iter(sk, skb, from, copy);
+			if (err)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1155,6 +1177,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 5baa6fab4b97..38d9722d4e3c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1307,6 +1307,7 @@ static int __ip6_append_data(struct sock *sk,
 	struct ipv6_txoptions *opt = v6_cork->opt;
 	int csummode = CHECKSUM_NONE;
 	unsigned int maxnonfragsize, headersize;
+	struct ubuf_info *uarg = NULL;
 
 	skb = skb_peek_tail(queue);
 	if (!skb) {
@@ -1368,6 +1369,18 @@ static int __ip6_append_data(struct sock *sk,
 			tskey = sk->sk_tskey++;
 	}
 
+	if (flags & MSG_ZEROCOPY && length) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+
+		if (!(rt->dst.dev->features & NETIF_F_SG) ||
+		    (sk->sk_type == SOCK_DGRAM && csummode == CHECKSUM_NONE)) {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	/*
 	 * Let's try using as much space as possible.
 	 * Use MTU if total length of the message fits into the MTU.
@@ -1387,7 +1400,7 @@ static int __ip6_append_data(struct sock *sk,
 	cork->length += length;
 	if ((((length + fragheaderlen) > mtu) ||
 	     (skb && skb_is_gso(skb))) &&
-	    (sk->sk_protocol == IPPROTO_UDP) &&
+	    (sk->sk_protocol == IPPROTO_UDP) && !uarg &&
 	    (rt->dst.dev->features & NETIF_F_UFO) && !dst_xfrm(&rt->dst) &&
 	    (sk->sk_type == SOCK_DGRAM) && !udp_get_no_check6_tx(sk)) {
 		err = ip6_ufo_append_data(sk, queue, getfrag, from, length,
@@ -1413,6 +1426,7 @@ static int __ip6_append_data(struct sock *sk,
 			unsigned int fraglen;
 			unsigned int fraggap;
 			unsigned int alloclen;
+			unsigned int zcopylen = 0;
 alloc_new_skb:
 			/* There's no room in the current skb */
 			if (skb)
@@ -1435,11 +1449,17 @@ static int __ip6_append_data(struct sock *sk,
 
 			if (datalen > (cork->length <= mtu && !(cork->flags & IPCORK_ALLFRAG) ? mtu : maxfraglen) - fragheaderlen)
 				datalen = maxfraglen - fragheaderlen - rt->dst.trailer_len;
+			fraglen = datalen + fragheaderlen;
+
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
-			else
-				alloclen = datalen + fragheaderlen;
+			else if (!uarg || !uarg->zerocopy)
+				alloclen = fraglen;
+			else {
+				alloclen = min_t(int, fraglen, MAX_HEADER);
+				zcopylen = fraglen - alloclen;
+			}
 
 			alloclen += dst_exthdrlen;
 
@@ -1461,7 +1481,7 @@ static int __ip6_append_data(struct sock *sk,
 			 */
 			alloclen += sizeof(struct frag_hdr);
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = datalen - transhdrlen - fraggap - zcopylen;
 			if (copy < 0) {
 				err = -EINVAL;
 				goto error;
@@ -1497,11 +1517,12 @@ static int __ip6_append_data(struct sock *sk,
 			tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes
 			 */
-			data = skb_put(skb, fraglen);
+			data = skb_put(skb, fraglen - zcopylen);
 			skb_set_network_header(skb, exthdrlen);
 			data += fragheaderlen;
 			skb->transport_header = (skb->network_header +
@@ -1524,7 +1545,7 @@ static int __ip6_append_data(struct sock *sk,
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			dst_exthdrlen = 0;
@@ -1552,7 +1573,7 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1582,6 +1603,10 @@ static int __ip6_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			atomic_add(copy, &sk->sk_wmem_alloc);
+		} else {
+			err = skb_zerocopy_iter(sk, skb, from, copy);
+			if (err)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1592,6 +1617,7 @@ static int __ip6_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	return err;
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 11/13] raw: enable MSG_ZEROCOPY with IP_HDRINCL
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (9 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 10/13] udp: " Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  2017-06-21 21:18 ` [PATCH net-next v3 12/13] packet: enable MSG_ZEROCOPY Willem de Bruijn
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Zerocopy support for udp also enables it for some raw sockets. Only
raw sockets that have hdrinc set take a different path. Add zerocopy
support for this variant.

Tested:
  msg_zerocopy.sh 4 raw_hdrincl:

  without zerocopy
    tx=150438 (9390 MB) txc=0 zc=n
    rx=150438 (9387 MB)

  with zerocopy
    tx=292454 (18255 MB) txc=292454 zc=y
    rx=292454 (18250 MB)

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/raw.c | 23 +++++++++++++++++++----
 net/ipv6/raw.c | 20 +++++++++++++++++---
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index bdffad875691..0a5a3f2ce81b 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,7 +351,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	unsigned int iphlen;
 	int err;
 	struct rtable *rt = *rtp;
-	int hlen, tlen;
+	int hlen, tlen, linear;
 
 	if (length > rt->dst.dev->mtu) {
 		ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
@@ -366,8 +366,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	hlen = LL_RESERVED_SPACE(rt->dst.dev);
 	tlen = rt->dst.dev->needed_tailroom;
+	linear = length;
+
+	if (flags & MSG_ZEROCOPY &&
+	    rt->dst.dev->features & NETIF_F_SG)
+		linear = min_t(int, linear, MAX_HEADER);
+
 	skb = sock_alloc_send_skb(sk,
-				  length + hlen + tlen + 15,
+				  linear + hlen + tlen + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (!skb)
 		goto error;
@@ -380,7 +386,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
-	skb_put(skb, length);
+	skb_put(skb, linear);
 
 	skb->ip_summed = CHECKSUM_NONE;
 
@@ -391,7 +397,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb->transport_header = skb->network_header;
 	err = -EFAULT;
-	if (memcpy_from_msg(iph, msg, length))
+	if (memcpy_from_msg(iph, msg, linear))
 		goto error_free;
 
 	iphlen = iph->ihl * 4;
@@ -423,6 +429,13 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 				skb_transport_header(skb))->type);
 	}
 
+	if (flags & MSG_ZEROCOPY) {
+		err = skb_zerocopy_iter_alloc(skb, (void *)&msg,
+					      length - linear);
+		if (err)
+			goto error_zcopy;
+	}
+
 	err = NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
 		      net, sk, skb, NULL, rt->dst.dev,
 		      dst_output);
@@ -433,6 +446,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 out:
 	return 0;
 
+error_zcopy:
+	skb_zcopy_abort(skb);
 error_free:
 	kfree_skb(skb);
 error:
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 60be012fe708..206cca2d9b29 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -627,6 +627,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 	struct rt6_info *rt = (struct rt6_info *)*dstp;
 	int hlen = LL_RESERVED_SPACE(rt->dst.dev);
 	int tlen = rt->dst.dev->needed_tailroom;
+	int linear = length;
 
 	if (length > rt->dst.dev->mtu) {
 		ipv6_local_error(sk, EMSGSIZE, fl6, rt->dst.dev->mtu);
@@ -637,8 +638,12 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 	if (flags&MSG_PROBE)
 		goto out;
 
+	if (flags & MSG_ZEROCOPY &&
+	    rt->dst.dev->features & NETIF_F_SG)
+		linear = min_t(int, length, MAX_HEADER);
+
 	skb = sock_alloc_send_skb(sk,
-				  length + hlen + tlen + 15,
+				  linear + hlen + tlen + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (!skb)
 		goto error;
@@ -650,7 +655,7 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 	skb_dst_set(skb, &rt->dst);
 	*dstp = NULL;
 
-	skb_put(skb, length);
+	skb_put(skb, linear);
 	skb_reset_network_header(skb);
 	iph = ipv6_hdr(skb);
 
@@ -660,10 +665,17 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 		skb_set_dst_pending_confirm(skb, 1);
 
 	skb->transport_header = skb->network_header;
-	err = memcpy_from_msg(iph, msg, length);
+	err = memcpy_from_msg(iph, msg, linear);
 	if (err)
 		goto error_fault;
 
+	if (flags & MSG_ZEROCOPY) {
+		err = skb_zerocopy_iter_alloc(skb, (void *)&msg,
+					      length - linear);
+		if (err)
+			goto error_zcopy;
+	}
+
 	/* if egress device is enslaved to an L3 master device pass the
 	 * skb to its handler for processing
 	 */
@@ -681,6 +693,8 @@ static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 out:
 	return 0;
 
+error_zcopy:
+	skb_zcopy_abort(skb);
 error_fault:
 	err = -EFAULT;
 	kfree_skb(skb);
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 12/13] packet: enable MSG_ZEROCOPY
  2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (10 preceding siblings ...)
  2017-06-21 21:18 ` [PATCH net-next v3 11/13] raw: enable MSG_ZEROCOPY with IP_HDRINCL Willem de Bruijn
@ 2017-06-21 21:18 ` Willem de Bruijn
  11 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-api, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Support MSG_ZEROCOPY on PF_PACKET transmission.

Tested:

  msg_zerocopy.sh 4 packet:

  without zerocopy
    tx=121543 (7588 MB) txc=0 zc=n
    rx=121543 (7584 MB)

  with zerocopy
    tx=338252 (21119 MB) txc=338252 zc=y
    rx=338252 (21108 MB)

  msg_zerocopy.sh 4 packet_dgram:

  without zerocopy
    tx=122699 (7659 MB) txc=0 zc=n
    rx=122699 (7656 MB)

  with zerocopy
    tx=340390 (21248 MB) txc=340390 zc=y
    rx=340390 (21241 MB)

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f9349a495caf..a04791fd81f2 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2795,30 +2795,39 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 				        size_t reserve, size_t len,
-				        size_t linear, int noblock,
+					size_t linear, int flags,
 				        int *err)
 {
 	struct sk_buff *skb;
+	size_t data_len;
 
-	/* Under a page?  Don't bother with paged skb. */
-	if (prepad + len < PAGE_SIZE || !linear)
-		linear = len;
+	if (flags & MSG_ZEROCOPY) {
+		/* Minimize linear, but respect header lower bound */
+		linear = reserve + min(len, max_t(size_t, linear, MAX_HEADER));
+		data_len = 0;
+	} else {
+		/* Under a page? Don't bother with paged skb. */
+		if (prepad + len < PAGE_SIZE || !linear)
+			linear = len;
+		data_len = len - linear;
+	}
 
-	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-				   err, 0);
+	skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
+				   flags & MSG_DONTWAIT, err, 0);
 	if (!skb)
 		return NULL;
 
 	skb_reserve(skb, reserve);
 	skb_put(skb, linear);
-	skb->data_len = len - linear;
-	skb->len += len - linear;
+	skb->data_len = data_len;
+	skb->len += data_len;
 
 	return skb;
 }
 
 static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
+	struct iov_iter *from = &msg->msg_iter;
 	struct sock *sk = sock->sk;
 	DECLARE_SOCKADDR(struct sockaddr_ll *, saddr, msg->msg_name);
 	struct sk_buff *skb;
@@ -2894,7 +2903,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len);
 	linear = max(linear, min_t(int, len, dev->hard_header_len));
 	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear,
-			       msg->msg_flags & MSG_DONTWAIT, &err);
+			       msg->msg_flags, &err);
 	if (skb == NULL)
 		goto out_unlock;
 
@@ -2908,10 +2917,16 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	/* Returns -EFAULT on error */
-	err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);
+	err = skb_copy_datagram_from_iter(skb, offset, from, skb->len - offset);
 	if (err)
 		goto out_free;
 
+	if (msg->msg_flags & MSG_ZEROCOPY && len) {
+		err = skb_zerocopy_iter_alloc(skb, msg, iov_iter_count(from));
+		if (err)
+			goto out_free;
+	}
+
 	if (sock->type == SOCK_RAW &&
 	    !dev_validate_header(dev, skb->data, len)) {
 		err = -EINVAL;
@@ -2954,6 +2969,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	return len;
 
 out_free:
+	skb_zcopy_abort(skb);
 	kfree_skb(skb);
 out_unlock:
 	if (dev)
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH net-next v3 13/13] test: add msg_zerocopy test
       [not found] ` <20170621211816.53837-1-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2017-06-21 21:18   ` [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
@ 2017-06-21 21:18   ` Willem de Bruijn
  1 sibling, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-21 21:18 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: davem-fT/PcQaiUtIeIZ0/mPfg9Q, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Willem de Bruijn

From: Willem de Bruijn <willemb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Introduce regression test for msg_zerocopy feature. Send traffic from
one process to another with and without zerocopy.

Evaluate tcp, udp, raw and packet sockets, including variants
- udp: corking and corking with mixed copy/zerocopy calls
- raw: with and without hdrincl
- packet: at both raw and dgram level

Test on both ipv4 and ipv6, optionally with ethtool changes to
disable scatter-gather, tx checksum or tso offload. All of these
can affect zerocopy behavior.

The regression test can be run on a single machine if over a veth
pair. Then skb_orphan_frags_rx must be modified to be identical to
skb_orphan_frags to allow forwarding zerocopy locally.

The msg_zerocopy.sh script will setup the veth pair in network
namespaces and run all tests.

Signed-off-by: Willem de Bruijn <willemb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 tools/testing/selftests/net/.gitignore      |   1 +
 tools/testing/selftests/net/Makefile        |   2 +-
 tools/testing/selftests/net/msg_zerocopy.c  | 697 ++++++++++++++++++++++++++++
 tools/testing/selftests/net/msg_zerocopy.sh | 113 +++++
 4 files changed, 812 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/msg_zerocopy.c
 create mode 100755 tools/testing/selftests/net/msg_zerocopy.sh

diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore
index afe109e5508a..9801253e4802 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -1,3 +1,4 @@
+msg_zerocopy
 socket
 psock_fanout
 psock_tpacket
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 35cbb4cba410..ddb7146202a8 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -9,7 +9,7 @@ TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket
 TEST_GEN_FILES += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
-TEST_GEN_FILES += reuseport_dualstack
+TEST_GEN_FILES += reuseport_dualstack msg_zerocopy
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
new file mode 100644
index 000000000000..f74e962589e6
--- /dev/null
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -0,0 +1,697 @@
+/* Evaluate MSG_ZEROCOPY
+ *
+ * Send traffic between two processes over one of the supported
+ * protocols and modes:
+ *
+ * PF_INET/PF_INET6
+ * - SOCK_STREAM
+ * - SOCK_DGRAM
+ * - SOCK_DGRAM with UDP_CORK
+ * - SOCK_RAW
+ * - SOCK_RAW with IP_HDRINCL
+ *
+ * PF_PACKET
+ * - SOCK_DGRAM
+ * - SOCK_RAW
+ *
+ * Start this program on two connected hosts, one in send mode and
+ * the other with option '-r' to put it in receiver mode.
+ *
+ * If zerocopy mode ('-z') is enabled, the sender will verify that
+ * the kernel queues completions on the error queue for all zerocopy
+ * transfers.
+ */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <error.h>
+#include <errno.h>
+#include <limits.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/ipv6.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#ifndef SO_EE_ORIGIN_ZEROCOPY
+#define SO_EE_ORIGIN_ZEROCOPY		SO_EE_ORIGIN_UPAGE
+#endif
+
+#ifndef SO_ZEROCOPY
+#define SO_ZEROCOPY	59
+#endif
+
+#ifndef SO_EE_CODE_ZEROCOPY_COPIED
+#define SO_EE_CODE_ZEROCOPY_COPIED	1
+#endif
+
+#ifndef MSG_ZEROCOPY
+#define MSG_ZEROCOPY	0x4000000
+#endif
+
+static int  cfg_cork;
+static bool cfg_cork_mixed;
+static int  cfg_cpu		= -1;		/* default: pin to last cpu */
+static int  cfg_family		= PF_UNSPEC;
+static int  cfg_ifindex		= 1;
+static int  cfg_payload_len;
+static int  cfg_port		= 8000;
+static bool cfg_rx;
+static int  cfg_runtime_ms	= 4200;
+static int  cfg_verbose;
+static int  cfg_waittime_ms	= 500;
+static bool cfg_zerocopy;
+
+static socklen_t cfg_alen;
+static struct sockaddr_storage cfg_dst_addr;
+static struct sockaddr_storage cfg_src_addr;
+
+static char payload[IP_MAXPACKET];
+static long packets, bytes, completions, expected_completions;
+static int  zerocopied = -1;
+static uint32_t next_completion;
+
+static unsigned long gettimeofday_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static uint16_t get_ip_csum(const uint16_t *start, int num_words)
+{
+	unsigned long sum = 0;
+	int i;
+
+	for (i = 0; i < num_words; i++)
+		sum += start[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xFFFF) + (sum >> 16);
+
+	return ~sum;
+}
+
+static int do_setcpu(int cpu)
+{
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(cpu, &mask);
+	if (sched_setaffinity(0, sizeof(mask), &mask))
+		error(1, 0, "setaffinity %d", cpu);
+
+	if (cfg_verbose)
+		fprintf(stderr, "cpu: %u\n", cpu);
+
+	return 0;
+}
+
+static void do_setsockopt(int fd, int level, int optname, int val)
+{
+	if (setsockopt(fd, level, optname, &val, sizeof(val)))
+		error(1, errno, "setsockopt %d.%d: %d", level, optname, val);
+}
+
+static int do_poll(int fd, int events)
+{
+	struct pollfd pfd;
+	int ret;
+
+	pfd.events = events;
+	pfd.revents = 0;
+	pfd.fd = fd;
+
+	ret = poll(&pfd, 1, cfg_waittime_ms);
+	if (ret == -1)
+		error(1, errno, "poll");
+
+	return ret && (pfd.revents & events);
+}
+
+static int do_accept(int fd)
+{
+	int fda = fd;
+
+	fd = accept(fda, NULL, NULL);
+	if (fd == -1)
+		error(1, errno, "accept");
+	if (close(fda))
+		error(1, errno, "close listen sock");
+
+	return fd;
+}
+
+static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy)
+{
+	int ret, len, i, flags;
+
+	len = 0;
+	for (i = 0; i < msg->msg_iovlen; i++)
+		len += msg->msg_iov[i].iov_len;
+
+	flags = MSG_DONTWAIT;
+	if (do_zerocopy)
+		flags |= MSG_ZEROCOPY;
+
+	ret = sendmsg(fd, msg, flags);
+	if (ret == -1 && errno == EAGAIN)
+		return false;
+	if (ret == -1)
+		error(1, errno, "send");
+	if (cfg_verbose && ret != len)
+		fprintf(stderr, "send: ret=%u != %u\n", ret, len);
+
+	if (len) {
+		packets++;
+		bytes += ret;
+		if (do_zerocopy && ret)
+			expected_completions++;
+	}
+
+	return true;
+}
+
+static void do_sendmsg_corked(int fd, struct msghdr *msg)
+{
+	bool do_zerocopy = cfg_zerocopy;
+	int i, payload_len, extra_len;
+
+	/* split up the packet. for non-multiple, make first buffer longer */
+	payload_len = cfg_payload_len / cfg_cork;
+	extra_len = cfg_payload_len - (cfg_cork * payload_len);
+
+	do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 1);
+
+	for (i = 0; i < cfg_cork; i++) {
+
+		/* in mixed-frags mode, alternate zerocopy and copy frags
+		 * start with non-zerocopy, to ensure attach later works
+		 */
+		if (cfg_cork_mixed)
+			do_zerocopy = (i & 1);
+
+		msg->msg_iov[0].iov_len = payload_len + extra_len;
+		extra_len = 0;
+
+		do_sendmsg(fd, msg, do_zerocopy);
+	}
+
+	do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0);
+}
+
+static int setup_iph(struct iphdr *iph, uint16_t payload_len)
+{
+	struct sockaddr_in *daddr = (void *) &cfg_dst_addr;
+	struct sockaddr_in *saddr = (void *) &cfg_src_addr;
+
+	memset(iph, 0, sizeof(*iph));
+
+	iph->version	= 4;
+	iph->tos	= 0;
+	iph->ihl	= 5;
+	iph->ttl	= 2;
+	iph->saddr	= saddr->sin_addr.s_addr;
+	iph->daddr	= daddr->sin_addr.s_addr;
+	iph->protocol	= IPPROTO_EGP;
+	iph->tot_len	= htons(sizeof(*iph) + payload_len);
+	iph->check	= get_ip_csum((void *) iph, iph->ihl << 1);
+
+	return sizeof(*iph);
+}
+
+static int setup_ip6h(struct ipv6hdr *ip6h, uint16_t payload_len)
+{
+	struct sockaddr_in6 *daddr = (void *) &cfg_dst_addr;
+	struct sockaddr_in6 *saddr = (void *) &cfg_src_addr;
+
+	memset(ip6h, 0, sizeof(*ip6h));
+
+	ip6h->version		= 6;
+	ip6h->payload_len	= htons(payload_len);
+	ip6h->nexthdr		= IPPROTO_EGP;
+	ip6h->hop_limit		= 2;
+	ip6h->saddr		= saddr->sin6_addr;
+	ip6h->daddr		= daddr->sin6_addr;
+
+	return sizeof(*ip6h);
+}
+
+static void setup_sockaddr(int domain, const char *str_addr, void *sockaddr)
+{
+	struct sockaddr_in *addr4 = (void *) sockaddr;
+	struct sockaddr_in6 *addr6 = (void *) sockaddr;
+
+	switch (domain) {
+	case PF_INET:
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(cfg_port);
+		if (inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
+			error(1, 0, "ipv4 parse error: %s", str_addr);
+		break;
+	case PF_INET6:
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(cfg_port);
+		if (inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
+			error(1, 0, "ipv6 parse error: %s", str_addr);
+		break;
+	default:
+		error(1, 0, "illegal domain");
+	}
+}
+
+static int do_setup_tx(int domain, int type, int protocol)
+{
+	int fd;
+
+	fd = socket(domain, type, protocol);
+	if (fd == -1)
+		error(1, errno, "socket t");
+
+	do_setsockopt(fd, SOL_SOCKET, SO_SNDBUF, 1 << 21);
+	if (cfg_zerocopy)
+		do_setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, 1);
+
+	if (domain != PF_PACKET)
+		if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
+			error(1, errno, "connect");
+
+	return fd;
+}
+
+static bool do_recv_completion(int fd)
+{
+	struct sock_extended_err *serr;
+	struct cmsghdr *cm;
+	struct msghdr msg = {};
+	char control[100];
+	uint32_t hi, lo, range;
+	int ret, zerocopy;
+
+	msg.msg_control = control;
+	msg.msg_controllen = sizeof(control);
+
+	ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+	if (ret == -1 && errno == EAGAIN)
+		return false;
+	if (ret == -1)
+		error(1, errno, "recvmsg notification");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, errno, "recvmsg notification: truncated");
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!cm)
+		error(1, 0, "cmsg: no cmsg");
+	if (!((cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
+	      (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR) ||
+	      (cm->cmsg_level == SOL_PACKET && cm->cmsg_type == PACKET_TX_TIMESTAMP)))
+		error(1, 0, "serr: wrong type: %d.%d",
+		      cm->cmsg_level, cm->cmsg_type);
+
+	serr = (void *) CMSG_DATA(cm);
+	if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+		error(1, 0, "serr: wrong origin: %u", serr->ee_origin);
+	if (serr->ee_errno != 0)
+		error(1, 0, "serr: wrong error code: %u", serr->ee_errno);
+
+	hi = serr->ee_data;
+	lo = serr->ee_info;
+	range = hi - lo + 1;
+
+	/* Detect notification gaps. These should not happen often, if at all.
+	 * Gaps can occur due to drops, reordering and retransmissions.
+	 */
+	if (lo != next_completion)
+		fprintf(stderr, "gap: %u..%u does not append to %u\n",
+			lo, hi, next_completion);
+	next_completion = hi + 1;
+
+	zerocopy = !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED);
+	if (zerocopied == -1)
+		zerocopied = zerocopy;
+	else if (zerocopied != zerocopy) {
+		fprintf(stderr, "serr: inconsistent\n");
+		zerocopied = zerocopy;
+	}
+
+	if (cfg_verbose >= 2)
+		fprintf(stderr, "completed: %u (h=%u l=%u)\n",
+			range, hi, lo);
+
+	completions += range;
+	return true;
+}
+
+/* Read all outstanding messages on the errqueue */
+static void do_recv_completions(int fd)
+{
+	while (do_recv_completion(fd)) {}
+}
+
+/* Wait for all remaining completions on the errqueue */
+static void do_recv_remaining_completions(int fd)
+{
+	int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
+
+	while (completions < expected_completions &&
+	       gettimeofday_ms() < tstop) {
+		if (do_poll(fd, POLLERR))
+			do_recv_completions(fd);
+	}
+
+	if (completions < expected_completions)
+		error(1, 0, "missing notifications: %lu < %lu\n",
+		      completions, expected_completions);
+}
+
+static void do_tx(int domain, int type, int protocol)
+{
+	struct sockaddr_ll laddr;
+	struct msghdr msg = {0};
+	struct iovec iov[3] = { {0} };
+	struct ethhdr eth;
+	union {
+		struct ipv6hdr ip6h;
+		struct iphdr iph;
+	} nh;
+	uint64_t tstop;
+	int fd;
+
+	fd = do_setup_tx(domain, type, protocol);
+
+	if (domain == PF_PACKET) {
+		uint16_t proto = cfg_family == PF_INET ? ETH_P_IP : ETH_P_IPV6;
+
+		/* sock_raw passes ll header as data */
+		if (type == SOCK_RAW) {
+			memset(eth.h_dest, 0x06, ETH_ALEN);
+			memset(eth.h_source, 0x02, ETH_ALEN);
+			eth.h_proto = htons(proto);
+			iov[0].iov_base = &eth;
+			iov[0].iov_len = sizeof(eth);
+			msg.msg_iovlen++;
+		}
+
+		/* both sock_raw and sock_dgram expect name */
+		memset(&laddr, 0, sizeof(laddr));
+		laddr.sll_family	= AF_PACKET;
+		laddr.sll_ifindex	= cfg_ifindex;
+		laddr.sll_protocol	= htons(proto);
+		laddr.sll_halen		= ETH_ALEN;
+
+		memset(laddr.sll_addr, 0x06, ETH_ALEN);
+
+		msg.msg_name		= &laddr;
+		msg.msg_namelen		= sizeof(laddr);
+	}
+
+	/* packet and raw sockets with hdrincl must pass network header */
+	if (domain == PF_PACKET || protocol == IPPROTO_RAW) {
+		if (cfg_family == PF_INET)
+			iov[1].iov_len = setup_iph(&nh.iph, cfg_payload_len);
+		else
+			iov[1].iov_len = setup_ip6h(&nh.ip6h, cfg_payload_len);
+
+		iov[1].iov_base = (void *) &nh;
+		msg.msg_iovlen++;
+	}
+
+	iov[2].iov_base = payload;
+	iov[2].iov_len = cfg_payload_len;
+	msg.msg_iovlen++;
+	msg.msg_iov = &iov[3 - msg.msg_iovlen];
+
+	tstop = gettimeofday_ms() + cfg_runtime_ms;
+	do {
+		if (cfg_cork)
+			do_sendmsg_corked(fd, &msg);
+		else
+			do_sendmsg(fd, &msg, cfg_zerocopy);
+
+		while (!do_poll(fd, POLLOUT)) {
+			if (cfg_zerocopy)
+				do_recv_completions(fd);
+		}
+
+	} while (gettimeofday_ms() < tstop);
+
+	if (cfg_zerocopy)
+		do_recv_remaining_completions(fd);
+
+	if (close(fd))
+		error(1, errno, "close");
+
+	fprintf(stderr, "tx=%lu (%lu MB) txc=%lu zc=%c\n",
+		packets, bytes >> 20, completions,
+		zerocopied == 1 ? 'y' : 'n');
+}
+
+static int do_setup_rx(int domain, int type, int protocol)
+{
+	int fd;
+
+	/* If tx over PF_PACKET, rx over PF_INET(6)/SOCK_RAW,
+	 * to recv the only copy of the packet, not a clone
+	 */
+	if (domain == PF_PACKET)
+		error(1, 0, "Use PF_INET/SOCK_RAW to read");
+
+	if (type == SOCK_RAW && protocol == IPPROTO_RAW)
+		error(1, 0, "IPPROTO_RAW: not supported on Rx");
+
+	fd = socket(domain, type, protocol);
+	if (fd == -1)
+		error(1, errno, "socket r");
+
+	do_setsockopt(fd, SOL_SOCKET, SO_RCVBUF, 1 << 21);
+	do_setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, 1 << 16);
+	do_setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, 1);
+
+	if (bind(fd, (void *) &cfg_dst_addr, cfg_alen))
+		error(1, errno, "bind");
+
+	if (type == SOCK_STREAM) {
+		if (listen(fd, 1))
+			error(1, errno, "listen");
+		fd = do_accept(fd);
+	}
+
+	return fd;
+}
+
+/* Flush all outstanding bytes for the tcp receive queue */
+static void do_flush_tcp(int fd)
+{
+	int ret;
+
+	/* MSG_TRUNC flushes up to len bytes */
+	ret = recv(fd, NULL, 1 << 21, MSG_TRUNC | MSG_DONTWAIT);
+	if (ret == -1 && errno == EAGAIN)
+		return;
+	if (ret == -1)
+		error(1, errno, "flush");
+	if (!ret)
+		return;
+
+	packets++;
+	bytes += ret;
+}
+
+/* Flush all outstanding datagrams. Verify first few bytes of each. */
+static void do_flush_datagram(int fd, int type)
+{
+	char buf[64];
+	int ret, off = 0;
+
+	/* MSG_TRUNC will return full datagram length */
+	ret = recv(fd, buf, sizeof(buf), MSG_DONTWAIT | MSG_TRUNC);
+	if (ret == -1 && errno == EAGAIN)
+		return;
+
+	/* raw ipv4 return with header, raw ipv6 without */
+	if (cfg_family == PF_INET && type == SOCK_RAW) {
+		off += sizeof(struct iphdr);
+		ret -= sizeof(struct iphdr);
+	}
+
+	if (ret == -1)
+		error(1, errno, "recv");
+	if (ret != cfg_payload_len)
+		error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len);
+	if (ret > sizeof(buf) - off)
+		ret = sizeof(buf) - off;
+	if (memcmp(buf + off, payload, ret))
+		error(1, 0, "recv: data mismatch");
+
+	packets++;
+	bytes += cfg_payload_len;
+}
+
+static void do_rx(int domain, int type, int protocol)
+{
+	uint64_t tstop;
+	int fd;
+
+	fd = do_setup_rx(domain, type, protocol);
+
+	tstop = gettimeofday_ms() + cfg_runtime_ms;
+	do {
+		if (type == SOCK_STREAM)
+			do_flush_tcp(fd);
+		else
+			do_flush_datagram(fd, type);
+
+		do_poll(fd, POLLIN);
+
+	} while (gettimeofday_ms() < tstop);
+
+	if (close(fd))
+		error(1, errno, "close");
+
+	fprintf(stderr, "rx=%lu (%lu MB)\n", packets, bytes >> 20);
+}
+
+static void do_test(int domain, int type, int protocol)
+{
+	int i;
+
+	if (cfg_cork && (domain == PF_PACKET || type != SOCK_DGRAM))
+		error(1, 0, "can only cork udp sockets");
+
+	do_setcpu(cfg_cpu);
+
+	for (i = 0; i < IP_MAXPACKET; i++)
+		payload[i] = 'a' + (i % 26);
+
+	if (cfg_rx)
+		do_rx(domain, type, protocol);
+	else
+		do_tx(domain, type, protocol);
+}
+
+static void usage(const char *filepath)
+{
+	error(1, 0, "Usage: %s [options] <test>", filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	const int max_payload_len = sizeof(payload) -
+				    sizeof(struct ipv6hdr) -
+				    sizeof(struct tcphdr) -
+				    40 /* max tcp options */;
+	int c;
+
+	cfg_payload_len = max_payload_len;
+
+	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vz")) != -1) {
+		switch (c) {
+		case '4':
+			if (cfg_family != PF_UNSPEC)
+				error(1, 0, "Pass one of -4 or -6");
+			cfg_family = PF_INET;
+			cfg_alen = sizeof(struct sockaddr_in);
+			break;
+		case '6':
+			if (cfg_family != PF_UNSPEC)
+				error(1, 0, "Pass one of -4 or -6");
+			cfg_family = PF_INET6;
+			cfg_alen = sizeof(struct sockaddr_in6);
+			break;
+		case 'c':
+			cfg_cork = strtol(optarg, NULL, 0);
+			break;
+		case 'C':
+			cfg_cpu = strtol(optarg, NULL, 0);
+			break;
+		case 'D':
+			setup_sockaddr(cfg_family, optarg, &cfg_dst_addr);
+			break;
+		case 'i':
+			cfg_ifindex = if_nametoindex(optarg);
+			if (cfg_ifindex == 0)
+				error(1, errno, "invalid iface: %s", optarg);
+			break;
+		case 'm':
+			cfg_cork_mixed = true;
+			break;
+		case 'p':
+			cfg_port = htons(strtoul(optarg, NULL, 0));
+			break;
+		case 'r':
+			cfg_rx = true;
+			break;
+		case 's':
+			cfg_payload_len = strtoul(optarg, NULL, 0);
+			break;
+		case 'S':
+			setup_sockaddr(cfg_family, optarg, &cfg_src_addr);
+			break;
+		case 't':
+			cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
+			break;
+		case 'v':
+			cfg_verbose++;
+			break;
+		case 'z':
+			cfg_zerocopy = true;
+			break;
+		}
+	}
+
+	if (cfg_payload_len > max_payload_len)
+		error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);
+	if (cfg_cork_mixed && (!cfg_zerocopy || !cfg_cork))
+		error(1, 0, "-m: cork_mixed requires corking and zerocopy");
+
+	if (optind != argc - 1)
+		usage(argv[0]);
+}
+
+int main(int argc, char **argv)
+{
+	const char *cfg_test;
+
+	parse_opts(argc, argv);
+
+	cfg_test = argv[argc - 1];
+
+	if (!strcmp(cfg_test, "packet"))
+		do_test(PF_PACKET, SOCK_RAW, 0);
+	else if (!strcmp(cfg_test, "packet_dgram"))
+		do_test(PF_PACKET, SOCK_DGRAM, 0);
+	else if (!strcmp(cfg_test, "raw"))
+		do_test(cfg_family, SOCK_RAW, IPPROTO_EGP);
+	else if (!strcmp(cfg_test, "raw_hdrincl"))
+		do_test(cfg_family, SOCK_RAW, IPPROTO_RAW);
+	else if (!strcmp(cfg_test, "tcp"))
+		do_test(cfg_family, SOCK_STREAM, 0);
+	else if (!strcmp(cfg_test, "udp"))
+		do_test(cfg_family, SOCK_DGRAM, 0);
+	else
+		error(1, 0, "unknown cfg_test %s", cfg_test);
+
+	return 0;
+}
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh
new file mode 100755
index 000000000000..86955e9a6a7c
--- /dev/null
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -0,0 +1,113 @@
+#!/bin/bash
+#
+# Send data between two processes across namespaces
+# Run twice: once without and once with zerocopy
+
+set -e
+
+readonly DEV="veth0"
+readonly DEV_MTU=65535
+readonly BIN="./msg_zerocopy"
+
+readonly RAND="$(mktemp -u XXXXXX)"
+readonly NSPREFIX="ns-${RAND}"
+readonly NS1="${NSPREFIX}1"
+readonly NS2="${NSPREFIX}2"
+
+readonly SADDR4='192.168.1.1'
+readonly DADDR4='192.168.1.2'
+readonly SADDR6='fd::1'
+readonly DADDR6='fd::2'
+
+readonly path_sysctl_mem="net.core.optmem_max"
+
+# Argument parsing
+if [[ "$#" -lt "2" ]]; then
+	echo "Usage: $0 [4|6] [tcp|udp|raw|raw_hdrincl|packet|packet_dgram] <args>"
+	exit 1
+fi
+
+readonly IP="$1"
+shift
+readonly TXMODE="$1"
+shift
+readonly EXTRA_ARGS="$@"
+
+# Argument parsing: configure addresses
+if [[ "${IP}" == "4" ]]; then
+	readonly SADDR="${SADDR4}"
+	readonly DADDR="${DADDR4}"
+elif [[ "${IP}" == "6" ]]; then
+	readonly SADDR="${SADDR6}"
+	readonly DADDR="${DADDR6}"
+else
+	echo "Invalid IP version ${IP}"
+	exit 1
+fi
+
+# Argument parsing: select receive mode
+#
+# This differs from send mode for
+# - packet:	use raw recv, because packet receives skb clones
+# - raw_hdrinc: use raw recv, because hdrincl is a tx-only option
+case "${TXMODE}" in
+'packet' | 'packet_dgram' | 'raw_hdrincl')
+	RXMODE='raw'
+	;;
+*)
+	RXMODE="${TXMODE}"
+	;;
+esac
+
+# Start of state changes: install cleanup handler
+save_sysctl_mem="$(sysctl -n ${path_sysctl_mem})"
+
+cleanup() {
+	ip netns del "${NS2}"
+	ip netns del "${NS1}"
+	sysctl -w -q "${path_sysctl_mem}=${save_sysctl_mem}"
+}
+
+trap cleanup EXIT
+
+# Configure system settings
+sysctl -w -q "${path_sysctl_mem}=1000000"
+
+# Create virtual ethernet pair between network namespaces
+ip netns add "${NS1}"
+ip netns add "${NS2}"
+
+ip link add "${DEV}" mtu "${DEV_MTU}" netns "${NS1}" type veth \
+  peer name "${DEV}" mtu "${DEV_MTU}" netns "${NS2}"
+
+# Bring the devices up
+ip -netns "${NS1}" link set "${DEV}" up
+ip -netns "${NS2}" link set "${DEV}" up
+
+# Set fixed MAC addresses on the devices
+ip -netns "${NS1}" link set dev "${DEV}" address 02:02:02:02:02:02
+ip -netns "${NS2}" link set dev "${DEV}" address 06:06:06:06:06:06
+
+# Add fixed IP addresses to the devices
+ip -netns "${NS1}" addr add 192.168.1.1/24 dev "${DEV}"
+ip -netns "${NS2}" addr add 192.168.1.2/24 dev "${DEV}"
+ip -netns "${NS1}" addr add       fd::1/64 dev "${DEV}" nodad
+ip -netns "${NS2}" addr add       fd::2/64 dev "${DEV}" nodad
+
+# Optionally disable sg or csum offload to test edge cases
+# ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
+
+do_test() {
+	local readonly ARGS="$1"
+
+	echo "ipv${IP} ${TXMODE} ${ARGS}"
+	ip netns exec "${NS2}" "${BIN}" "-${IP}" -i "${DEV}" -t 2 -C 2 -S "${SADDR}" -D "${DADDR}" ${ARGS} -r "${RXMODE}" &
+	sleep 0.2
+	ip netns exec "${NS1}" "${BIN}" "-${IP}" -i "${DEV}" -t 1 -C 3 -S "${SADDR}" -D "${DADDR}" ${ARGS} "${TXMODE}"
+	wait
+}
+
+do_test "${EXTRA_ARGS}"
+do_test "-z ${EXTRA_ARGS}"
+echo ok
+
-- 
2.13.1.611.g7e3b11ae1-goog

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
       [not found]     ` <20170621211816.53837-3-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-06-22 17:05       ` David Miller
       [not found]         ` <20170622.130528.1762873686654379973.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2017-06-22 17:05 UTC (permalink / raw)
  To: willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	willemb-hpIqsD4AKlfQT0dZR+AlfA

From: Willem de Bruijn <willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Wed, 21 Jun 2017 17:18:05 -0400

> @@ -958,15 +958,20 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   */
>  int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  {
> -	int i;
>  	int num_frags = skb_shinfo(skb)->nr_frags;
>  	struct page *page, *head = NULL;
>  	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
> +	int i, new_frags;
> +	u32 d_off;

Reverse christmas tree with the local variable declarations.

> +	page = head;
> +	d_off = 0;
> +	for (i = 0; i < num_frags; i++) {
> +		u8 *vaddr;
> +		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> +		u32 f_off, f_size, copy;

Likewise.

> +		f_off = f->page_offset;
> +		f_size = f->size;
> +
> +		vaddr = kmap_atomic(skb_frag_page(f));

I looked at some kmap_atomic() implementations and I do not think
it supports compound pages.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 03/13] sock: add MSG_ZEROCOPY
       [not found]   ` <20170621211816.53837-4-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-06-22 17:06     ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2017-06-22 17:06 UTC (permalink / raw)
  To: willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	willemb-hpIqsD4AKlfQT0dZR+AlfA

From: Willem de Bruijn <willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Wed, 21 Jun 2017 17:18:06 -0400

> +struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
> +{
> +	struct sk_buff *skb;
> +	struct ubuf_info *uarg;

Reverse christmas tree.

> +void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
> +{
> +	struct sock_exterr_skb *serr;
> +	struct sk_buff *skb = skb_from_uarg(uarg);
> +	struct sock *sk = skb->sk;
> +	u16 id = uarg->desc;

Likewise.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 06/13] sock: MSG_ZEROCOPY notification coalescing
       [not found]   ` <20170621211816.53837-7-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-06-22 17:07     ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2017-06-22 17:07 UTC (permalink / raw)
  To: willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	willemb-hpIqsD4AKlfQT0dZR+AlfA

From: Willem de Bruijn <willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Wed, 21 Jun 2017 17:18:09 -0400

> +static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
> +{
> +	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
> +	u64 sum_len;
> +	u32 old_lo, old_hi;

Reverse christmas tree.

>  void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
>  {
>  	struct sock_exterr_skb *serr;
> -	struct sk_buff *skb = skb_from_uarg(uarg);
> +	struct sk_buff *tail, *skb = skb_from_uarg(uarg);
>  	struct sock *sk = skb->sk;
> -	u16 id = uarg->desc;
> +	struct sk_buff_head *q = &sk->sk_error_queue;
> +	unsigned long flags;
> +	u32 lo, hi;
> +	u16 len;

Likewise.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
       [not found]         ` <20170622.130528.1762873686654379973.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
@ 2017-06-22 20:57           ` Willem de Bruijn
       [not found]             ` <CAF=yD-+WQ1b9BytG4JuhgHCR8gAKvUVJPUfi4t-u_DqAFnVrkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-22 20:57 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Linux API, Willem de Bruijn

>
> Likewise.
>
>> +             f_off = f->page_offset;
>> +             f_size = f->size;
>> +
>> +             vaddr = kmap_atomic(skb_frag_page(f));
>
> I looked at some kmap_atomic() implementations and I do not think
> it supports compound pages.

Indeed. Thanks. It appears that I can do the obvious thing and
kmap the individual page that is being copied inside the loop:

  kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));

This is similar to existing logic in copy_huge_page_from_user
and __flush_dcache_page in arch/arm/mm/flush.c

But, this also applies to other skb operations that call kmap_atomic,
such as skb_copy_bits and __skb_checksum. Not all can be called
from a codepath with a compound user page, but I have to address
the ones that can.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
       [not found]             ` <CAF=yD-+WQ1b9BytG4JuhgHCR8gAKvUVJPUfi4t-u_DqAFnVrkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-23  1:36               ` David Miller
  2017-06-23  3:59                 ` Willem de Bruijn
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2017-06-23  1:36 UTC (permalink / raw)
  To: willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	willemb-hpIqsD4AKlfQT0dZR+AlfA

From: Willem de Bruijn <willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Thu, 22 Jun 2017 16:57:07 -0400

>>
>> Likewise.
>>
>>> +             f_off = f->page_offset;
>>> +             f_size = f->size;
>>> +
>>> +             vaddr = kmap_atomic(skb_frag_page(f));
>>
>> I looked at some kmap_atomic() implementations and I do not think
>> it supports compound pages.
> 
> Indeed. Thanks. It appears that I can do the obvious thing and
> kmap the individual page that is being copied inside the loop:
> 
>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
> 
> This is similar to existing logic in copy_huge_page_from_user
> and __flush_dcache_page in arch/arm/mm/flush.c
> 
> But, this also applies to other skb operations that call kmap_atomic,
> such as skb_copy_bits and __skb_checksum. Not all can be called
> from a codepath with a compound user page, but I have to address
> the ones that can.

Yeah that's quite a mess, it looks like this assumption that
kmap can handle compound pages exists in quite a few places.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
  2017-06-23  1:36               ` David Miller
@ 2017-06-23  3:59                 ` Willem de Bruijn
       [not found]                   ` <CAF=yD-JGekEkBsFfgZa+-TN3-QBPy4sfK0QLPx83Uh3DAoodvg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-23  3:59 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Linux API, Willem de Bruijn

On Thu, Jun 22, 2017 at 9:36 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Thu, 22 Jun 2017 16:57:07 -0400
>
>>>
>>> Likewise.
>>>
>>>> +             f_off = f->page_offset;
>>>> +             f_size = f->size;
>>>> +
>>>> +             vaddr = kmap_atomic(skb_frag_page(f));
>>>
>>> I looked at some kmap_atomic() implementations and I do not think
>>> it supports compound pages.
>>
>> Indeed. Thanks. It appears that I can do the obvious thing and
>> kmap the individual page that is being copied inside the loop:
>>
>>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>>
>> This is similar to existing logic in copy_huge_page_from_user
>> and __flush_dcache_page in arch/arm/mm/flush.c
>>
>> But, this also applies to other skb operations that call kmap_atomic,
>> such as skb_copy_bits and __skb_checksum. Not all can be called
>> from a codepath with a compound user page, but I have to address
>> the ones that can.
>
> Yeah that's quite a mess, it looks like this assumption that
> kmap can handle compound pages exists in quite a few places.

I hadn't even considered that skbs can already hold compound
page frags without zerocopy.

Open coding all call sites to iterate is tedious and unnecessary
in the common case where a page is not highmem.

kmap_atomic has enough slots to map an entire order-3 compound
page at once. But kmap_atomic cannot fail and there may be edge
cases that are larger than order-3.

Packet rings allocate with __GFP_COMP and an order derived
from (user supplied) tp_block_size, for instance. But it links each
skb_frag_t from an individual page, so this case seems okay.

Perhaps calls to kmap_atomic can be replaced with a
kmap_compound(..) that checks

 __this_cpu_read(__kmap_atomic_idx) +  (1 << compound_order(p)) < KM_TYPE_NR

before calling kmap_atomic on all pages in the compound page. In
the common case that the page is not high mem, a single call is
enough, as there is no per-page operation.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
       [not found]                   ` <CAF=yD-JGekEkBsFfgZa+-TN3-QBPy4sfK0QLPx83Uh3DAoodvg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-27 15:53                     ` Willem de Bruijn
       [not found]                       ` <CAF=yD-LJ-EvWbKtxqrcK1D=nRSzWzpPcFVOX1poZ3+P=aa0QfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-27 15:53 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Linux API, Willem de Bruijn

>>>> I looked at some kmap_atomic() implementations and I do not think
>>>> it supports compound pages.
>>>
>>> Indeed. Thanks. It appears that I can do the obvious thing and
>>> kmap the individual page that is being copied inside the loop:
>>>
>>>   kmap_atomic(skb_frag_page(f) + (f_off >> PAGE_SHIFT));
>>>
>>> This is similar to existing logic in copy_huge_page_from_user
>>> and __flush_dcache_page in arch/arm/mm/flush.c
>>>
>>> But, this also applies to other skb operations that call kmap_atomic,
>>> such as skb_copy_bits and __skb_checksum. Not all can be called
>>> from a codepath with a compound user page, but I have to address
>>> the ones that can.
>>
>> Yeah that's quite a mess, it looks like this assumption that
>> kmap can handle compound pages exists in quite a few places.
>
> I hadn't even considered that skbs can already hold compound
> page frags without zerocopy.
>
> Open coding all call sites to iterate is tedious and unnecessary
> in the common case where a page is not highmem.
>
> kmap_atomic has enough slots to map an entire order-3 compound
> page at once. But kmap_atomic cannot fail and there may be edge
> cases that are larger than order-3.
>
> Packet rings allocate with __GFP_COMP and an order derived
> from (user supplied) tp_block_size, for instance. But it links each
> skb_frag_t from an individual page, so this case seems okay.
>
> Perhaps calls to kmap_atomic can be replaced with a
> kmap_compound(..) that checks
>
>  __this_cpu_read(__kmap_atomic_idx) +  (1 << compound_order(p)) < KM_TYPE_NR
>
> before calling kmap_atomic on all pages in the compound page. In
> the common case that the page is not high mem, a single call is
> enough, as there is no per-page operation.

This does not work. Some callers, such as __skb_checksum, cannot
fail, so neither can kmap_compound. Also, vaddr of consecutive
kmap_atomic calls are not guaranteed to be in order. Indeed, on x86
and arm vaddr appears to grows down: (FIXADDR_TOP - ((x) << PAGE_SHIFT))

An alternative is to change the kmap_atomic callers in skbuff.c. To
avoid open coding, we can wrap the kmap_atomic; op; kunmap_atomic
in a macro that loops only if needed:

static inline bool skb_frag_must_loop(struct page *p)
{
#if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)
        if (PageHighMem(p))
                return true;
#endif
        return false;
}

#define skb_frag_map_foreach(f, start, size, p, p_off, cp, copied)      \
        for (p = skb_frag_page(f) + ((start) >> PAGE_SHIFT),            \
             p_off = (start) & (PAGE_SIZE - 1),                         \
             copied = 0,                                                \
             cp = skb_frag_must_loop(p) ?                               \
                    min_t(u32, size, PAGE_SIZE - p_off) : size;         \
             copied < size;                                             \
             copied += cp, p++, p_off = 0,                              \
             cp = min_t(u32, size - copied, PAGE_SIZE))                 \

This does not change behavior on machines without high mem
or on low mem pages.

skb_seq_read keeps a mapping between calls to the function,
so will need a separate approach.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages
       [not found]                       ` <CAF=yD-LJ-EvWbKtxqrcK1D=nRSzWzpPcFVOX1poZ3+P=aa0QfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-29 15:54                         ` Willem de Bruijn
  0 siblings, 0 replies; 22+ messages in thread
From: Willem de Bruijn @ 2017-06-29 15:54 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Linux API, Willem de Bruijn

>> Perhaps calls to kmap_atomic can be replaced with a
>> kmap_compound(..) that checks
>>
>>  __this_cpu_read(__kmap_atomic_idx) +  (1 << compound_order(p)) < KM_TYPE_NR
>>
>> before calling kmap_atomic on all pages in the compound page. In
>> the common case that the page is not high mem, a single call is
>> enough, as there is no per-page operation.
>
> This does not work. Some callers, such as __skb_checksum, cannot
> fail, so neither can kmap_compound. Also, vaddr of consecutive
> kmap_atomic calls are not guaranteed to be in order. Indeed, on x86
> and arm vaddr appears to grows down: (FIXADDR_TOP - ((x) << PAGE_SHIFT))
>
> An alternative is to change the kmap_atomic callers in skbuff.c. To
> avoid open coding, we can wrap the kmap_atomic; op; kunmap_atomic
> in a macro that loops only if needed

I'll send this as RFC. It's not the most elegant solution.

The issue only arises with pages allocated with both __GFP_COMP and
__GFP_HIGHMEM, which is rare: skb_page_frag_refill,
alloc_skb_with_frags, __napi_alloc_skb and most device drivers do not
pass the high mem flag.

Exceptions are rds, mlx5. And transparent hugepages, which is a
problem with zerocopy fragments only (though not only msg_zerocopy,
potentially also the existing virtio and xen paths).

A simpler solution, then, may be to covert rds and mlx5 to not pass
__GFP_HIGHMEM and copy data on all zerocopy requests for this type of
pages.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-06-29 15:54 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-21 21:18 [PATCH net-next v3 00/13] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 01/13] sock: allocate skbs from optmem Willem de Bruijn
     [not found] ` <20170621211816.53837-1-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-06-21 21:18   ` [PATCH net-next v3 02/13] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
     [not found]     ` <20170621211816.53837-3-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-06-22 17:05       ` David Miller
     [not found]         ` <20170622.130528.1762873686654379973.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
2017-06-22 20:57           ` Willem de Bruijn
     [not found]             ` <CAF=yD-+WQ1b9BytG4JuhgHCR8gAKvUVJPUfi4t-u_DqAFnVrkA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-23  1:36               ` David Miller
2017-06-23  3:59                 ` Willem de Bruijn
     [not found]                   ` <CAF=yD-JGekEkBsFfgZa+-TN3-QBPy4sfK0QLPx83Uh3DAoodvg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-27 15:53                     ` Willem de Bruijn
     [not found]                       ` <CAF=yD-LJ-EvWbKtxqrcK1D=nRSzWzpPcFVOX1poZ3+P=aa0QfA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-29 15:54                         ` Willem de Bruijn
2017-06-21 21:18   ` [PATCH net-next v3 13/13] test: add msg_zerocopy test Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 03/13] sock: add MSG_ZEROCOPY Willem de Bruijn
     [not found]   ` <20170621211816.53837-4-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-06-22 17:06     ` David Miller
2017-06-21 21:18 ` [PATCH net-next v3 04/13] sock: add SOCK_ZEROCOPY sockopt Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 05/13] sock: enable MSG_ZEROCOPY Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 06/13] sock: MSG_ZEROCOPY notification coalescing Willem de Bruijn
     [not found]   ` <20170621211816.53837-7-willemdebruijn.kernel-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-06-22 17:07     ` David Miller
2017-06-21 21:18 ` [PATCH net-next v3 07/13] sock: add ee_code SO_EE_CODE_ZEROCOPY_COPIED Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 08/13] sock: ulimit on MSG_ZEROCOPY pages Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 09/13] tcp: enable MSG_ZEROCOPY Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 10/13] udp: " Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 11/13] raw: enable MSG_ZEROCOPY with IP_HDRINCL Willem de Bruijn
2017-06-21 21:18 ` [PATCH net-next v3 12/13] packet: enable MSG_ZEROCOPY Willem de Bruijn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).