* [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of an earlier packet socket RFC patch:

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the latter with
the features required by TCP and others: reference counting, to
support cloning (retransmit queue) and shared fragments (GSO), and
notification coalescing, to handle corking.

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
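
As an illustration, a minimal userspace sketch (not part of this
patchset; the helper is hypothetical, the constants are those defined
in patch 2) that pairs a zerocopy send with reading its completion
from the error queue:

  #include <errno.h>
  #include <linux/errqueue.h>
  #include <stdio.h>
  #include <sys/socket.h>

  #ifndef MSG_ZEROCOPY
  #define MSG_ZEROCOPY 0x4000000
  #endif
  #ifndef SO_EE_ORIGIN_ZEROCOPY
  #define SO_EE_ORIGIN_ZEROCOPY 5
  #endif

  static void send_and_await_notification(int fd, const void *buf, size_t len)
  {
  	char control[64];
  	struct msghdr msg = {0};
  	struct sock_extended_err *serr;
  	struct cmsghdr *cm;

  	if (send(fd, buf, len, MSG_ZEROCOPY) != (ssize_t)len)
  		return;	/* error handling elided */

  	/* buf must not be reused until the notification arrives */
  	msg.msg_control = control;
  	msg.msg_controllen = sizeof(control);
  	while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1 && errno == EAGAIN)
  		;	/* a real caller would poll() for POLLERR */

  	cm = CMSG_FIRSTHDR(&msg);
  	if (!cm)
  		return;
  	serr = (struct sock_extended_err *)CMSG_DATA(cm);
  	if (serr->ee.ee_errno == 0 &&
  	    serr->ee.ee_origin == SO_EE_ORIGIN_ZEROCOPY)
  		printf("notification for send %u\n", serr->ee.ee_data);
  }

With notification coalescing (patch 4), ee_data instead packs an
inclusive [N, N+m] range as (hi << 16) | lo.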

* Performance

The table below shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The last three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and the interrupt handler. Reported is the median of 3 runs.
std is standard netperf, zc uses zerocopy, and % is the ratio.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

	--process cycles--	----cpu cycles----
	   std	   zc	 %	std	    zc	 %
4K	11,060	5,615	51	20,517	19,694	96
16K	 8,706	2,045	23	17,913	15,549	86
64K	 8,105	1,152	14	17,592	12,167	69
256K	 8,087	 926	11	16,953	11,279	66
1M	 7,955	 826	10	17,228	10,655	62

Perf record indicates the main source of these differences. Process
cycles only (perf record; perf report -n):

std:
 Samples: 15K of event 'cycles', Event count (approx.): 7967793182
 73.02%         11564  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  4.73%           746  netperf  [kernel.kallsyms]  [k] __memset
  2.73%           433  netperf  [kernel.kallsyms]  [k] tcp_sendmsg
  2.41%           383  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
  0.90%           143  netperf  [kernel.kallsyms]  [k] copy_from_iter

zc:
 Samples: 1K of event 'cycles', Event count (approx.): 858290585
 17.11%           182  netperf.zc.aug2  [kernel.kallsyms]   [k] gup_pte_range
  9.31%           100  netperf.zc.aug2  [kernel.kallsyms]   [k] __memset
  7.79%            81  netperf.zc.aug2  [kernel.kallsyms]   [k] __zerocopy_sg_from_iter
  3.87%            44  netperf.zc.aug2  [kernel.kallsyms]   [k] __alloc_skb
  3.75%            18  netperf.zc.aug2  netperf.zc.aug2015  [.] allocate_buffer_ring

The individual patches report additional micro-benchmark results.


* Safety

The number of pages that can be pinned on behalf of a process with
MSG_ZEROCOPY is bound by the locked memory ulimit.

Pages are not mapped read-only. Processes can modify packet contents
while packets are in flight in the kernel path. Bytes on which kernel
control flow depends (headers) are copied to avoid TOCTTOU attacks.

Datapath integrity does not otherwise depend on payload, with three
exceptions: checksums, optional sk_filter/tc u32/.. and device +
driver logic. The effect of wrong checksums is limited to the
misbehaving process. Filters may have to be addressed by inserting a
preventative skb_copy_ubufs(). Device drivers can be whitelisted,
similar to scatter-gather support (NETIF_F_SG).

Conversely, while the kernel holds process memory pinned, a process
cannot safely reuse those pages for other purposes. Some protocols,
notably TCP, may hold data for an unbounded length of time. Tun and
virtio bound latency by calling skb_copy_ubufs before cloning and
before injecting packets into unbounded latency paths. This approach
is not feasible for TCP.

Processes can safely avoid OOM conditions by bounding the number of
bytes passed with MSG_ZEROCOPY and by removing shared pages from
their own memory map after transmission -- for instance, depending on
the type of page, by calling munmap() or madvise with
MADV_SOFT_OFFLINE or MADV_DONTNEED. Long-lived kernel references are
an anomaly, and this operation should be rare. The mechanism was
suggested in the earlier zerocopy packet socket patch.
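
For instance, a minimal sketch (not part of this patchset; the helper
name is illustrative) for an anonymous mapping:

  #include <sys/mman.h>

  /* Drop the process mapping of a pinned anonymous buffer. The kernel
   * keeps its own page references, so packet data in flight stays
   * intact; the process sees fresh zero pages on the next write.
   */
  static int reclaim_zerocopy_buffer(void *buf, size_t len)
  {
  	return madvise(buf, len, MADV_DONTNEED);
  }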


* Limitations / Known Issues

- PF_INET6 and PF_UNIX are not yet supported.
- UDP/RAW/PACKET should sleep on ubuf_info alloc failure;
     they currently return ENOBUFS immediately
- TCP does not build max GSO packets, especially for
     small send buffers (< 4 KB)

Willem de Bruijn (10):
  sock: skb_copy_ubufs support for compound pages
  sock: add generic socket zerocopy
  sock: enable generic socket zerocopy
  sock: zerocopy coalesce support
  tcp: enable MSG_ZEROCOPY
  udp: enable MSG_ZEROCOPY
  raw: enable MSG_ZEROCOPY with hdrincl
  packet: enable MSG_ZEROCOPY
  sock: RLIMIT number of pinned pages with MSG_ZEROCOPY
  test: add zerocopy tests

 drivers/vhost/net.c                           |   1 +
 include/linux/mm_types.h                      |   1 +
 include/linux/skbuff.h                        |  72 +++-
 include/linux/socket.h                        |   1 +
 include/net/sock.h                            |   2 +
 include/uapi/linux/errqueue.h                 |   1 +
 net/core/datagram.c                           |  37 +-
 net/core/skbuff.c                             | 297 ++++++++++++--
 net/core/sock.c                               |   2 +
 net/ipv4/ip_output.c                          |  30 +-
 net/ipv4/raw.c                                |  27 +-
 net/ipv4/tcp.c                                |  31 +-
 net/packet/af_packet.c                        |  44 ++-
 tools/testing/selftests/net/Makefile          |   2 +-
 tools/testing/selftests/net/snd_zerocopy.c    | 353 +++++++++++++++++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 535 ++++++++++++++++++++++++++
 16 files changed, 1372 insertions(+), 64 deletions(-)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 01/10] sock: skb_copy_ubufs support for compound pages
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
and UDP zerocopy sendmsg, such fragments may appear.

These skbuffs can also combine kernel and user fragments, e.g., when
corking. Skip the copy for fragments that have only 1 (kernel)
reference.

It is not safe to modify skb frags when the skbuff is shared. This
should not happen. Fail loudly if we find an unexpected edge case.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6a19ca..f1aa781 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
  *	If this function is called from an interrupt gfp_mask() must be
  *	%GFP_ATOMIC.
  *
+ *	skb_shinfo(skb) can only be safely modified when not accessed
+ *	concurrently. Fail if the skb is shared or cloned.
+ *
  *	Returns 0 on success or a negative error code on failure
  *	to allocate kernel memory to copy to.
  */
@@ -880,11 +883,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		WARN_ON(1);
+		return -EINVAL;
+	}
+
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
+		unsigned int order = 0;
+		gfp_t mask = gfp_mask;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-		page = alloc_page(gfp_mask);
+		page = skb_frag_page(f);
+		if (page_count(page) == 1) {
+			skb_frag_ref(skb, i);
+			goto copy_done;
+		}
+
+		if (f->size > PAGE_SIZE) {
+			order = get_order(f->size);
+			mask |= __GFP_COMP;
+		}
+
+		page = alloc_pages(mask, order);
 		if (!page) {
 			while (head) {
 				struct page *next = (struct page *)page_private(head);
@@ -897,6 +918,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		memcpy(page_address(page),
 		       vaddr + f->page_offset, skb_frag_size(f));
 		kunmap_atomic(vaddr);
+copy_done:
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 02/10] sock: add sendmsg zerocopy
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h        |  46 +++++++++++++++++
 include/linux/socket.h        |   1 +
 include/net/sock.h            |   2 +
 include/uapi/linux/errqueue.h |   1 +
 net/core/datagram.c           |  37 ++++++++++----
 net/core/skbuff.c             | 114 ++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   2 +
 7 files changed, 192 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 065e10b..a93e17c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -306,6 +306,7 @@ enum {
 	SKBTX_ACK_TSTAMP = 1 << 7,
 };
 
+#define SKBTX_ZEROCOPY_FRAG	(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP	(SKBTX_SW_TSTAMP    | \
 				 SKBTX_SCHED_TSTAMP | \
 				 SKBTX_ACK_TSTAMP)
@@ -323,8 +324,27 @@ struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
 	unsigned long desc;
+	atomic_t refcnt;
 };
 
+#define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+	atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size);
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+				struct iov_iter *iter, int len,
+				struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1037,6 +1057,32 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
 	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+	bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+	return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+	if (uarg) {
+		sock_zerocopy_get(uarg);
+		skb_shinfo(skb)->destructor_arg = uarg;
+		skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+	}
+}
+
+static inline void skb_zcopy_clear(struct sk_buff *skb)
+{
+	struct ubuf_info *uarg = skb_zcopy(skb);
+
+	if (uarg) {
+		sock_zerocopy_put(uarg);
+		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	}
+}
+
 /**
  *	skb_queue_empty - check if a queue is empty
  *	@list: queue head
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 5bf59c8..5e99866 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -276,6 +276,7 @@ struct ucred {
 #define MSG_SENDPAGE_NOTLAST 0x20000 /* sendpage() internal : not the last page */
 #define MSG_EOF         MSG_FIN
 
+#define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
diff --git a/include/net/sock.h b/include/net/sock.h
index 43c6abc..56895af 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -281,6 +281,7 @@ struct cg_proto;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_tsflags: SO_TIMESTAMPING socket options
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
   *	@sk_socket: Identd and reporting IO signals
   *	@sk_user_data: RPC layer private data
   *	@sk_frag: cached page frag
@@ -419,6 +420,7 @@ struct sock {
 	ktime_t			sk_stamp;
 	u16			sk_tsflags;
 	u32			sk_tskey;
+	atomic_t		sk_zckey;
 	struct socket		*sk_socket;
 	void			*sk_user_data;
 	struct page_frag	sk_frag;
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1..0f15a77 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -18,6 +18,7 @@ struct sock_extended_err {
 #define SO_EE_ORIGIN_ICMP	2
 #define SO_EE_ORIGIN_ICMP6	3
 #define SO_EE_ORIGIN_TXSTATUS	4
+#define SO_EE_ORIGIN_ZEROCOPY	5
 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS
 
 #define SO_EE_OFFENDER(ee)	((struct sockaddr*)((ee)+1))
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 617088a..4d5bbab 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -520,19 +520,16 @@ EXPORT_SYMBOL(skb_copy_datagram_from_iter);
  *	The function will first copy up to headlen, and then pin the userspace
  *	pages and build frags through them.
  *
+ *	XXX: move to net/core/skbuff.c (skipping in this RFC patchset)
+ *
  *	Returns 0, -EFAULT or -EMSGSIZE.
  */
-int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length)
 {
-	int len = iov_iter_count(from);
-	int copy = min_t(int, skb_headlen(skb), len);
-	int frag = 0;
-
-	/* copy up to skb headlen */
-	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
-		return -EFAULT;
+	int frag = skb_shinfo(skb)->nr_frags;
 
-	while (iov_iter_count(from)) {
+	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		size_t start;
 		ssize_t copied;
@@ -542,18 +539,24 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, ~0U,
+		copied = iov_iter_get_pages(from, pages, length,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
 
 		iov_iter_advance(from, copied);
+		length -= copied;
 
 		truesize = PAGE_ALIGN(copied + start);
 		skb->data_len += copied;
 		skb->len += copied;
 		skb->truesize += truesize;
-		atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		if (sk && sk->sk_type == SOCK_STREAM) {
+			sk->sk_wmem_queued += truesize;
+			sk_mem_charge(sk, truesize);
+		} else {
+			atomic_add(truesize, &skb->sk->sk_wmem_alloc);
+		}
 		while (copied) {
 			int size = min_t(int, copied, PAGE_SIZE - start);
 			skb_fill_page_desc(skb, frag++, pages[n], start, size);
@@ -564,6 +567,18 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 	}
 	return 0;
 }
+EXPORT_SYMBOL(__zerocopy_sg_from_iter);
+
+int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
+{
+	int copy = min_t(int, skb_headlen(skb), iov_iter_count(from));
+
+	/* copy up to skb headlen */
+	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
+		return -EFAULT;
+
+	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+}
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
 static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f1aa781..85dc612 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -858,6 +858,120 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+/* must only be called from process context */
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
+{
+	struct sk_buff *skb;
+	struct ubuf_info *uarg;
+
+	skb = sock_wmalloc(sk, 0, 0, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
+	uarg = (void *)skb->cb;
+
+	uarg->callback = sock_zerocopy_callback;
+	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	atomic_set(&uarg->refcnt, 0);
+
+	return uarg;
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_alloc);
+
+#define skb_from_uarg(uarg)	container_of((void *)uarg, struct sk_buff, cb)
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
+{
+	struct sock_exterr_skb *serr;
+	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sock *sk = skb->sk;
+	u16 id = uarg->desc;
+
+	serr = SKB_EXT_ERR(skb);
+	memset(serr, 0, sizeof(*serr));
+	serr->ee.ee_errno = 0;
+	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
+	serr->ee.ee_data = id;
+
+	skb_queue_tail(&sk->sk_error_queue, skb);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		sk->sk_error_report(sk);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
+
+void sock_zerocopy_put(struct ubuf_info *uarg)
+{
+	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		if (uarg->callback)
+			uarg->callback(uarg, true);
+		else
+			consume_skb(skb_from_uarg(uarg));
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size)
+{
+	struct ubuf_info *uarg;
+
+	uarg = sock_zerocopy_alloc(skb->sk, size);
+	if (!uarg)
+		return false;
+
+	skb_zcopy_set(skb, uarg);
+	return true;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_alloc);
+
+extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+				   struct iov_iter *from, size_t length);
+
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+				struct iov_iter *iter, int len,
+				struct ubuf_info *uarg)
+{
+	struct ubuf_info *orig_uarg = skb_zcopy(skb);
+	struct iov_iter orig_iter = *iter;
+	int ret, orig_len = skb->len;
+
+	if (orig_uarg && orig_uarg != uarg)
+		return -EEXIST;
+
+	ret = __zerocopy_sg_from_iter(sk, skb, iter, len);
+	if (ret && (ret != -EMSGSIZE || skb->len == orig_len)) {
+		*iter = orig_iter;
+		___pskb_trim(skb, orig_len);
+		return ret;
+	}
+
+	if (!orig_uarg)
+		skb_zcopy_set(skb, uarg);
+
+	return skb->len - orig_len;
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_add_frags_iter);
+
+/* unused only until next patch in the series; will remove attribute */
+static int __attribute__((unused))
+	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+			      gfp_t gfp_mask)
+{
+	if (skb_zcopy(orig)) {
+		if (skb_zcopy(nskb)) {
+			/* !gfp_mask callers are verified to !skb_zcopy(nskb) */
+			BUG_ON(!gfp_mask);
+			if (skb_uarg(nskb) == skb_uarg(orig))
+				return 0;
+			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
+				return -EIO;
+		}
+		skb_zcopy_set(nskb, skb_uarg(orig));
+	}
+	return 0;
+}
+
 /**
  *	skb_copy_ubufs	-	copy userspace skb frags buffers to kernel
  *	@skb: the skb to modify
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d..0ab9a3b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1525,6 +1525,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 		newsk->sk_forward_alloc = 0;
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
+		atomic_set(&newsk->sk_zckey, 0);
 
 		sock_reset_flag(newsk, SOCK_DONE);
 		skb_queue_head_init(&newsk->sk_error_queue);
@@ -2345,6 +2346,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_sndtimeo		=	MAX_SCHEDULE_TIMEOUT;
 
 	sk->sk_stamp = ktime_set(-1L, 0);
+	atomic_set(&sk->sk_zckey, 0);
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	sk->sk_napi_id		=	0;
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 03/10] sock: enable sendmsg zerocopy
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
    a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
    still call skb_copy_ubufs and thus copy frags in this case.
    For this reason, the skb_orphan_frags calls are not (yet) removed.

(2) not all copies deep in the stack are addressed yet. skb_shift,
    skb_split and skb_try_coalesce can be refined to avoid copying.
    These are not in the hot path and this patch is hairy enough as
    is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 drivers/vhost/net.c    |  1 +
 include/linux/skbuff.h |  6 +++++-
 net/core/skbuff.c      | 48 +++++++++++++++++++-----------------------------
 3 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7d137a4..f3456e1 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -380,6 +380,7 @@ static void handle_tx(struct vhost_net *net)
 			ubuf->callback = vhost_zerocopy_callback;
 			ubuf->ctx = nvq->ubufs;
 			ubuf->desc = nvq->upend_idx;
+			atomic_set(&ubuf->refcnt, 1);
 			msg.msg_control = ubuf;
 			msg.msg_controllen = sizeof(ubuf);
 			ubufs = nvq->ubufs;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a93e17c..3372f1c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2199,7 +2199,9 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+	if (likely(!skb_zcopy(skb)))
+		return 0;
+	if (likely(skb_uarg(skb)->callback == sock_zerocopy_callback))
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2618,6 +2620,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
 				    const struct page *page, int off)
 {
+	if (skb_zcopy(skb))
+		return false;
 	if (i) {
 		const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 85dc612..6ee7282 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -582,21 +582,10 @@ static void skb_release_data(struct sk_buff *skb)
 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i]);
 
-	/*
-	 * If skb buf is from userspace, we need to notify the caller
-	 * the lower device DMA has done;
-	 */
-	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = shinfo->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, true);
-	}
-
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
+	skb_zcopy_clear(skb);
 	skb_free_head(skb);
 }
 
@@ -715,14 +704,7 @@ EXPORT_SYMBOL(kfree_skb_list);
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
-		struct ubuf_info *uarg;
-
-		uarg = skb_shinfo(skb)->destructor_arg;
-		if (uarg->callback)
-			uarg->callback(uarg, false);
-		skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
-	}
+	skb_zcopy_clear(skb);
 }
 EXPORT_SYMBOL(skb_tx_error);
 
@@ -953,9 +935,7 @@ int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_add_frags_iter);
 
-/* unused only until next patch in the series; will remove attribute */
-static int __attribute__((unused))
-	   skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
+static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
 			      gfp_t gfp_mask)
 {
 	if (skb_zcopy(orig)) {
@@ -964,6 +944,8 @@ static int __attribute__((unused))
 			BUG_ON(!gfp_mask);
 			if (skb_uarg(nskb) == skb_uarg(orig))
 				return 0;
+			/* nskb is always new, writable, so copy ubufs is ok */
+			BUG_ON(skb_shared(nskb) || skb_cloned(nskb));
 			if (skb_copy_ubufs(nskb, GFP_ATOMIC))
 				return -EIO;
 		}
@@ -995,7 +977,6 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	int i;
 	int num_frags = skb_shinfo(skb)->nr_frags;
 	struct page *page, *head = NULL;
-	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
 	if (skb_shared(skb) || skb_cloned(skb)) {
 		WARN_ON(1);
@@ -1041,8 +1022,6 @@ copy_done:
 	for (i = 0; i < num_frags; i++)
 		skb_frag_unref(skb, i);
 
-	uarg->callback(uarg, false);
-
 	/* skb frags point to kernel buffers */
 	for (i = num_frags - 1; i >= 0; i--) {
 		__skb_fill_page_desc(skb, i, head, 0,
@@ -1050,7 +1029,7 @@ copy_done:
 		head = (struct page *)page_private(head);
 	}
 
-	skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+	skb_zcopy_clear(skb);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(skb_copy_ubufs);
@@ -1211,11 +1190,13 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 	if (skb_shinfo(skb)->nr_frags) {
 		int i;
 
-		if (skb_orphan_frags(skb, gfp_mask)) {
+		if (skb_orphan_frags(skb, gfp_mask) ||
+		    skb_zerocopy_clone(n, skb, gfp_mask)) {
 			kfree_skb(n);
 			n = NULL;
 			goto out;
 		}
+
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_shinfo(n)->frags[i] = skb_shinfo(skb)->frags[i];
 			skb_frag_ref(skb, i);
@@ -1288,9 +1269,10 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 	 * be since all we did is relocate the values
 	 */
 	if (skb_cloned(skb)) {
-		/* copy this zero copy skb frags */
 		if (skb_orphan_frags(skb, gfp_mask))
 			goto nofrags;
+		if (skb_zcopy(skb))
+			atomic_inc(&skb_uarg(skb)->refcnt);
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			skb_frag_ref(skb, i);
 
@@ -2409,6 +2391,7 @@ skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		skb_tx_error(from);
 		return -ENOMEM;
 	}
+	skb_zerocopy_clone(to, from, GFP_ATOMIC);
 
 	for (i = 0; i < skb_shinfo(from)->nr_frags; i++) {
 		if (!len)
@@ -2686,6 +2669,7 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
 	int pos = skb_headlen(skb);
 
 	skb_shinfo(skb1)->tx_flags = skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
+	skb_zerocopy_clone(skb1, skb, 0);
 	if (len < pos)	/* Split line is inside header. */
 		skb_split_inside_header(skb, skb1, len, pos);
 	else		/* Second chunk has no header, nothing to copy. */
@@ -2727,6 +2711,8 @@ int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen)
 
 	BUG_ON(shiftlen > skb->len);
 	BUG_ON(skb_headlen(skb));	/* Would corrupt stream */
+	if (skb_zcopy(tgt) || skb_zcopy(skb))
+		return 0;
 
 	todo = shiftlen;
 	from = 0;
@@ -3250,6 +3236,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
 		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags &
 			SKBTX_SHARED_FRAG;
+		if (skb_zerocopy_clone(nskb, head_skb, GFP_ATOMIC))
+			goto err;
 
 		while (pos < offset + len) {
 			if (i >= nfrags) {
@@ -4279,6 +4267,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 	if (skb_has_frag_list(to) || skb_has_frag_list(from))
 		return false;
+	if (skb_zcopy(to) || skb_zcopy(from))
+		return false;
 
 	if (skb_headlen(from) != 0) {
 		struct page *page;
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 04/10] sock: sendmsg zerocopy notification coalescing
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Support coalescing of zerocopy notifications.

In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause sendmsg() calls to append to a single
sk_buff and ubuf_info. Modify the notification path to return an
inclusive range of notifications [N..N+m].

Add skb_zerocopy_realloc() to reuse ubuf_info across sendmsg() calls.

Additionally, revise sock_zerocopy_callback() to coalesce consecutive
notifications: if an skb_uarg [1, 1] is freed while [0, 0] is on the
notification queue, modify the head of the queue to read [0, 1] and
drop the second separate notification.

For the case of reliable ordered transmission (TCP), only the upper
value of the range need be read, as the lower value is guaranteed to
be one above the last notification read.
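
To illustrate (a sketch, not part of the patch), a reader of the
error queue can recover the number of completed sends from a
coalesced notification, allowing for u16 counter wraparound:

  #include <stdint.h>

  /* ee_data packs the range as (hi << 16) | lo, per this patch */
  static uint32_t zerocopy_range_count(uint32_t ee_data)
  {
  	uint16_t lo = ee_data & 0xffff;
  	uint16_t hi = ee_data >> 16;

  	/* u16 subtraction handles hi < lo (counter wraparound) */
  	return (uint32_t)(uint16_t)(hi - lo) + 1;
  }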

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h | 11 ++++++-
 net/core/skbuff.c      | 83 ++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3372f1c..99de112 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -323,13 +323,21 @@ enum {
 struct ubuf_info {
 	void (*callback)(struct ubuf_info *, bool zerocopy_success);
 	void *ctx;
-	unsigned long desc;
+	union {
+		unsigned long desc;
+		struct {
+			u16 id;
+			u16 len;
+		};
+	};
 	atomic_t refcnt;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg);
 
 static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 {
@@ -337,6 +345,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6ee7282..4ae60ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -854,7 +854,8 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg = (void *)skb->cb;
 
 	uarg->callback = sock_zerocopy_callback;
-	uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+	uarg->id = ((u16)atomic_inc_return(&sk->sk_zckey)) - 1;
+	uarg->len = 1;
 	atomic_set(&uarg->refcnt, 0);
 
 	return uarg;
@@ -863,20 +864,79 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_alloc);
 
 #define skb_from_uarg(uarg)	container_of((void *)uarg, struct sk_buff, cb)
 
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+					struct ubuf_info *uarg)
+{
+	if (uarg) {
+		u16 next;
+
+		/* realloc only when socket is locked (TCP, UDP cork),
+		 * so uarg->len and sk_zckey access is serialized
+		 */
+		BUG_ON(!sock_owned_by_user(sk));
+
+		if (unlikely(uarg->len == USHRT_MAX - 1))
+			return NULL;
+
+		next = atomic_read(&sk->sk_zckey);
+		if ((u16)(uarg->id + uarg->len) == next) {
+			uarg->len++;
+			atomic_set(&sk->sk_zckey, ++next);
+			return uarg;
+		}
+	}
+
+	return sock_zerocopy_alloc(sk, size);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
+
+static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u16 lo, u16 len)
+{
+	struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+	long sum_len;
+	u16 old_lo, old_hi;
+
+	old_lo = serr->ee.ee_data & 0xFFFF;
+	old_hi = serr->ee.ee_data >> 16;
+	sum_len = old_hi - old_lo + 1 + len;
+	if (old_hi < old_lo)
+		sum_len += (1 << 16);
+
+	if (lo != old_hi + 1 || sum_len >= (1 << 16))
+		return false;
+
+	old_hi += len;
+	serr->ee.ee_data = (old_hi << 16) | old_lo;
+	return true;
+}
+
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 {
 	struct sock_exterr_skb *serr;
-	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sk_buff *head, *skb = skb_from_uarg(uarg);
 	struct sock *sk = skb->sk;
-	u16 id = uarg->desc;
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	unsigned long flags;
+	u16 len, lo, hi;
+
+	len = uarg->len;
+	lo = uarg->id;
+	hi = uarg->id + len - 1;
 
 	serr = SKB_EXT_ERR(skb);
 	memset(serr, 0, sizeof(*serr));
 	serr->ee.ee_errno = 0;
 	serr->ee.ee_origin = SO_EE_ORIGIN_ZEROCOPY;
-	serr->ee.ee_data = id;
+	serr->ee.ee_data = (hi << 16) | lo;
 
-	skb_queue_tail(&sk->sk_error_queue, skb);
+	spin_lock_irqsave(&q->lock, flags);
+	head = skb_peek(q);
+	if (!head || !skb_zerocopy_notify_extend(head, lo, len)) {
+		__skb_queue_tail(q, skb);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
+	consume_skb(skb);
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_error_report(sk);
@@ -886,7 +946,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
 	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
-		if (uarg->callback)
+		/* if !len, there was only 1 call, and it was aborted */
+		if (uarg->callback && uarg->len)
 			uarg->callback(uarg, true);
 		else
 			consume_skb(skb_from_uarg(uarg));
@@ -894,6 +955,16 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
+/* only called when sendmsg returns with error; no notification for this call */
+void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+{
+	if (uarg) {
+		uarg->len--;
+		sock_zerocopy_put(uarg);
+	}
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
+
 bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size)
 {
 	struct ubuf_info *uarg;
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 05/10] tcp: enable sendmsg zerocopy
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable MSG_ZEROCOPY support in the TCP stack. Data sent to a remote
host is transmitted without copying. TSO and GSO are supported.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  loopback test //net/socket:snd_zerocopy_lo -t -z produced:

  without zerocopy (-t):
    rx=93294 (5821 MB) tx=93294 txc=0
    rx=196194 (12243 MB) tx=196194 txc=0
    rx=297942 (18592 MB) tx=297942 txc=0
    rx=397752 (24821 MB) tx=397752 txc=0

  with zerocopy (-t -z):
    rx=200813 (12531 MB) tx=200814 txc=200799
    rx=426605 (26622 MB) tx=426606 txc=426585
    rx=645959 (40310 MB) tx=645960 txc=645933
    rx=877799 (54778 MB) tx=877800 txc=877765

  This test opens a pair of local sockets; on one it calls sendmsg
  with 64KB and optionally MSG_ZEROCOPY, and on the other it reads the
  initial bytes. The receiver truncates, so this is strictly an upper
  bound on what is achievable. It is more representative of sending
  data out of a physical NIC (when payload is not touched, either).

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/tcp.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 45534a5..3711786 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1024,13 +1024,16 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
 }
 EXPORT_SYMBOL(tcp_sendpage);
 
-static inline int select_size(const struct sock *sk, bool sg)
+static inline int select_size(const struct sock *sk, bool sg, bool zerocopy)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	int tmp = tp->mss_cache;
 
 	if (sg) {
 		if (sk_can_gso(sk)) {
+			if (zerocopy)
+				return 0;
+
 			/* Small frames wont use a full page:
 			 * Payload will immediately follow tcp header.
 			 */
@@ -1085,6 +1088,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	struct ubuf_info *uarg = NULL;
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	bool sg;
@@ -1140,6 +1144,17 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 
 	sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+	if (sg && (flags & MSG_ZEROCOPY) && size) {
+		skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+		uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+		if (!uarg) {
+			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
+				goto out_err;
+			uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+		}
+		sock_zerocopy_get(uarg);
+	}
+
 	while (msg_data_left(msg)) {
 		int copy = 0;
 		int max = size_goal;
@@ -1160,7 +1175,7 @@ new_segment:
 				goto wait_for_sndbuf;
 
 			skb = sk_stream_alloc_skb(sk,
-						  select_size(sk, sg),
+						  select_size(sk, sg, uarg),
 						  sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
 			if (!skb)
@@ -1195,7 +1210,7 @@ new_segment:
 			err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
 			if (err)
 				goto do_fault;
-		} else {
+		} else if (!uarg) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1233,6 +1248,15 @@ new_segment:
 				get_page(pfrag->page);
 			}
 			pfrag->offset += copy;
+		} else {
+			err = skb_zerocopy_add_frags_iter(sk, skb,
+							  &msg->msg_iter,
+							  copy, uarg);
+			if (err == -EMSGSIZE)
+				goto new_segment;
+			if (err < 0)
+				goto do_error;
+			copy = err;
 		}
 
 		if (!copied)
@@ -1275,6 +1299,7 @@ out:
 	if (copied)
 		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
 out_nopush:
+	sock_zerocopy_put(uarg);
 	release_sock(sk);
 	return copied + copied_syn;
 
@@ -1292,6 +1317,7 @@ do_error:
 	if (copied + copied_syn)
 		goto out;
 out_err:
+	sock_zerocopy_put_abort(uarg);
 	err = sk_stream_error(sk, flags, err);
 	/* make sure we wake any epoll edge trigger waiter */
 	if (unlikely(skb_queue_len(&sk->sk_write_queue) == 0 && err == -EAGAIN))
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 06/10] udp: enable sendmsg zerocopy
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add MSG_ZEROCOPY support to inet/dgram. This includes udplite.

Tested:
  loopback test //net/socket:snd_zerocopy_lo -u -z passes:

  without zerocopy (-u):
    rx=106644 (6655 MB) tx=106644 txc=0
    rx=219264 (13683 MB) tx=219264 txc=0
    rx=326958 (20403 MB) tx=326958 txc=0
    rx=430260 (26850 MB) tx=430260 txc=0

  with zerocopy (-u -z):
    rx=306924 (19153 MB) tx=306924 txc=306918
    rx=644700 (40232 MB) tx=644700 txc=644694
    rx=979200 (61106 MB) tx=979200 txc=979194
    rx=1308414 (81651 MB) tx=1308414 txc=1308408

  loopback test also passes with corking, with a mix of
  copied and user pages (-U -z):

  without zerocopy (-U):
    rx=105364 (6575 MB) tx=632184 txc=0
    rx=222964 (13913 MB) tx=1337784 txc=0
    rx=349025 (21780 MB) tx=2094150 txc=0
    rx=477526 (29799 MB) tx=2865156 txc=0

  with zerocopy (-U -z):
    rx=140490 (8767 MB) tx=842940 txc=421459
    rx=283919 (17717 MB) tx=1703514 txc=851738
    rx=434414 (27109 MB) tx=2606484 txc=1303213
    rx=571965 (35693 MB) tx=3431790 txc=1715856

  In corked mode, each sendmsg call passes only 1/6th of the total
  datagram, rendering zerocopy less effective.
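
  For reference, a minimal sketch (not part of this patchset; the
  helper name and the 6-way split are illustrative) of the corked
  usage pattern, mixing copied and zerocopy fragments in one datagram:

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef UDP_CORK
    #define UDP_CORK	1
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY	0x4000000	/* as defined in patch 2 */
    #endif

    /* Sketch: build one datagram from six sends, alternating the
     * copied and zerocopy paths, similar to the -U mode above.
     */
    static void send_corked(int fd, const char *chunk, size_t len)
    {
    	int on = 1, off = 0, i;

    	setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));
    	for (i = 0; i < 6; i++)
    		send(fd, chunk, len, (i & 1) ? MSG_ZEROCOPY : 0);
    	setsockopt(fd, IPPROTO_UDP, UDP_CORK, &off, sizeof(off));
    }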

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  5 +++++
 net/ipv4/ip_output.c   | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 99de112..c1ea855 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -335,6 +335,11 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+#define sock_can_zerocopy(sk, rt, csummode) \
+	((rt->dst.dev->features & NETIF_F_SG) && \
+	 ((sk->sk_type == SOCK_RAW) || \
+	  (sk->sk_type == SOCK_DGRAM && csummode & CHECKSUM_UNNECESSARY)))
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 					struct ubuf_info *uarg);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 0138fad..16bab5e 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -871,7 +871,7 @@ static int __ip_append_data(struct sock *sk,
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
-
+	struct ubuf_info *uarg = NULL;
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -914,9 +914,16 @@ static int __ip_append_data(struct sock *sk,
 	    !exthdrlen)
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length &&
+	    sock_can_zerocopy(sk, rt, skb ? skb->ip_summed : csummode)) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+	}
+
 	cork->length += length;
 	if (((length > mtu) || (skb && skb_is_gso(skb))) &&
-	    (sk->sk_protocol == IPPROTO_UDP) &&
+	    (sk->sk_protocol == IPPROTO_UDP) && !uarg &&
 	    (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
 	    (sk->sk_type == SOCK_DGRAM)) {
 		err = ip_ufo_append_data(sk, queue, getfrag, from, length,
@@ -968,6 +975,8 @@ alloc_new_skb:
 			if ((flags & MSG_MORE) &&
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
+			else if (uarg)
+				alloclen = min_t(int, fraglen, MAX_HEADER);
 			else
 				alloclen = fraglen;
 
@@ -1010,11 +1019,12 @@ alloc_new_skb:
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
 			 */
-			data = skb_put(skb, fraglen + exthdrlen);
+			data = skb_put(skb, alloclen);
 			skb_set_network_header(skb, exthdrlen);
 			skb->transport_header = (skb->network_header +
 						 fragheaderlen);
@@ -1030,7 +1040,9 @@ alloc_new_skb:
 				pskb_trim_unique(skb_prev, maxfraglen);
 			}
 
-			copy = datalen - transhdrlen - fraggap;
+			copy = min(datalen,
+				   alloclen - exthdrlen - fragheaderlen);
+			copy -= transhdrlen - fraggap;
 			if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) {
 				err = -EFAULT;
 				kfree_skb(skb);
@@ -1038,7 +1050,7 @@ alloc_new_skb:
 			}
 
 			offset += copy;
-			length -= datalen - fraggap;
+			length -= copy + transhdrlen;
 			transhdrlen = 0;
 			exthdrlen = 0;
 			csummode = CHECKSUM_NONE;
@@ -1063,6 +1075,17 @@ alloc_new_skb:
 				err = -EFAULT;
 				goto error;
 			}
+		} else if (uarg) {
+			struct iov_iter *iter;
+
+			if (sk->sk_type == SOCK_RAW)
+				iter = &((struct msghdr **)from)[0]->msg_iter;
+			else
+				iter = &((struct msghdr *)from)->msg_iter;
+			err = skb_zerocopy_add_frags_iter(sk, skb, iter, copy, uarg);
+			if (err < 0)
+				goto error;
+			copy = err;
 		} else {
 			int i = skb_shinfo(skb)->nr_frags;
 
@@ -1103,6 +1126,7 @@ alloc_new_skb:
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	return err;
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 07/10] raw: enable sendmsg zerocopy with hdrincl
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Add MSG_ZEROCOPY support to inet/raw when passing IP_HDRINCL.

Tested:
  raw loopback test //net/socket:snd_zerocopy_lo -r -z passes:

  without zerocopy (-r):
    rx=69348 (4327 MB) tx=69348 txc=0
    rx=145590 (9085 MB) tx=145590 txc=0
    rx=219210 (13679 MB) tx=219210 txc=0
    rx=293688 (18327 MB) tx=293688 txc=0

  with zerocopy (-r -z):
    rx=258132 (16108 MB) tx=258132 txc=258122
    rx=541266 (33777 MB) tx=541266 txc=541256
    rx=822606 (51334 MB) tx=822606 txc=822596
    rx=1105776 (69005 MB) tx=1105776 txc=1105766

  raw hdrincl loopback test //net/socket:snd_zerocopy_lo -R -z passes:

  without zerocopy (-R):
    rx=101904 (6359 MB) tx=101904 txc=0
    rx=215256 (13432 MB) tx=215256 txc=0
    rx=328584 (20505 MB) tx=328584 txc=0
    rx=442008 (27583 MB) tx=442008 txc=0

  with zerocopy (-R -z):
    rx=265398 (16562 MB) tx=265398 txc=265392
    rx=558744 (34868 MB) tx=558744 txc=558738
    rx=853308 (53250 MB) tx=853308 txc=853302
    rx=1148142 (71649 MB) tx=1148142 txc=1148136

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/ipv4/raw.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 561cd4b..c4fa57d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -347,7 +347,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	unsigned int iphlen;
 	int err;
 	struct rtable *rt = *rtp;
-	int hlen, tlen;
+	int hlen, tlen, linear;
 
 	if (length > rt->dst.dev->mtu) {
 		ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
@@ -359,8 +359,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	hlen = LL_RESERVED_SPACE(rt->dst.dev);
 	tlen = rt->dst.dev->needed_tailroom;
+	linear = length;
+
+	if (flags & MSG_ZEROCOPY && length &&
+	    sock_can_zerocopy(sk, rt, CHECKSUM_UNNECESSARY))
+		linear = min_t(int, length, MAX_HEADER);
+
 	skb = sock_alloc_send_skb(sk,
-				  length + hlen + tlen + 15,
+				  linear + hlen + tlen + 15,
 				  flags & MSG_DONTWAIT, &err);
 	if (!skb)
 		goto error;
@@ -373,15 +379,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 
 	skb_reset_network_header(skb);
 	iph = ip_hdr(skb);
-	skb_put(skb, length);
+	skb_put(skb, linear);
 
 	skb->ip_summed = CHECKSUM_NONE;
 
 	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
-
 	skb->transport_header = skb->network_header;
 	err = -EFAULT;
-	if (memcpy_from_msg(iph, msg, length))
+	if (memcpy_from_msg(iph, msg, linear))
 		goto error_free;
 
 	iphlen = iph->ihl * 4;
@@ -397,6 +402,17 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	if (iphlen > length)
 		goto error_free;
 
+	if (length != linear) {
+		size_t datalen = length - linear;
+
+		if (!skb_zerocopy_alloc(skb, datalen))
+			goto error_zcopy;
+		err = skb_zerocopy_add_frags_iter(sk, skb, &msg->msg_iter,
+						  datalen, skb_uarg(skb));
+		if (err != datalen)
+			goto error_zcopy;
+	}
+
 	if (iphlen >= sizeof(*iph)) {
 		if (!iph->saddr)
 			iph->saddr = fl4->saddr;
@@ -420,6 +436,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 out:
 	return 0;
 
+error_zcopy:
+	sock_zerocopy_put_abort(skb_zcopy(skb));
 error_free:
 	kfree_skb(skb);
 error:
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 08/10] packet: enable sendmsg zerocopy
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Support MSG_ZEROCOPY on PF_PACKET transmission.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b5afe53..8c5588b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2476,28 +2476,47 @@ out:
 
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 				        size_t reserve, size_t len,
-				        size_t linear, int noblock,
+					size_t linear, int flags,
 				        int *err)
 {
 	struct sk_buff *skb;
+	size_t data_len;
 
-	/* Under a page?  Don't bother with paged skb. */
-	if (prepad + len < PAGE_SIZE || !linear)
-		linear = len;
+	if (flags & MSG_ZEROCOPY) {
+		/* Minimize linear, but respect header lower bound */
+		linear = min(len, max_t(size_t, linear, MAX_HEADER));
+		data_len = 0;
+	} else {
+		/* Under a page? Don't bother with paged skb. */
+		if (prepad + len < PAGE_SIZE || !linear)
+			linear = len;
+		data_len = len - linear;
+	}
 
-	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-				   err, 0);
+	skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
+				   flags & MSG_DONTWAIT, err, 0);
 	if (!skb)
 		return NULL;
 
 	skb_reserve(skb, reserve);
 	skb_put(skb, linear);
-	skb->data_len = len - linear;
-	skb->len += len - linear;
+	skb->data_len = data_len;
+	skb->len += data_len;
 
 	return skb;
 }
 
+static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
+					 struct msghdr *msg, size_t size)
+{
+	if (zerocopy_sg_from_iter(skb, &msg->msg_iter))
+		return -EIO;
+	if (!skb_zerocopy_alloc(skb, size))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@@ -2515,6 +2534,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	int hlen, tlen;
 	int extra_len = 0;
 	ssize_t n;
+	bool zerocopy = msg->msg_flags & MSG_ZEROCOPY;
 
 	/*
 	 *	Get and verify the address.
@@ -2611,7 +2631,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	tlen = dev->needed_tailroom;
 	skb = packet_alloc_skb(sk, hlen + tlen, hlen, len,
 			       __virtio16_to_cpu(false, vnet_hdr.hdr_len),
-			       msg->msg_flags & MSG_DONTWAIT, &err);
+			       msg->msg_flags, &err);
 	if (skb == NULL)
 		goto out_unlock;
 
@@ -2628,7 +2648,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	}
 
 	/* Returns -EFAULT on error */
-	err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);
+	if (zerocopy)
+		err = packet_zerocopy_sg_from_iovec(skb, msg, len);
+	else
+		err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter,
+						  len);
 	if (err)
 		goto out_free;
 
@@ -2690,6 +2714,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	return len;
 
 out_free:
+	sock_zerocopy_put_abort(skb_zcopy(skb));
 	kfree_skb(skb);
 out_unlock:
 	if (dev)
-- 
2.5.0.276.gf5e568e


* [PATCH net-next RFC 09/10] sock: sendmsg zerocopy ulimit
From: Willem de Bruijn @ 2015-08-20 14:36 UTC (permalink / raw)
  To: netdev; +Cc: mst, jasowang, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Bound the number of pages that a userspace process may pin.

Account pinned pages to the locked page count (`ulimit -l`) of the
caller and fail beyond the administrator-controlled threshold, similar
to infiniband.

Use an atomic variable to avoid having to take mmap_sem. Taking the
lock is expensive and would require scheduling a worker on
destruction: taking the lock may sleep, but ubuf_info structs are
often destroyed in atomic context.

The current mm_struct.pinned_vm_ is a hack. A non-RFC patchset would
convert the existing unsigned long pinned_vm and all its callers
(infiniband) to atomic_long_t.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/mm_types.h |  1 +
 include/linux/skbuff.h   |  5 +++++
 net/core/skbuff.c        | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0038ac7..dc6e12a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -402,6 +402,7 @@ struct mm_struct {
 	unsigned long total_vm;		/* Total pages mapped */
 	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
 	unsigned long pinned_vm;	/* Refcount permanently increased */
+	atomic_t pinned_vm_;
 	unsigned long shared_vm;	/* Shared pages (files) */
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE */
 	unsigned long stack_vm;		/* VM_GROWSUP/DOWN */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c1ea855..95a9f75 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -331,6 +331,11 @@ struct ubuf_info {
 		};
 	};
 	atomic_t refcnt;
+
+	struct mmpin {
+		struct mm_struct *mm;
+		int num_pg;
+	} mmp;
 };
 
 #define skb_uarg(SKB)	((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4ae60ee..3742968 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -840,6 +840,42 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+{
+	unsigned long max_pg, num_pg, new_pg, old_pg;
+	struct mm_struct *mm;
+
+	if (capable(CAP_IPC_LOCK) || !size)
+		return 0;
+
+	num_pg = (size >> PAGE_SHIFT) + 2;	/* worst case */
+	max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	mm = mmp->mm ? : current->mm;
+
+	do {
+		old_pg = atomic_read(&mm->pinned_vm_);
+		new_pg = old_pg + num_pg;
+		if (new_pg > max_pg)
+			return -ENOMEM;
+	} while (atomic_cmpxchg(&mm->pinned_vm_, old_pg, new_pg) != old_pg);
+
+	if (!mmp->mm) {
+		mmp->mm = mm;
+		atomic_inc(&mm->mm_count);
+	}
+
+	mmp->num_pg += num_pg;
+	return 0;
+}
+
+static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+{
+	if (mmp->mm) {
+		atomic_sub(mmp->num_pg, &mmp->mm->pinned_vm_);
+		mmdrop(mmp->mm);
+	}
+}
+
 /* must only be called from process context */
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
@@ -852,6 +888,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 
 	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
 	uarg = (void *)skb->cb;
+	uarg->mmp.mm = NULL;
+
+	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+		kfree_skb(skb);
+		return NULL;
+	}
 
 	uarg->callback = sock_zerocopy_callback;
 	uarg->id = ((u16)atomic_inc_return(&sk->sk_zckey)) - 1;
@@ -880,6 +922,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
 
 		next = atomic_read(&sk->sk_zckey);
 		if ((u16)(uarg->id + uarg->len) == next) {
+			if (mm_account_pinned_pages(&uarg->mmp, size))
+				return NULL;
 			uarg->len++;
 			atomic_set(&sk->sk_zckey, ++next);
 			return uarg;
@@ -946,6 +990,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
 	if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+		mm_unaccount_pinned_pages(&uarg->mmp);
+
 		/* if !len, there was only 1 call, and it was aborted */
 		if (uarg->callback && uarg->len)
 			uarg->callback(uarg, true);
-- 
2.5.0.276.gf5e568e

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY
  2015-08-20 14:36 [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
                   ` (8 preceding siblings ...)
  2015-08-20 14:36 ` [PATCH net-next RFC 09/10] sock: sendmsg zerocopy ulimit Willem de Bruijn
@ 2015-08-20 22:56 ` David Miller
  2015-08-21  2:49   ` Willem de Bruijn
  9 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2015-08-20 22:56 UTC (permalink / raw)
  To: willemb; +Cc: netdev, mst, jasowang

From: Willem de Bruijn <willemb@google.com>
Date: Thu, 20 Aug 2015 10:36:39 -0400

> Datapath integrity does not otherwise depend on payload, with three
> exceptions: checksums, optional sk_filter/tc u32/.. and device +
> driver logic. The effect of wrong checksums is limited to the
> misbehaving process. Filters may have to be addressed by inserting a
> preventative skb_copy_ubufs(). Device drivers can be whitelisted,
> similar to scatter-gather support (NETIF_F_SG).

Consider a userland NFS implementation sending over loopback while
constantly modifying the page.  The sunrpc code could be tricked into
seeing one thing during validation of the RPC headers, then doing
another after the user makes changes.

I really don't think this is completely safe as-is.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY
  2015-08-20 22:56 ` [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY David Miller
@ 2015-08-21  2:49   ` Willem de Bruijn
  2015-08-21  5:17     ` David Miller
  0 siblings, 1 reply; 13+ messages in thread
From: Willem de Bruijn @ 2015-08-21  2:49 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Michael S. Tsirkin, Jason Wang

On Thu, Aug 20, 2015 at 6:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Willem de Bruijn <willemb@google.com>
> Date: Thu, 20 Aug 2015 10:36:39 -0400
>
>> Datapath integrity does not otherwise depend on payload, with three
>> exceptions: checksums, optional sk_filter/tc u32/.. and device +
>> driver logic. The effect of wrong checksums is limited to the
>> misbehaving process. Filters may have to be addressed by inserting a
>> preventative skb_copy_ubufs(). Device drivers can be whitelisted,
>> similar to scatter-gather support (NETIF_F_SG).
>
> Consider a userland NFS implementation sending over loopback while
> constantly modifying the page.  The sunrpc code could be tricked into
> seeing one thing during validation of the RPC headers then doing
> another after the user makes changes.
>
> I really don't think this is completely safe as-is.

Sunrpc is a great counterexample. Anything that calls
kernel_recvmsg may be problematic, I guess. Copying when
passing to kernel sockets would plug that class of issues.

But there may still be others. The most obvious use case for copy
avoidance is pure device transmit. Excluding loopback may be a
reasonable way to initially limit the attack surface, e.g. with a
flag NETIF_F_ZC that is not set on lo.
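
For concreteness, a minimal sketch of the fallback such a flag could
gate, assuming the skb_zcopy() helper from this series and a
hypothetical NETIF_F_ZC feature bit (nothing here is settled API):

	static int skb_maybe_orphan_zerocopy(struct sk_buff *skb,
					     struct net_device *dev)
	{
		/* Device not whitelisted for zerocopy: give the skb
		 * private copies of the user pages before a local
		 * receiver can see payload that userspace may still
		 * be modifying. */
		if (skb_zcopy(skb) && !(dev->features & NETIF_F_ZC))
			return skb_copy_ubufs(skb, GFP_ATOMIC);

		return 0;
	}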

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY
  2015-08-21  2:49   ` Willem de Bruijn
@ 2015-08-21  5:17     ` David Miller
  0 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2015-08-21  5:17 UTC (permalink / raw)
  To: willemb; +Cc: netdev, mst, jasowang

From: Willem de Bruijn <willemb@google.com>
Date: Thu, 20 Aug 2015 22:49:25 -0400

> But there may still be others. The most obvious use case for copy
> avoidance is pure device transmit. Excluding loopback may be a
> reasonable way to initially limit the attack surface, e.g. with a
> flag NETIF_F_ZC that is not set on lo.

Good luck avoiding every case where a packet can be looped back into
the machine in some way.  A simple device flag is not going to do it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-08-21  5:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-20 14:36 [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 01/10] sock: skb_copy_ubufs support for compound pages Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 02/10] sock: add sendmsg zerocopy Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 03/10] sock: enable " Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 04/10] sock: sendmsg zerocopy notification coalescing Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 05/10] tcp: enable sendmsg zerocopy Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 06/10] udp: " Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 07/10] raw: enable sendmsg zerocopy with hdrincl Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 08/10] packet: enable sendmsg zerocopy Willem de Bruijn
2015-08-20 14:36 ` [PATCH net-next RFC 09/10] sock: sendmsg zerocopy ulimit Willem de Bruijn
2015-08-20 22:56 ` [PATCH net-next RFC 00/10] socket sendmsg MSG_ZEROCOPY David Miller
2015-08-21  2:49   ` Willem de Bruijn
2015-08-21  5:17     ` David Miller
