* [net-next v2 0/8] Perf. optimizations for TCP Recv. Zerocopy
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

This patchset contains several optimizations for TCP Recv. Zerocopy.

Note this is v2 of the patchset, fixing two 32-bit compilation errors
and a stylistic error.

Summarized:
1. A read payload may not be exactly page-aligned - there may be
"straggler" bytes that we cannot cleanly map into the caller's
address space. For this case, we allow the caller to provide a
"hybrid copy buffer" as an argument, turning
getsockopt(TCP_ZEROCOPY_RECEIVE) into a "hybrid" operation and
sparing the caller a subsequent recvmsg() call to read the
stragglers.

2. Similarly, for "small" read payloads that are either smaller than
a page, or small enough that remapping pages is not a performance
win, we allow the user to short-circuit the remapping operations
entirely and simply copy into the provided buffer.

Some of the patches in the middle of this set are refactors to support
this "short-circuiting" optimization.

3. We allow the user to provide a hint that performing a page zap
operation (and the accompanying TLB shootdown) may not be necessary
for the region that the kernel will attempt to map pages into. This
allows us to avoid this expensive operation while holding the socket
lock, which provides a significant performance advantage. A userspace
sketch combining these three pieces follows this list.
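
To make the combined interface concrete, a minimal userspace sketch of
a hybrid read follows. It is illustrative only: "fd", "mapped_region",
and the buffer sizes are assumptions rather than part of the patchset,
error handling is elided, and setting the flag assumes userspace
unmaps the region itself (e.g. with madvise(MADV_DONTNEED)) before
reusing it.

	char copybuf[4096];	/* absorbs stragglers and small reads */
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)mapped_region; /* page-aligned */
	zc.length = mapped_region_len;
	zc.copybuf_address = (__u64)(unsigned long)copybuf;
	zc.copybuf_len = sizeof(copybuf);
	zc.flags = TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT;

	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len) == 0) {
		/* zc.length bytes are now mapped at mapped_region,
		 * zc.copybuf_len > 0 means that many straggler bytes
		 * were copied into copybuf, and zc.recv_skip_hint is
		 * whatever remains for recvmsg().
		 */
	}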

With all of these changes combined, "medium"-sized receive traffic
(multiple tens to a few hundred KB) sees significant efficiency gains
when using TCP receive zerocopy instead of regular recvmsg(). For
example, with RPC-style traffic using 32KB messages, zerocopy yields
a roughly 15% efficiency improvement. Without these changes, zerocopy
incurs a roughly 60-70% efficiency reduction for such messages.

Arjun Roy (8):
  net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
  net-tcp: Introduce tcp_recvmsg_locked().
  net-zerocopy: Refactor skb frag fast-forward op.
  net-zerocopy: Refactor frag-is-remappable test.
  net-zerocopy: Fast return if inq < PAGE_SIZE
  net-zerocopy: Introduce short-circuit small reads.
  net-zerocopy: Set zerocopy hint when data is copied
  net-zerocopy: Defer vm zap unless actually needed.

 include/uapi/linux/tcp.h |   4 +
 net/ipv4/tcp.c           | 446 +++++++++++++++++++++++++++++----------
 2 files changed, 343 insertions(+), 107 deletions(-)

-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

When TCP receive zerocopy cannot map the entire requested space, it
outputs a 'hint' telling the caller how many bytes remain to be
recvmsg()'d.

Augment zerocopy to accept a user buffer into which it tries to copy
this remainder - if the entire remainder fits, it is copied. This
elides a recvmsg() call for received traffic whose size is not
exactly page-aligned.

This was tested with RPC-style traffic of arbitrary sizes. Normally,
each received message required at least one getsockopt() call and one
recvmsg() call for the remaining unaligned data.

With this change, almost all of the recvmsg() calls are eliminated,
reducing the number of system calls by roughly 25-50% for RPC-style
workloads.
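
As a sketch of the resulting caller-side contract (consume() and
handle_error() are hypothetical helpers): on input, copybuf_len is the
number of bytes available at copybuf_address; on output, it is the
number of straggler bytes copied, or a negative error code if the
copy failed.

	if (zc.copybuf_len > 0)
		consume(copybuf, zc.copybuf_len); /* straggler bytes */
	else if (zc.copybuf_len < 0)
		handle_error(zc.copybuf_len);	/* e.g. -EINVAL, -EFAULT */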
---
 include/uapi/linux/tcp.h |  2 +
 net/ipv4/tcp.c           | 84 ++++++++++++++++++++++++++++++++--------
 2 files changed, 70 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index cfcb10b75483..62db78b9c1a0 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
 	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
 	__u32 inq; /* out: amount of bytes in read queue */
 	__s32 err; /* out: socket error */
+	__u64 copybuf_address;	/* in: copybuf address (small reads) */
+	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
 };
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2bc3d7fe9e8..887c6e986927 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1743,6 +1743,52 @@ int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
+				   struct sk_buff *skb, u32 copylen,
+				   u32 *offset, u32 *seq)
+{
+	unsigned long copy_address = (unsigned long)zc->copybuf_address;
+	struct msghdr msg = {};
+	struct iovec iov;
+	int err;
+
+	if (copy_address != zc->copybuf_address)
+		return -EINVAL;
+
+	err = import_single_range(READ, (void __user *)copy_address,
+				  copylen, &iov, &msg.msg_iter);
+	if (err)
+		return err;
+	err = skb_copy_datagram_msg(skb, *offset, &msg, copylen);
+	if (err)
+		return err;
+	zc->recv_skip_hint -= copylen;
+	*offset += copylen;
+	*seq += copylen;
+	return (__s32)copylen;
+}
+
+static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc,
+					     struct sock *sk,
+					     struct sk_buff *skb,
+					     u32 *seq,
+					     s32 copybuf_len)
+{
+	u32 offset, copylen = min_t(u32, copybuf_len, zc->recv_skip_hint);
+
+	if (!copylen)
+		return 0;
+	/* skb is null if inq < PAGE_SIZE. */
+	if (skb)
+		offset = *seq - TCP_SKB_CB(skb)->seq;
+	else
+		skb = tcp_recv_skb(sk, *seq, &offset);
+
+	zc->copybuf_len = tcp_copy_straggler_data(zc, skb, copylen, &offset,
+						  seq);
+	return zc->copybuf_len < 0 ? 0 : copylen;
+}
+
 static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 					struct page **pages,
 					unsigned long pages_to_map,
@@ -1776,8 +1822,10 @@ static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
+	u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0;
 	unsigned long address = (unsigned long)zc->address;
-	u32 length = 0, seq, offset, zap_len;
+	s32 copybuf_len = zc->copybuf_len;
+	struct tcp_sock *tp = tcp_sk(sk);
 	#define PAGE_BATCH_SIZE 8
 	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
@@ -1785,10 +1833,12 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	struct sk_buff *skb = NULL;
 	unsigned long pg_idx = 0;
 	unsigned long curr_addr;
-	struct tcp_sock *tp;
-	int inq;
+	u32 seq = tp->copied_seq;
+	int inq = tcp_inq(sk);
 	int ret;
 
+	zc->copybuf_len = 0;
+
 	if (address & (PAGE_SIZE - 1) || address != zc->address)
 		return -EINVAL;
 
@@ -1797,8 +1847,6 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
-	tp = tcp_sk(sk);
-
 	mmap_read_lock(current->mm);
 
 	vma = find_vma(current->mm, address);
@@ -1806,17 +1854,16 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		mmap_read_unlock(current->mm);
 		return -EINVAL;
 	}
-	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
-
-	seq = tp->copied_seq;
-	inq = tcp_inq(sk);
-	zc->length = min_t(u32, zc->length, inq);
-	zap_len = zc->length & ~(PAGE_SIZE - 1);
-	if (zap_len) {
-		zap_page_range(vma, address, zap_len);
+	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
+	avail_len = min_t(u32, vma_len, inq);
+	aligned_len = avail_len & ~(PAGE_SIZE - 1);
+	if (aligned_len) {
+		zap_page_range(vma, address, aligned_len);
+		zc->length = aligned_len;
 		zc->recv_skip_hint = 0;
 	} else {
-		zc->recv_skip_hint = zc->length;
+		zc->length = avail_len;
+		zc->recv_skip_hint = avail_len;
 	}
 	ret = 0;
 	curr_addr = address;
@@ -1885,13 +1932,18 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	}
 out:
 	mmap_read_unlock(current->mm);
-	if (length) {
+	/* Try to copy straggler data. */
+	if (!ret)
+		copylen = tcp_zerocopy_handle_leftover_data(zc, sk, skb, &seq,
+							    copybuf_len);
+
+	if (length + copylen) {
 		WRITE_ONCE(tp->copied_seq, seq);
 		tcp_rcv_space_adjust(sk);
 
 		/* Clean up data we have read: This will do ACK frames. */
 		tcp_recv_skb(sk, seq, &offset);
-		tcp_cleanup_rbuf(sk, length);
+		tcp_cleanup_rbuf(sk, length + copylen);
 		ret = 0;
 		if (length == zc->length)
 			zc->recv_skip_hint = 0;
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 2/8] net-tcp: Introduce tcp_recvmsg_locked().
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Refactor tcp_recvmsg() by splitting it into locked and unlocked
portions. Callers already holding the socket lock and not using
ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked().
This is in preparation for a short-circuit copy performed by TCP
receive zerocopy for small reads (< PAGE_SIZE, or as otherwise
requested by the user).
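
For reference, the cmsg_flags out-parameter keeps the encoding of the
old local variable; the names below are hypothetical (the code uses
the literals 1 and 2):

	#define TCP_CMSG_INQ 1	/* tp->recvmsg_inq set: emit TCP_CM_INQ cmsg */
	#define TCP_CMSG_TS  2	/* rx timestamp seen: emit timestamp cmsg */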
---
 net/ipv4/tcp.c | 69 ++++++++++++++++++++++++++++----------------------
 1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 887c6e986927..232cb478bacd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2065,36 +2065,28 @@ static int tcp_inq_hint(struct sock *sk)
  *	Probably, code can be easily improved even more.
  */
 
-int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
-		int flags, int *addr_len)
+static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
+			      int nonblock, int flags,
+			      struct scm_timestamping_internal *tss,
+			      int *cmsg_flags)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int copied = 0;
 	u32 peek_seq;
 	u32 *seq;
 	unsigned long used;
-	int err, inq;
+	int err;
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
 	u32 urg_hole = 0;
-	struct scm_timestamping_internal tss;
-	int cmsg_flags;
-
-	if (unlikely(flags & MSG_ERRQUEUE))
-		return inet_recv_error(sk, msg, len, addr_len);
-
-	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue) &&
-	    (sk->sk_state == TCP_ESTABLISHED))
-		sk_busy_loop(sk, nonblock);
-
-	lock_sock(sk);
 
 	err = -ENOTCONN;
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
 
-	cmsg_flags = tp->recvmsg_inq ? 1 : 0;
+	if (tp->recvmsg_inq)
+		*cmsg_flags = 1;
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	/* Urgent data needs to be handled specially. */
@@ -2274,8 +2266,8 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		}
 
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
-			tcp_update_recv_tstamps(skb, &tss);
-			cmsg_flags |= 2;
+			tcp_update_recv_tstamps(skb, tss);
+			*cmsg_flags |= 2;
 		}
 
 		if (used + offset < skb->len)
@@ -2301,22 +2293,9 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
-
-	release_sock(sk);
-
-	if (cmsg_flags) {
-		if (cmsg_flags & 2)
-			tcp_recv_timestamp(msg, sk, &tss);
-		if (cmsg_flags & 1) {
-			inq = tcp_inq_hint(sk);
-			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
-		}
-	}
-
 	return copied;
 
 out:
-	release_sock(sk);
 	return err;
 
 recv_urg:
@@ -2327,6 +2306,36 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	err = tcp_peek_sndq(sk, msg, len);
 	goto out;
 }
+
+int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
+		int flags, int *addr_len)
+{
+	int cmsg_flags = 0, ret, inq;
+	struct scm_timestamping_internal tss;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	if (sk_can_busy_loop(sk) &&
+	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
+	    sk->sk_state == TCP_ESTABLISHED)
+		sk_busy_loop(sk, nonblock);
+
+	lock_sock(sk);
+	ret = tcp_recvmsg_locked(sk, msg, len, nonblock, flags, &tss,
+				 &cmsg_flags);
+	release_sock(sk);
+
+	if (cmsg_flags && ret >= 0) {
+		if (cmsg_flags & 2)
+			tcp_recv_timestamp(msg, sk, &tss);
+		if (cmsg_flags & 1) {
+			inq = tcp_inq_hint(sk);
+			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
+		}
+	}
+	return ret;
+}
 EXPORT_SYMBOL(tcp_recvmsg);
 
 void tcp_set_state(struct sock *sk, int state)
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 3/8] net-zerocopy: Refactor skb frag fast-forward op.
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Refactor skb frag fast-forwarding for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.

skb_advance_to_frag(), given an skb and an offset into it, iterates
from the skb's first frag until it reaches the frag containing that
offset. Assuming the offset provided refers to how many bytes of the
skb have already been read, the returned frag points to the next frag
we may read from, while offset_frag is set to the number of bytes of
this frag that we have already read.

If the returned frag is not NULL and offset_frag is 0, we may be able
to map this frag's page into the process address space with
vm_insert_page(). If offset_frag is non-zero, we cannot do so.
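
A worked example with made-up sizes: suppose the skb has no linear
data (skb_headlen() == 0) and two frags of sizes 4096 and 2048.

	u32 offset_frag;
	skb_frag_t *frag;

	/* 4096 bytes already read: lands exactly on a frag boundary. */
	frag = skb_advance_to_frag(skb, 4096, &offset_frag);
	/* frag is the second frag, offset_frag == 0: may be mappable. */

	/* 4100 bytes already read: 4 bytes into the second frag. */
	frag = skb_advance_to_frag(skb, 4100, &offset_frag);
	/* frag is the second frag, offset_frag == 4: cannot map. */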
---
 net/ipv4/tcp.c | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 232cb478bacd..0f17b46c4c0c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1743,6 +1743,28 @@ int tcp_mmap(struct file *file, struct socket *sock,
 }
 EXPORT_SYMBOL(tcp_mmap);
 
+static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb,
+				       u32 *offset_frag)
+{
+	skb_frag_t *frag;
+
+	offset_skb -= skb_headlen(skb);
+	if ((int)offset_skb < 0 || skb_has_frag_list(skb))
+		return NULL;
+
+	frag = skb_shinfo(skb)->frags;
+	while (offset_skb) {
+		if (skb_frag_size(frag) > offset_skb) {
+			*offset_frag = offset_skb;
+			return frag;
+		}
+		offset_skb -= skb_frag_size(frag);
+		++frag;
+	}
+	*offset_frag = 0;
+	return frag;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1869,6 +1891,8 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		if (zc->recv_skip_hint < PAGE_SIZE) {
+			u32 offset_frag;
+
 			/* If we're here, finish the current batch. */
 			if (pg_idx) {
 				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
@@ -1889,16 +1913,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 				skb = tcp_recv_skb(sk, seq, &offset);
 			}
 			zc->recv_skip_hint = skb->len - offset;
-			offset -= skb_headlen(skb);
-			if ((int)offset < 0 || skb_has_frag_list(skb))
+			frags = skb_advance_to_frag(skb, offset, &offset_frag);
+			if (!frags || offset_frag)
 				break;
-			frags = skb_shinfo(skb)->frags;
-			while (offset) {
-				if (skb_frag_size(frags) > offset)
-					goto out;
-				offset -= skb_frag_size(frags);
-				frags++;
-			}
 		}
 		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
 			int remaining = zc->recv_skip_hint;
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 4/8] net-zerocopy: Refactor frag-is-remappable test.
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Refactor frag-is-remappable test for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.
---
 net/ipv4/tcp.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f17b46c4c0c..4bdd4a358588 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1765,6 +1765,26 @@ static skb_frag_t *skb_advance_to_frag(struct sk_buff *skb, u32 offset_skb,
 	return frag;
 }
 
+static bool can_map_frag(const skb_frag_t *frag)
+{
+	return skb_frag_size(frag) == PAGE_SIZE && !skb_frag_off(frag);
+}
+
+static int find_next_mappable_frag(const skb_frag_t *frag,
+				   int remaining_in_skb)
+{
+	int offset = 0;
+
+	if (likely(can_map_frag(frag)))
+		return 0;
+
+	while (offset < remaining_in_skb && !can_map_frag(frag)) {
+		offset += skb_frag_size(frag);
+		++frag;
+	}
+	return offset;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1890,6 +1910,8 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	ret = 0;
 	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
+		int mappable_offset;
+
 		if (zc->recv_skip_hint < PAGE_SIZE) {
 			u32 offset_frag;
 
@@ -1917,15 +1939,11 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			if (!frags || offset_frag)
 				break;
 		}
-		if (skb_frag_size(frags) != PAGE_SIZE || skb_frag_off(frags)) {
-			int remaining = zc->recv_skip_hint;
 
-			while (remaining && (skb_frag_size(frags) != PAGE_SIZE ||
-					     skb_frag_off(frags))) {
-				remaining -= skb_frag_size(frags);
-				frags++;
-			}
-			zc->recv_skip_hint -= remaining;
+		mappable_offset = find_next_mappable_frag(frags,
+							  zc->recv_skip_hint);
+		if (mappable_offset) {
+			zc->recv_skip_hint = mappable_offset;
 			break;
 		}
 		pages[pg_idx] = skb_frag_page(frags);
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 5/8] net-zerocopy: Fast return if inq < PAGE_SIZE
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Sometimes, we may call TCP receive zerocopy when inq is 0 or
inq < PAGE_SIZE, in which case we cannot remap pages. Simply return
the appropriate hint for regular copying in that case, without taking
mmap_sem.
---
 net/ipv4/tcp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4bdd4a358588..b2f24a5ec230 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1889,6 +1889,14 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
+	if (inq < PAGE_SIZE) {
+		zc->length = 0;
+		zc->recv_skip_hint = inq;
+		if (!inq && sock_flag(sk, SOCK_DONE))
+			return -EIO;
+		return 0;
+	}
+
 	mmap_read_lock(current->mm);
 
 	vma = find_vma(current->mm, address);
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 6/8] net-zerocopy: Introduce short-circuit small reads.
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Sometimes, we may call TCP receive zerocopy when inq is 0,
or inq < PAGE_SIZE, or inq is generally small enough that
it is cheaper to copy than to remap pages.

In these cases, we may want to either return early (inq=0) or
attempt to use the provided copy buffer to simply copy
the received data.

This allows us to save both system call overhead and
the latency of acquiring mmap_sem in read mode for cases where
it would be useless to do so.

This patchset enables this behaviour by:
1. Returning quickly if inq is 0.
2. Attempting to perform a regular copy if a hybrid copybuffer is
   provided and it is large enough to absorb all available bytes.
3. Returning quickly if no such buffer was provided and fewer than
   PAGE_SIZE bytes are available.

For small RPC ping-pong workloads, we would normally issue one
getsockopt(), one recvmsg(), and one sendmsg() call per RPC. With
this change, the recvmsg() call is eliminated entirely. In testing
with small (hundreds of bytes) RPC traffic, this yields a syscall
reduction of about 33% and an efficiency gain of about 3-5%, with
efficiency defined as QPS/CPU utilization.
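
From the caller's side, the three cases map onto the output fields
roughly as follows (a sketch; it assumes copybuf_address/copybuf_len
were filled in before the call, as in patch 1):

	if (zc.copybuf_len > 0) {
		/* Case 2: the whole payload was absorbed by the copy
		 * buffer; nothing was mapped (zc.length == 0).
		 */
	} else if (zc.length == 0 && zc.recv_skip_hint > 0) {
		/* Case 3: too small to map and no usable copy buffer;
		 * read zc.recv_skip_hint bytes with recvmsg().
		 */
	} else if (zc.length == 0 && zc.recv_skip_hint == 0) {
		/* Case 1: nothing was available (inq == 0). */
	}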
---
 net/ipv4/tcp.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2f24a5ec230..f67dd732a47b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1785,6 +1785,39 @@ static int find_next_mappable_frag(const skb_frag_t *frag,
 	return offset;
 }
 
+static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
+			      int nonblock, int flags,
+			      struct scm_timestamping_internal *tss,
+			      int *cmsg_flags);
+static int receive_fallback_to_copy(struct sock *sk,
+				    struct tcp_zerocopy_receive *zc, int inq)
+{
+	unsigned long copy_address = (unsigned long)zc->copybuf_address;
+	struct scm_timestamping_internal tss_unused;
+	int err, cmsg_flags_unused;
+	struct msghdr msg = {};
+	struct iovec iov;
+
+	zc->length = 0;
+	zc->recv_skip_hint = 0;
+
+	if (copy_address != zc->copybuf_address)
+		return -EINVAL;
+
+	err = import_single_range(READ, (void __user *)copy_address,
+				  inq, &iov, &msg.msg_iter);
+	if (err)
+		return err;
+
+	err = tcp_recvmsg_locked(sk, &msg, inq, /*nonblock=*/1, /*flags=*/0,
+				 &tss_unused, &cmsg_flags_unused);
+	if (err < 0)
+		return err;
+
+	zc->copybuf_len = err;
+	return 0;
+}
+
 static int tcp_copy_straggler_data(struct tcp_zerocopy_receive *zc,
 				   struct sk_buff *skb, u32 copylen,
 				   u32 *offset, u32 *seq)
@@ -1889,6 +1922,9 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
+	if (inq && inq <= copybuf_len)
+		return receive_fallback_to_copy(sk, zc, inq);
+
 	if (inq < PAGE_SIZE) {
 		zc->length = 0;
 		zc->recv_skip_hint = inq;
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 7/8] net-zerocopy: Set zerocopy hint when data is copied
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Set the zerocopy hint even when falling back to copy, so that the
pending data can be efficiently received using zerocopy when
possible.
---
 net/ipv4/tcp.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f67dd732a47b..49480ce162db 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1785,6 +1785,43 @@ static int find_next_mappable_frag(const skb_frag_t *frag,
 	return offset;
 }
 
+static void tcp_zerocopy_set_hint_for_skb(struct sock *sk,
+					  struct tcp_zerocopy_receive *zc,
+					  struct sk_buff *skb, u32 offset)
+{
+	u32 frag_offset, partial_frag_remainder = 0;
+	int mappable_offset;
+	skb_frag_t *frag;
+
+	/* worst case: skip to next skb. try to improve on this case below */
+	zc->recv_skip_hint = skb->len - offset;
+
+	/* Find the frag containing this offset (and how far into that frag) */
+	frag = skb_advance_to_frag(skb, offset, &frag_offset);
+	if (!frag)
+		return;
+
+	if (frag_offset) {
+		struct skb_shared_info *info = skb_shinfo(skb);
+
+		/* We read part of the last frag, must recvmsg() rest of skb. */
+		if (frag == &info->frags[info->nr_frags - 1])
+			return;
+
+		/* Else, we must at least read the remainder in this frag. */
+		partial_frag_remainder = skb_frag_size(frag) - frag_offset;
+		zc->recv_skip_hint -= partial_frag_remainder;
+		++frag;
+	}
+
+	/* partial_frag_remainder: If part way through a frag, must read rest.
+	 * mappable_offset: Bytes till next mappable frag, *not* counting bytes
+	 * in partial_frag_remainder.
+	 */
+	mappable_offset = find_next_mappable_frag(frag, zc->recv_skip_hint);
+	zc->recv_skip_hint = mappable_offset + partial_frag_remainder;
+}
+
 static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 			      int nonblock, int flags,
 			      struct scm_timestamping_internal *tss,
@@ -1815,6 +1852,14 @@ static int receive_fallback_to_copy(struct sock *sk,
 		return err;
 
 	zc->copybuf_len = err;
+	if (likely(zc->copybuf_len)) {
+		struct sk_buff *skb;
+		u32 offset;
+
+		skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset);
+		if (skb)
+			tcp_zerocopy_set_hint_for_skb(sk, zc, skb, offset);
+	}
 	return 0;
 }
 
-- 
2.29.2.576.ga3fc446d84-goog



* [net-next v2 8/8] net-zerocopy: Defer vm zap unless actually needed.
From: Arjun Roy @ 2020-12-02 22:09 UTC
  To: davem, netdev; +Cc: arjunroy, edumazet, soheil

From: Arjun Roy <arjunroy@google.com>

Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.

zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().

This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range() as before. Also, if the flag is set but a zap
is still required, the kernel performs that zap as necessary. Thus,
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. The patch also increases the batch size for
vm_insert_pages() and prefetches the page structs for the batch,
since we are about to bump their refcounts.

An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.

When using RPC-style traffic with medium-sized (tens of KB) RPCs,
this change yields an efficiency improvement of about 30% in QPS/CPU
usage.
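
A sketch of the opt-in pattern described above (illustrative; "fd",
"zc" and "zc_len" are set up as in the earlier patches, and userspace
takes over responsibility for unmapping outside the socket lock):

	zc.flags = TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT;
	getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
	/* ... consume the zc.length bytes mapped at zc.address ... */
	madvise((void *)(unsigned long)zc.address, zc.length,
		MADV_DONTNEED);	/* zap ourselves; keeps the hint honest */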
---
 include/uapi/linux/tcp.h |   2 +
 net/ipv4/tcp.c           | 147 ++++++++++++++++++++++++++-------------
 2 files changed, 99 insertions(+), 50 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 62db78b9c1a0..13ceeb395eb8 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -343,6 +343,7 @@ struct tcp_diag_md5sig {
 
 /* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
 
+#define TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT 0x1
 struct tcp_zerocopy_receive {
 	__u64 address;		/* in: address of mapping */
 	__u32 length;		/* in/out: number of bytes to map/mapped */
@@ -351,5 +352,6 @@ struct tcp_zerocopy_receive {
 	__s32 err; /* out: socket error */
 	__u64 copybuf_address;	/* in: copybuf address (small reads) */
 	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
+	__u32 flags; /* in: flags */
 };
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 49480ce162db..83d16f04f464 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1909,51 +1909,101 @@ static int tcp_zerocopy_handle_leftover_data(struct tcp_zerocopy_receive *zc,
 	return zc->copybuf_len < 0 ? 0 : copylen;
 }
 
+static int tcp_zerocopy_vm_insert_batch_error(struct vm_area_struct *vma,
+					      struct page **pending_pages,
+					      unsigned long pages_remaining,
+					      unsigned long *address,
+					      u32 *length,
+					      u32 *seq,
+					      struct tcp_zerocopy_receive *zc,
+					      u32 total_bytes_to_map,
+					      int err)
+{
+	/* At least one page did not map. Try zapping if we skipped earlier. */
+	if (err == -EBUSY &&
+	    zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT) {
+		u32 maybe_zap_len;
+
+		maybe_zap_len = total_bytes_to_map -  /* All bytes to map */
+				*length + /* Mapped or pending */
+				(pages_remaining * PAGE_SIZE); /* Failed map. */
+		zap_page_range(vma, *address, maybe_zap_len);
+		err = 0;
+	}
+
+	if (!err) {
+		unsigned long leftover_pages = pages_remaining;
+		int bytes_mapped;
+
+		/* We called zap_page_range, try to reinsert. */
+		err = vm_insert_pages(vma, *address,
+				      pending_pages,
+				      &pages_remaining);
+		bytes_mapped = PAGE_SIZE * (leftover_pages - pages_remaining);
+		*seq += bytes_mapped;
+		*address += bytes_mapped;
+	}
+	if (err) {
+		/* Either we were unable to zap, OR we zapped, retried an
+		 * insert, and still had an issue. Either ways, pages_remaining
+		 * is the number of pages we were unable to map, and we unroll
+		 * some state we speculatively touched before.
+		 */
+		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
+
+		*length -= bytes_not_mapped;
+		zc->recv_skip_hint += bytes_not_mapped;
+	}
+	return err;
+}
+
 static int tcp_zerocopy_vm_insert_batch(struct vm_area_struct *vma,
 					struct page **pages,
-					unsigned long pages_to_map,
-					unsigned long *insert_addr,
-					u32 *length_with_pending,
+					unsigned int pages_to_map,
+					unsigned long *address,
+					u32 *length,
 					u32 *seq,
-					struct tcp_zerocopy_receive *zc)
+					struct tcp_zerocopy_receive *zc,
+					u32 total_bytes_to_map)
 {
 	unsigned long pages_remaining = pages_to_map;
-	int bytes_mapped;
-	int ret;
+	unsigned int pages_mapped;
+	unsigned int bytes_mapped;
+	int err;
 
-	ret = vm_insert_pages(vma, *insert_addr, pages, &pages_remaining);
-	bytes_mapped = PAGE_SIZE * (pages_to_map - pages_remaining);
+	err = vm_insert_pages(vma, *address, pages, &pages_remaining);
+	pages_mapped = pages_to_map - (unsigned int)pages_remaining;
+	bytes_mapped = PAGE_SIZE * pages_mapped;
 	/* Even if vm_insert_pages fails, it may have partially succeeded in
 	 * mapping (some but not all of the pages).
 	 */
 	*seq += bytes_mapped;
-	*insert_addr += bytes_mapped;
-	if (ret) {
-		/* But if vm_insert_pages did fail, we have to unroll some state
-		 * we speculatively touched before.
-		 */
-		const int bytes_not_mapped = PAGE_SIZE * pages_remaining;
-		*length_with_pending -= bytes_not_mapped;
-		zc->recv_skip_hint += bytes_not_mapped;
-	}
-	return ret;
+	*address += bytes_mapped;
+
+	if (likely(!err))
+		return 0;
+
+	/* Error: maybe zap and retry + rollback state for failed inserts. */
+	return tcp_zerocopy_vm_insert_batch_error(vma, pages + pages_mapped,
+		pages_remaining, address, length, seq, zc, total_bytes_to_map,
+		err);
 }
 
+#define TCP_ZEROCOPY_PAGE_BATCH_SIZE 32
 static int tcp_zerocopy_receive(struct sock *sk,
 				struct tcp_zerocopy_receive *zc)
 {
-	u32 length = 0, offset, vma_len, avail_len, aligned_len, copylen = 0;
+	u32 length = 0, offset, vma_len, avail_len, copylen = 0;
 	unsigned long address = (unsigned long)zc->address;
+	struct page *pages[TCP_ZEROCOPY_PAGE_BATCH_SIZE];
 	s32 copybuf_len = zc->copybuf_len;
 	struct tcp_sock *tp = tcp_sk(sk);
-	#define PAGE_BATCH_SIZE 8
-	struct page *pages[PAGE_BATCH_SIZE];
 	const skb_frag_t *frags = NULL;
+	unsigned int pages_to_map = 0;
 	struct vm_area_struct *vma;
 	struct sk_buff *skb = NULL;
-	unsigned long pg_idx = 0;
-	unsigned long curr_addr;
 	u32 seq = tp->copied_seq;
+	u32 total_bytes_to_map;
 	int inq = tcp_inq(sk);
 	int ret;
 
@@ -1987,34 +2037,24 @@ static int tcp_zerocopy_receive(struct sock *sk,
 	}
 	vma_len = min_t(unsigned long, zc->length, vma->vm_end - address);
 	avail_len = min_t(u32, vma_len, inq);
-	aligned_len = avail_len & ~(PAGE_SIZE - 1);
-	if (aligned_len) {
-		zap_page_range(vma, address, aligned_len);
-		zc->length = aligned_len;
+	total_bytes_to_map = avail_len & ~(PAGE_SIZE - 1);
+	if (total_bytes_to_map) {
+		if (!(zc->flags & TCP_RECEIVE_ZEROCOPY_FLAG_TLB_CLEAN_HINT))
+			zap_page_range(vma, address, total_bytes_to_map);
+		zc->length = total_bytes_to_map;
 		zc->recv_skip_hint = 0;
 	} else {
 		zc->length = avail_len;
 		zc->recv_skip_hint = avail_len;
 	}
 	ret = 0;
-	curr_addr = address;
 	while (length + PAGE_SIZE <= zc->length) {
 		int mappable_offset;
+		struct page *page;
 
 		if (zc->recv_skip_hint < PAGE_SIZE) {
 			u32 offset_frag;
 
-			/* If we're here, finish the current batch. */
-			if (pg_idx) {
-				ret = tcp_zerocopy_vm_insert_batch(vma, pages,
-								   pg_idx,
-								   &curr_addr,
-								   &length,
-								   &seq, zc);
-				if (ret)
-					goto out;
-				pg_idx = 0;
-			}
 			if (skb) {
 				if (zc->recv_skip_hint > 0)
 					break;
@@ -2035,24 +2075,31 @@ static int tcp_zerocopy_receive(struct sock *sk,
 			zc->recv_skip_hint = mappable_offset;
 			break;
 		}
-		pages[pg_idx] = skb_frag_page(frags);
-		pg_idx++;
+		page = skb_frag_page(frags);
+		prefetchw(page);
+		pages[pages_to_map++] = page;
 		length += PAGE_SIZE;
 		zc->recv_skip_hint -= PAGE_SIZE;
 		frags++;
-		if (pg_idx == PAGE_BATCH_SIZE) {
-			ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
-							   &curr_addr, &length,
-							   &seq, zc);
+		if (pages_to_map == TCP_ZEROCOPY_PAGE_BATCH_SIZE ||
+		    zc->recv_skip_hint < PAGE_SIZE) {
+			/* Either full batch, or we're about to go to next skb
+			 * (and we cannot unroll failed ops across skbs).
+			 */
+			ret = tcp_zerocopy_vm_insert_batch(vma, pages,
+							   pages_to_map,
+							   &address, &length,
+							   &seq, zc,
+							   total_bytes_to_map);
 			if (ret)
 				goto out;
-			pg_idx = 0;
+			pages_to_map = 0;
 		}
 	}
-	if (pg_idx) {
-		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pg_idx,
-						   &curr_addr, &length, &seq,
-						   zc);
+	if (pages_to_map) {
+		ret = tcp_zerocopy_vm_insert_batch(vma, pages, pages_to_map,
+						   &address, &length, &seq,
+						   zc, total_bytes_to_map);
 	}
 out:
 	mmap_read_unlock(current->mm);
-- 
2.29.2.576.ga3fc446d84-goog



* Re: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Stephen Hemminger @ 2020-12-03  0:15 UTC
  To: Arjun Roy; +Cc: davem, netdev, arjunroy, edumazet, soheil

On Wed,  2 Dec 2020 14:09:38 -0800
Arjun Roy <arjunroy.kdev@gmail.com> wrote:

> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index cfcb10b75483..62db78b9c1a0 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
>  	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
>  	__u32 inq; /* out: amount of bytes in read queue */
>  	__s32 err; /* out: socket error */
> +	__u64 copybuf_address;	/* in: copybuf address (small reads) */
> +	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
>  };
>  #endif /* _UAPI_LINUX_TCP_H */

You can't safely grow the size of a userspace API without handling the
case of older applications.  Logic in setsockopt() would have to handle
both old and new sizes of the structure.


* Re: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Arjun Roy @ 2020-12-03  0:24 UTC
  To: stephen
  Cc: Arjun Roy, David Miller, netdev, Eric Dumazet, Soheil Hassas Yeganeh

On Wed, Dec 2, 2020 at 4:15 PM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Wed,  2 Dec 2020 14:09:38 -0800
> Arjun Roy <arjunroy.kdev@gmail.com> wrote:
>
> > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > index cfcb10b75483..62db78b9c1a0 100644
> > --- a/include/uapi/linux/tcp.h
> > +++ b/include/uapi/linux/tcp.h
> > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
> >       __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> >       __u32 inq; /* out: amount of bytes in read queue */
> >       __s32 err; /* out: socket error */
> > +     __u64 copybuf_address;  /* in: copybuf address (small reads) */
> > +     __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
> >  };
> >  #endif /* _UAPI_LINUX_TCP_H */
>
> You can't safely grow the size of a userspace API without handling the
> case of older applications.  Logic in setsockopt() would have to handle
> both old and new sizes of the structure.

Acknowledged, but tcp zerocopy receive is a special case: it does not
exist in setsockopt().
Regarding old applications in the getsockopt() case: the current
layout should handle older and newer callers as-is (it devolves to
using only the features the application provides buffer space for).

Please note also v3 for this patchset; it's the same as this but I
forgot the signed-off-by statements here in v2.

Thanks,
-Arjun


* RE: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: David Laight @ 2020-12-03 23:01 UTC
  To: 'Stephen Hemminger', Arjun Roy
  Cc: davem, netdev, arjunroy, edumazet, soheil

From: Stephen Hemminger
> Sent: 03 December 2020 00:15
> 
> On Wed,  2 Dec 2020 14:09:38 -0800
> Arjun Roy <arjunroy.kdev@gmail.com> wrote:
> 
> > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > index cfcb10b75483..62db78b9c1a0 100644
> > --- a/include/uapi/linux/tcp.h
> > +++ b/include/uapi/linux/tcp.h
> > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
> >  	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
> >  	__u32 inq; /* out: amount of bytes in read queue */
> >  	__s32 err; /* out: socket error */
> > +	__u64 copybuf_address;	/* in: copybuf address (small reads) */
> > +	__s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */

You need to swap the order of the above fields to avoid padding
and differing alignments for 32bit and 64bit apps.

> >  };
> >  #endif /* _UAPI_LINUX_TCP_H */
> 
> You can't safely grow the size of a userspace API without handling the
> case of older applications.  Logic in setsockopt() would have to handle
> both old and new sizes of the structure.

You also have to allow for old (working) applications being recompiled
with the new headers.
So you cannot rely on the fields being zero even if you are passed
the size of the structure.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Eric Dumazet @ 2020-12-03 23:14 UTC
  To: David Laight
  Cc: Stephen Hemminger, Arjun Roy, davem, netdev, arjunroy, soheil

On Fri, Dec 4, 2020 at 12:01 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Stephen Hemminger
> > Sent: 03 December 2020 00:15
> >
> > On Wed,  2 Dec 2020 14:09:38 -0800
> > Arjun Roy <arjunroy.kdev@gmail.com> wrote:
> >
> > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > > index cfcb10b75483..62db78b9c1a0 100644
> > > --- a/include/uapi/linux/tcp.h
> > > +++ b/include/uapi/linux/tcp.h
> > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
> > >     __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> > >     __u32 inq; /* out: amount of bytes in read queue */
> > >     __s32 err; /* out: socket error */
> > > +   __u64 copybuf_address;  /* in: copybuf address (small reads) */
> > > +   __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
>
> You need to swap the order of the above fields to avoid padding
> and differing alignments for 32bit and 64bit apps.

I do not think so. Please review this patch series carefully.


>
> > >  };
> > >  #endif /* _UAPI_LINUX_TCP_H */
> >
> > You can't safely grow the size of a userspace API without handling the
> > case of older applications.  Logic in setsockopt() would have to handle
> > both old and new sizes of the structure.
>
> You also have to allow for old (working) applications being recompiled
> with the new headers.
> So you cannot rely on the fields being zero even if you are passed
> the size of the structure.

This is too late, there is precedent for this.

We already mentioned this in tcp_mmap.c reference program.



commit bf5525f3a8e3248be5aa5defe5aaadd60e1c1ba1
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue May 5 20:51:06 2020 -0700

    selftests: net: tcp_mmap: clear whole tcp_zerocopy_receive struct

    We added fields in tcp_zerocopy_receive structure,
    so make sure to clear all fields to not pass garbage to the kernel.

    We were lucky because recent additions added 'out' parameters,
    still we need to clean our reference implementation, before folks
    copy/paste it.

    Fixes: c8856c051454 ("tcp-zerocopy: Return inq along with tcp
receive zerocopy.")
    Fixes: 33946518d493 ("tcp-zerocopy: Return sk_err (if set) along
with tcp receive zerocopy.")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Arjun Roy <arjunroy@google.com>
    Cc: Soheil Hassas Yeganeh <soheil@google.com>
    Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/tools/testing/selftests/net/tcp_mmap.c
b/tools/testing/selftests/net/tcp_mmap.c
index 35505b31e5cc092453ea7b72d9dba45bed2d6549..62171fd638c817dabe2d988f3cfae74522112584
100644
--- a/tools/testing/selftests/net/tcp_mmap.c
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -165,9 +165,10 @@ void *child_thread(void *arg)
                        socklen_t zc_len = sizeof(zc);
                        int res;

+                       memset(&zc, 0, sizeof(zc));
                        zc.address = (__u64)((unsigned long)addr);
                        zc.length = chunk_size;
-                       zc.recv_skip_hint = 0;
+
                        res = getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                                         &zc, &zc_len);
                        if (res == -1)




>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>


* Re: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Arjun Roy @ 2020-12-03 23:19 UTC
  To: David Laight
  Cc: Stephen Hemminger, Arjun Roy, davem, netdev, edumazet, soheil

On Thu, Dec 3, 2020 at 3:01 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Stephen Hemminger
> > Sent: 03 December 2020 00:15
> >
> > On Wed,  2 Dec 2020 14:09:38 -0800
> > Arjun Roy <arjunroy.kdev@gmail.com> wrote:
> >
> > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > > index cfcb10b75483..62db78b9c1a0 100644
> > > --- a/include/uapi/linux/tcp.h
> > > +++ b/include/uapi/linux/tcp.h
> > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
> > >     __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> > >     __u32 inq; /* out: amount of bytes in read queue */
> > >     __s32 err; /* out: socket error */
> > > +   __u64 copybuf_address;  /* in: copybuf address (small reads) */
> > > +   __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
>
> You need to swap the order of the above fields to avoid padding
> and differing alignments for 32bit and 64bit apps.
>

Just to double check, are you referring to the order of
copybuf_address and copybuf_len?
If so, it seems that the current ordering is not creating any
alignment holes, but flipping it would: https://godbolt.org/z/bdxP6b


> > >  };
> > >  #endif /* _UAPI_LINUX_TCP_H */
> >
> > You can't safely grow the size of a userspace API without handling the
> > case of older applications.  Logic in setsockopt() would have to handle
> > both old and new sizes of the structure.
>
> You also have to allow for old (working) applications being recompiled
> with the new headers.
> So you cannot rely on the fields being zero even if you are passed
> the size of the structure.
>

I think this should already be taken care of in the current code; the
full-sized struct with new fields is being zero-initialized, then
we're getting the user-provided optlen, then copying from userspace
only that much data. So the newer fields would be zero in that case,
so this should handle the case of new kernels but old applications.
Does this address the concern, or am I misunderstanding?
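
Roughly, the handling in question looks like this (a sketch, not a
verbatim excerpt of the kernel code):

	struct tcp_zerocopy_receive zc;

	memset(&zc, 0, sizeof(zc));	/* new fields default to zero */
	if (len > sizeof(zc))
		len = sizeof(zc);	/* cap at what this kernel knows */
	if (copy_from_user(&zc, optval, len))
		return -EFAULT;	/* a short len from an old caller leaves
				 * the tail fields zeroed */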

Thanks,
-Arjun

>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>


* Re: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Arjun Roy @ 2020-12-03 23:24 UTC
  To: David Laight
  Cc: Stephen Hemminger, Arjun Roy, davem, netdev, edumazet, soheil

On Thu, Dec 3, 2020 at 3:19 PM Arjun Roy <arjunroy@google.com> wrote:
>
> On Thu, Dec 3, 2020 at 3:01 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Stephen Hemminger
> > > Sent: 03 December 2020 00:15
> > >
> > > On Wed,  2 Dec 2020 14:09:38 -0800
> > > Arjun Roy <arjunroy.kdev@gmail.com> wrote:
> > >
> > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > > > index cfcb10b75483..62db78b9c1a0 100644
> > > > --- a/include/uapi/linux/tcp.h
> > > > +++ b/include/uapi/linux/tcp.h
> > > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
> > > >     __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> > > >     __u32 inq; /* out: amount of bytes in read queue */
> > > >     __s32 err; /* out: socket error */
> > > > +   __u64 copybuf_address;  /* in: copybuf address (small reads) */
> > > > +   __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
> >
> > You need to swap the order of the above fields to avoid padding
> > and differing alignments for 32bit and 64bit apps.
> >
>
> Just to double check, are you referring to the order of
> copybuf_address and copybuf_len?
> If so, it seems that the current ordering is not creating any
> alignment holes, but flipping it would: https://godbolt.org/z/bdxP6b
>
>
> > > >  };
> > > >  #endif /* _UAPI_LINUX_TCP_H */
> > >
> > > You can't safely grow the size of a userspace API without handling the
> > > case of older applications.  Logic in setsockopt() would have to handle
> > > both old and new sizes of the structure.
> >
> > You also have to allow for old (working) applications being recompiled
> > with the new headers.
> > So you cannot rely on the fields being zero even if you are passed
> > the size of the structure.
> >
>
> I think this should already be taken care of in the current code; the
> full-sized struct with new fields is being zero-initialized, then
> we're getting the user-provided optlen, then copying from userspace
> only that much data. So the newer fields would be zero in that case,
> so this should handle the case of new kernels but old applications.
> Does this address the concern, or am I misunderstanding?
>

Actually, on closer read, perhaps the following is what you have in
mind for the old application?

struct zerocopy_args args;
args.address = ...;
args.length = ...;
args.recv_skip_hint = ...;
args.inq = ...;
args.err = ...;
getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &args, sizeof(args));
// sizeof(args) is now bigger when recompiled with new headers, but we
// did not explicitly set the new fields to 0, therefore issues

-Arjun

> Thanks,
> -Arjun
>
> >         David
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
> >


* RE: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: David Laight @ 2020-12-04  9:02 UTC
  To: 'Eric Dumazet'
  Cc: Stephen Hemminger, Arjun Roy, davem, netdev, arjunroy, soheil

From: Eric
> Sent: 03 December 2020 23:15
> 
> On Fri, Dec 4, 2020 at 12:01 AM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Stephen Hemminger
> > > Sent: 03 December 2020 00:15
> > >
> > > On Wed,  2 Dec 2020 14:09:38 -0800
> > > Arjun Roy <arjunroy.kdev@gmail.com> wrote:
> > >
> > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> > > > index cfcb10b75483..62db78b9c1a0 100644
> > > > --- a/include/uapi/linux/tcp.h
> > > > +++ b/include/uapi/linux/tcp.h
> > > > @@ -349,5 +349,7 @@ struct tcp_zerocopy_receive {
> > > >     __u32 recv_skip_hint;   /* out: amount of bytes to skip */
> > > >     __u32 inq; /* out: amount of bytes in read queue */
> > > >     __s32 err; /* out: socket error */
> > > > +   __u64 copybuf_address;  /* in: copybuf address (small reads) */
> > > > +   __s32 copybuf_len; /* in/out: copybuf bytes avail/used or error */
> >
> > You need to swap the order of the above fields to avoid padding
> > and differing alignments for 32bit and 64bit apps.
> 
> I do not think so. Please review this patch series carefully.

Late at night.
The actual problem is 'tail padding'.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* RE: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: David Laight @ 2020-12-04  9:03 UTC
  To: 'Arjun Roy'
  Cc: Stephen Hemminger, Arjun Roy, davem, netdev, edumazet, soheil

From: Arjun Roy
> Sent: 03 December 2020 23:25
...
> > > You also have to allow for old (working) applications being recompiled
> > > with the new headers.
> > > So you cannot rely on the fields being zero even if you are passed
> > > the size of the structure.
> > >
> >
> > I think this should already be taken care of in the current code; the
> > full-sized struct with new fields is being zero-initialized, then
> > we're getting the user-provided optlen, then copying from userspace
> > only that much data. So the newer fields would be zero in that case,
> > so this should handle the case of new kernels but old applications.
> > Does this address the concern, or am I misunderstanding?
> >
> 
> Actually, on closer read, perhaps the following is what you have in
> mind for the old application?
> 
> struct zerocopy_args args;
> args.address = ...;
> args.length = ...;
> args.recv_skip_hint = ...;
> args.inq = ...;
> args.err = ...;
> getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &args, sizeof(args));
> // sizeof(args) is now bigger when recompiled with new headers, but we
> // did not explicitly set the new fields to 0, therefore issues

That's the one...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [net-next v2 1/8] net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
From: Arjun Roy @ 2020-12-04 22:37 UTC
  To: David Laight
  Cc: Stephen Hemminger, Arjun Roy, davem, netdev, edumazet, soheil

On Fri, Dec 4, 2020 at 1:03 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Arjun Roy
> > Sent: 03 December 2020 23:25
> ...
> > > > You also have to allow for old (working) applications being recompiled
> > > > with the new headers.
> > > > So you cannot rely on the fields being zero even if you are passed
> > > > the size of the structure.
> > > >
> > >
> > > I think this should already be taken care of in the current code; the
> > > full-sized struct with new fields is being zero-initialized, then
> > > we're getting the user-provided optlen, then copying from userspace
> > > only that much data. So the newer fields would be zero in that case,
> > > so this should handle the case of new kernels but old applications.
> > > Does this address the concern, or am I misunderstanding?
> > >
> >
> > Actually, on closer read, perhaps the following is what you have in
> > mind for the old application?
> >
> > struct zerocopy_args args;
> > args.address = ...;
> > args.length = ...;
> > args.recv_skip_hint = ...;
> > args.inq = ...;
> > args.err = ...;
> > getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &args, sizeof(args));
> > // sizeof(args) is now bigger when recompiled with new headers, but we
> > // did not explicitly set the new fields to 0, therefore issues
>
> That's the one...
>

I'll defer in this case to Eric's previous response in this thread
(paraphrased: that there is precedent for this in tcp_mmap.c).

-Arjun

>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
