* [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive
@ 2018-04-25 21:43 Eric Dumazet
  2018-04-25 21:43 ` [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Eric Dumazet @ 2018-04-25 21:43 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Eric Dumazet,
	Eric Dumazet

syzbot reported a lockdep issue caused by tcp mmap() support.

I implemented Andy Lutomirski's nice suggestions to resolve the
issue and increase scalability as well.

The first patch adds a new setsockopt() operation and changes mmap()
behavior.

The second patch updates the tcp_mmap reference program.

v2:
 Added a missing page align of zc->length in tcp_zerocopy_receive()
 Properly clear zc->recv_skip_hint in case user request was completed.
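
The length handling mentioned in the v2 notes can be modelled in user
space. The sketch below is only an illustration of the arithmetic in
tcp_zerocopy_receive(): the function name zc_clamp_length() and its
parameters are hypothetical, not kernel API, and it assumes a
power-of-two page size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of the length clamping done in
 * tcp_zerocopy_receive(); the name and parameters are illustrative.
 */
uint32_t zc_clamp_length(uint32_t requested, uint64_t vma_remaining,
			 uint32_t inq, uint32_t page_size)
{
	uint64_t len = requested;

	/* The mask below only works for power-of-two page sizes. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	if (len > vma_remaining)	/* do not run past the end of the VMA */
		len = vma_remaining;
	if (len > inq)			/* do not map more than is queued */
		len = inq;

	/* v2 fix: round down so that only whole pages are mapped. */
	return (uint32_t)(len & ~(uint64_t)(page_size - 1));
}
```

With 4096-byte pages, a 64 KB request against 10000 queued bytes is
reduced to 8192 bytes, so the trailing partial page is left for a
conventional read.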

Eric Dumazet (2):
  tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

 include/uapi/linux/tcp.h               |   8 ++
 net/ipv4/tcp.c                         | 189 +++++++++++++------------
 tools/testing/selftests/net/tcp_mmap.c |  63 +++++----
 3 files changed, 142 insertions(+), 118 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog


* [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  2018-04-25 21:43 [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive Eric Dumazet
@ 2018-04-25 21:43 ` Eric Dumazet
  2018-04-26 13:40   ` Ka-Cheong Poon
  2018-04-27  8:44     ` kbuild test robot
  2018-04-25 21:43 ` [PATCH v2 net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE Eric Dumazet
  2018-04-26  1:20 ` [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive Soheil Hassas Yeganeh
  2 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2018-04-25 21:43 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Eric Dumazet,
	Eric Dumazet, Soheil Hassas Yeganeh

When adding the tcp mmap() implementation, I forgot that the socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.

Since we cannot lock the socket in the tcp mmap() handler, we have to
split the operation into two phases.

1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
  This operation does not involve any TCP locking.

2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
  the transfer of pages from skbs to one VMA.
  This operation only uses down_read(&current->mm->mmap_sem) after
  taking the TCP socket lock, thus solving the lockdep issue.

This new implementation was suggested by Andy Lutomirski in great detail.

Benefits are:

- Better scalability, in case multiple threads reuse VMAs
   (without mmap()/munmap() calls) since mmap_sem won't be write locked.

- Better error recovery.
   The previous mmap() model had to provide the expected size of the
   mapping. If for some reason one part could not be mapped (partial MSS),
   the whole operation had to be aborted.
   With the tcp_zerocopy_receive struct, the kernel can report how
   many bytes were successfully mapped, and how many bytes should
   be read to skip the problematic sequence.

- No more memory allocation to hold an array of page pointers.
  16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/

- skbs are freed after mmap_sem has been released.

The following patch updates the tcp_mmap tool to demonstrate
one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)

Note that memcg might require additional changes.

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Cc: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/uapi/linux/tcp.h |   8 ++
 net/ipv4/tcp.c           | 189 ++++++++++++++++++++-------------------
 2 files changed, 106 insertions(+), 91 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,7 @@ enum {
 #define TCP_MD5SIG_EXT		32	/* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY	33	/* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
+#define TCP_ZEROCOPY_RECEIVE	35
 
 struct tcp_repair_opt {
 	__u32	opt_code;
@@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
+
+struct tcp_zerocopy_receive {
+	__u64 address;		/* in: address of mapping */
+	__u32 length;		/* in/out: number of bytes to map/mapped */
+	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
+};
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad47112fc23c61180b5bf8edd2c736..9b9cbb837ff8d6cdd85515429f699e113c55b37b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,118 +1726,111 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
-/* When user wants to mmap X pages, we first need to perform the mapping
- * before freeing any skbs in receive queue, otherwise user would be unable
- * to fallback to standard recvmsg(). This happens if some data in the
- * requested block is not exactly fitting in a page.
- *
- * We only support order-0 pages for the moment.
- * mmap() on TCP is very strict, there is no point
- * trying to accommodate with pathological layouts.
- */
+static const struct vm_operations_struct tcp_vm_ops = {
+};
+
 int tcp_mmap(struct file *file, struct socket *sock,
 	     struct vm_area_struct *vma)
 {
-	unsigned long size = vma->vm_end - vma->vm_start;
-	unsigned int nr_pages = size >> PAGE_SHIFT;
-	struct page **pages_array = NULL;
-	u32 seq, len, offset, nr = 0;
-	struct sock *sk = sock->sk;
-	const skb_frag_t *frags;
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+		return -EPERM;
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+	/* Instruct vm_insert_page() to not down_read(mmap_sem) */
+	vma->vm_flags |= VM_MIXEDMAP;
+
+	vma->vm_ops = &tcp_vm_ops;
+	return 0;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
+static int tcp_zerocopy_receive(struct sock *sk,
+				struct tcp_zerocopy_receive *zc)
+{
+	unsigned long address = (unsigned long)zc->address;
+	const skb_frag_t *frags = NULL;
+	u32 length = 0, seq, offset;
+	struct vm_area_struct *vma;
+	struct sk_buff *skb = NULL;
 	struct tcp_sock *tp;
-	struct sk_buff *skb;
 	int ret;
 
-	if (vma->vm_pgoff || !nr_pages)
+	if (address & (PAGE_SIZE - 1) || address != zc->address)
 		return -EINVAL;
 
-	if (vma->vm_flags & VM_WRITE)
-		return -EPERM;
-	/* TODO: Maybe the following is not needed if pages are COW */
-	vma->vm_flags &= ~VM_MAYWRITE;
-
-	lock_sock(sk);
-
-	ret = -ENOTCONN;
 	if (sk->sk_state == TCP_LISTEN)
-		goto out;
+		return -ENOTCONN;
 
 	sock_rps_record_flow(sk);
 
-	if (tcp_inq(sk) < size) {
-		ret = sock_flag(sk, SOCK_DONE) ? -EIO : -EAGAIN;
+	down_read(&current->mm->mmap_sem);
+
+	ret = -EINVAL;
+	vma = find_vma(current->mm, address);
+	if (!vma || vma->vm_start > address || vma->vm_ops != &tcp_vm_ops)
 		goto out;
-	}
+	zc->length = min_t(unsigned long, zc->length, vma->vm_end - address);
+
 	tp = tcp_sk(sk);
 	seq = tp->copied_seq;
-	/* Abort if urgent data is in the area */
-	if (unlikely(tp->urg_data)) {
-		u32 urg_offset = tp->urg_seq - seq;
+	zc->length = min_t(u32, zc->length, tcp_inq(sk));
+	zc->length &= ~(PAGE_SIZE - 1);
 
-		ret = -EINVAL;
-		if (urg_offset < size)
-			goto out;
-	}
-	ret = -ENOMEM;
-	pages_array = kvmalloc_array(nr_pages, sizeof(struct page *),
-				     GFP_KERNEL);
-	if (!pages_array)
-		goto out;
-	skb = tcp_recv_skb(sk, seq, &offset);
-	ret = -EINVAL;
-skb_start:
-	/* We do not support anything not in page frags */
-	offset -= skb_headlen(skb);
-	if ((int)offset < 0)
-		goto out;
-	if (skb_has_frag_list(skb))
-		goto out;
-	len = skb->data_len - offset;
-	frags = skb_shinfo(skb)->frags;
-	while (offset) {
-		if (frags->size > offset)
-			goto out;
-		offset -= frags->size;
-		frags++;
-	}
-	while (nr < nr_pages) {
-		if (len) {
-			if (len < PAGE_SIZE)
-				goto out;
-			if (frags->size != PAGE_SIZE || frags->page_offset)
-				goto out;
-			pages_array[nr++] = skb_frag_page(frags);
-			frags++;
-			len -= PAGE_SIZE;
-			seq += PAGE_SIZE;
-			continue;
-		}
-		skb = skb->next;
-		offset = seq - TCP_SKB_CB(skb)->seq;
-		goto skb_start;
-	}
-	/* OK, we have a full set of pages ready to be inserted into vma */
-	for (nr = 0; nr < nr_pages; nr++) {
-		ret = vm_insert_page(vma, vma->vm_start + (nr << PAGE_SHIFT),
-				     pages_array[nr]);
-		if (ret)
-			goto out;
-	}
-	/* operation is complete, we can 'consume' all skbs */
-	tp->copied_seq = seq;
-	tcp_rcv_space_adjust(sk);
-
-	/* Clean up data we have read: This will do ACK frames. */
-	tcp_recv_skb(sk, seq, &offset);
-	tcp_cleanup_rbuf(sk, size);
+	zap_page_range(vma, address, zc->length);
 
+	zc->recv_skip_hint = 0;
 	ret = 0;
+	while (length + PAGE_SIZE <= zc->length) {
+		if (zc->recv_skip_hint < PAGE_SIZE) {
+			if (skb) {
+				skb = skb->next;
+				offset = seq - TCP_SKB_CB(skb)->seq;
+			} else {
+				skb = tcp_recv_skb(sk, seq, &offset);
+			}
+
+			zc->recv_skip_hint = skb->len - offset;
+			offset -= skb_headlen(skb);
+			if ((int)offset < 0 || skb_has_frag_list(skb))
+				break;
+			frags = skb_shinfo(skb)->frags;
+			while (offset) {
+				if (frags->size > offset)
+					goto out;
+				offset -= frags->size;
+				frags++;
+			}
+		}
+		if (frags->size != PAGE_SIZE || frags->page_offset)
+			break;
+		ret = vm_insert_page(vma, address + length,
+				     skb_frag_page(frags));
+		if (ret)
+			break;
+		length += PAGE_SIZE;
+		seq += PAGE_SIZE;
+		zc->recv_skip_hint -= PAGE_SIZE;
+		frags++;
+	}
 out:
-	release_sock(sk);
-	kvfree(pages_array);
+	up_read(&current->mm->mmap_sem);
+	if (length) {
+		tp->copied_seq = seq;
+		tcp_rcv_space_adjust(sk);
+
+		/* Clean up data we have read: This will do ACK frames. */
+		tcp_recv_skb(sk, seq, &offset);
+		tcp_cleanup_rbuf(sk, length);
+		ret = 0;
+		if (length == zc->length)
+			zc->recv_skip_hint = 0;
+	} else {
+		if (!zc->recv_skip_hint && sock_flag(sk, SOCK_DONE))
+			ret = -EIO;
+	}
+	zc->length = length;
 	return ret;
 }
-EXPORT_SYMBOL(tcp_mmap);
 
 static void tcp_update_recv_tstamps(struct sk_buff *skb,
 				    struct scm_timestamping *tss)
@@ -2738,6 +2731,20 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 
 		return tcp_fastopen_reset_cipher(net, sk, key, sizeof(key));
 	}
+	case TCP_ZEROCOPY_RECEIVE: {
+		struct tcp_zerocopy_receive zc;
+
+		if (optlen != sizeof(zc))
+			return -EINVAL;
+		if (copy_from_user(&zc, optval, optlen))
+			return -EFAULT;
+		lock_sock(sk);
+		err = tcp_zerocopy_receive(sk, &zc);
+		release_sock(sk);
+		if (!err && copy_to_user(optval, &zc, optlen))
+			err = -EFAULT;
+		return err;
+	}
 	default:
 		/* fallthru */
 		break;
-- 
2.17.0.441.gb46fe60e1d-goog


* [PATCH v2 net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE
  2018-04-25 21:43 [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive Eric Dumazet
  2018-04-25 21:43 ` [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet
@ 2018-04-25 21:43 ` Eric Dumazet
  2018-04-26  1:20 ` [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive Soheil Hassas Yeganeh
  2 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2018-04-25 21:43 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Eric Dumazet,
	Eric Dumazet, Soheil Hassas Yeganeh

After the prior kernel change, mmap() on a TCP socket only reserves a VMA.

We have to use setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
to perform the transfer of pages from skbs in the TCP receive queue into such a VMA.

struct tcp_zerocopy_receive {
	__u64 address;		/* in: address of mapping */
	__u32 length;		/* in/out: number of bytes to map/mapped */
	__u32 recv_skip_hint;	/* out: amount of bytes to skip */
};

After a successful setsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains
the number of bytes that were mapped, and @recv_skip_hint contains the number
of bytes that should be read using conventional read()/recv()/recvmsg() system
calls, to skip a sequence of bytes that cannot be mapped because it is not
properly page aligned.
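
How a caller might act on the two output fields can be sketched as a
small pure helper. This is only an illustration under stated assumptions:
struct zc_result is a local mirror of struct tcp_zerocopy_receive, and
struct zc_plan and zc_plan_next() are hypothetical names that do not
exist in the kernel or in tcp_mmap.c. The helper clamps the fallback
read to the size of the bounce buffer, which is a choice made for this
sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of struct tcp_zerocopy_receive from the patch. */
struct zc_result {
	uint64_t address;		/* in: address of mapping */
	uint32_t length;		/* in/out: bytes to map / mapped */
	uint32_t recv_skip_hint;	/* out: bytes to read the usual way */
};

/* Hypothetical plan for one iteration of the receive loop. */
struct zc_plan {
	uint32_t consume_mapped;	/* bytes readable through the mapping */
	uint32_t read_fallback;		/* bytes to fetch with read()/recv() */
};

struct zc_plan zc_plan_next(const struct zc_result *zc, uint32_t chunk_size)
{
	struct zc_plan plan;

	/* The kernel never maps more than was requested. */
	assert(zc->length <= chunk_size);

	plan.consume_mapped = zc->length;
	/* Never read more than the bounce buffer can hold. */
	plan.read_fallback = zc->recv_skip_hint < chunk_size ?
			     zc->recv_skip_hint : chunk_size;
	return plan;
}
```

A caller would first process consume_mapped bytes at the mapped address,
then issue a conventional read() of read_fallback bytes before the next
TCP_ZEROCOPY_RECEIVE call.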

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
---
 tools/testing/selftests/net/tcp_mmap.c | 63 +++++++++++++++-----------
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/net/tcp_mmap.c b/tools/testing/selftests/net/tcp_mmap.c
index dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..5b381cdbdd6319556ba4e3dad530fae8f13f5a9b 100644
--- a/tools/testing/selftests/net/tcp_mmap.c
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -76,9 +76,10 @@
 #include <time.h>
 #include <sys/time.h>
 #include <netinet/in.h>
-#include <netinet/tcp.h>
 #include <arpa/inet.h>
 #include <poll.h>
+#include <linux/tcp.h>
+#include <assert.h>
 
 #ifndef MSG_ZEROCOPY
 #define MSG_ZEROCOPY    0x4000000
@@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length)
 void *child_thread(void *arg)
 {
 	unsigned long total_mmap = 0, total = 0;
+	struct tcp_zerocopy_receive zc;
 	unsigned long delta_usec;
 	int flags = MAP_SHARED;
 	struct timeval t0, t1;
 	char *buffer = NULL;
-	void *oaddr = NULL;
+	void *addr = NULL;
 	double throughput;
 	struct rusage ru;
 	int lu, fd;
@@ -153,41 +155,45 @@ void *child_thread(void *arg)
 		perror("malloc");
 		goto error;
 	}
+	if (zflg) {
+		addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0);
+		if (addr == (void *)-1)
+			zflg = 0;
+	}
 	while (1) {
 		struct pollfd pfd = { .fd = fd, .events = POLLIN, };
 		int sub;
 
 		poll(&pfd, 1, 10000);
 		if (zflg) {
-			void *naddr;
+			int res;
 
-			naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 0);
-			if (naddr == (void *)-1) {
-				if (errno == EAGAIN) {
-					/* That is if SO_RCVLOWAT is buggy */
-					usleep(1000);
-					continue;
-				}
-				if (errno == EINVAL) {
-					flags = MAP_SHARED;
-					oaddr = NULL;
-					goto fallback;
-				}
-				if (errno != EIO)
-					perror("mmap()");
+			zc.address = (__u64)addr;
+			zc.length = chunk_size;
+			zc.recv_skip_hint = 0;
+			res = setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
+					 &zc, sizeof(zc));
+			if (res == -1)
 				break;
+
+			if (zc.length) {
+				assert(zc.length <= chunk_size);
+				total_mmap += zc.length;
+				if (xflg)
+					hash_zone(addr, zc.length);
+				total += zc.length;
 			}
-			total_mmap += chunk_size;
-			if (xflg)
-				hash_zone(naddr, chunk_size);
-			total += chunk_size;
-			if (!keepflag) {
-				flags |= MAP_FIXED;
-				oaddr = naddr;
+			if (zc.recv_skip_hint) {
+				assert(zc.recv_skip_hint <= chunk_size);
+				lu = read(fd, buffer, zc.recv_skip_hint);
+				if (lu > 0) {
+					if (xflg)
+						hash_zone(buffer, lu);
+					total += lu;
+				}
 			}
 			continue;
 		}
-fallback:
 		sub = 0;
 		while (sub < chunk_size) {
 			lu = read(fd, buffer + sub, chunk_size - sub);
@@ -228,6 +234,8 @@ void *child_thread(void *arg)
 error:
 	free(buffer);
 	close(fd);
+	if (zflg)
+		munmap(addr, chunk_size);
 	pthread_exit(0);
 }
 
@@ -371,7 +379,8 @@ int main(int argc, char *argv[])
 		setup_sockaddr(cfg_family, host, &listenaddr);
 
 		if (mss &&
-		    setsockopt(fdlisten, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
+		    setsockopt(fdlisten, IPPROTO_TCP, TCP_MAXSEG,
+			       &mss, sizeof(mss)) == -1) {
 			perror("setsockopt TCP_MAXSEG");
 			exit(1);
 		}
@@ -402,7 +411,7 @@ int main(int argc, char *argv[])
 	setup_sockaddr(cfg_family, host, &addr);
 
 	if (mss &&
-	    setsockopt(fd, SOL_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
+	    setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == -1) {
 		perror("setsockopt TCP_MAXSEG");
 		exit(1);
 	}
-- 
2.17.0.441.gb46fe60e1d-goog


* Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive
  2018-04-25 21:43 [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive Eric Dumazet
  2018-04-25 21:43 ` [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet
  2018-04-25 21:43 ` [PATCH v2 net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE Eric Dumazet
@ 2018-04-26  1:20 ` Soheil Hassas Yeganeh
  2018-04-26 14:56   ` Eric Dumazet
  2 siblings, 1 reply; 13+ messages in thread
From: Soheil Hassas Yeganeh @ 2018-04-26  1:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Andy Lutomirski, linux-kernel,
	linux-mm, Eric Dumazet

On Wed, Apr 25, 2018 at 5:43 PM, Eric Dumazet <edumazet@google.com> wrote:
> syzbot reported a lockdep issue caused by tcp mmap() support.
>
> I implemented Andy Lutomirski nice suggestions to resolve the
> issue and increase scalability as well.
>
> First patch is adding a new setsockopt() operation and changes mmap()
> behavior.
>
> Second patch changes tcp_mmap reference program.
>
> v2:
>  Added a missing page align of zc->length in tcp_zerocopy_receive()
>  Properly clear zc->recv_skip_hint in case user request was completed.

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thank you Eric for the nice redesign!

> Eric Dumazet (2):
>   tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
>   selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE
>
>  include/uapi/linux/tcp.h               |   8 ++
>  net/ipv4/tcp.c                         | 189 +++++++++++++------------
>  tools/testing/selftests/net/tcp_mmap.c |  63 +++++----
>  3 files changed, 142 insertions(+), 118 deletions(-)
>
> --
> 2.17.0.441.gb46fe60e1d-goog
>


* Re: [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  2018-04-25 21:43 ` [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet
@ 2018-04-26 13:40   ` Ka-Cheong Poon
  2018-04-26 13:47       ` Eric Dumazet
  2018-04-27  8:44     ` kbuild test robot
  1 sibling, 1 reply; 13+ messages in thread
From: Ka-Cheong Poon @ 2018-04-26 13:40 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller
  Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Eric Dumazet,
	Soheil Hassas Yeganeh

On 04/26/2018 05:43 AM, Eric Dumazet wrote:
> When adding tcp mmap() implementation, I forgot that socket lock
> had to be taken before current->mm->mmap_sem. syzbot eventually caught
> the bug.
> 
> Since we can not lock the socket in tcp mmap() handler we have to
> split the operation in two phases.
> 
> 1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
>    This operation does not involve any TCP locking.
> 
> 2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
>   the transfer of pages from skbs to one VMA.
>    This operation only uses down_read(&current->mm->mmap_sem) after
>    holding TCP lock, thus solving the lockdep issue.


A quick question.  Is it a normal practice to return a result
in setsockopt() given that the optval parameter is supposed to
be a const void *?




-- 
K. Poon
ka-cheong.poon@oracle.com


* Re: [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  2018-04-26 13:40   ` Ka-Cheong Poon
@ 2018-04-26 13:47       ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2018-04-26 13:47 UTC (permalink / raw)
  To: Ka-Cheong Poon, Eric Dumazet, David S . Miller
  Cc: netdev, Andy Lutomirski, linux-kernel, linux-mm, Soheil Hassas Yeganeh



On 04/26/2018 06:40 AM, Ka-Cheong Poon wrote:

> A quick question.  Is it a normal practice to return a result
> in setsockopt() given that the optval parameter is supposed to
> be a const void *?

Very good question.

Andy suggested an ioctl() or setsockopt(), and I chose setsockopt(), but it looks
like getsockopt() would indeed have been a better choice.

This might even allow future changes in "struct tcp_zerocopy_receive".

Willem suggested adding code in tcp_recvmsg(), but I prefer not to bloat this
already too complex function.

I will send a v3 using getsockopt() then, thanks !
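
For illustration only, a getsockopt()-based call could look like the
sketch below. It assumes v3 keeps the struct layout and option number
from this patch and only moves the call from setsockopt() to
getsockopt(); v3 itself is not part of this thread. struct zc_request
is a local mirror of struct tcp_zerocopy_receive, and zc_receive() is a
hypothetical helper name.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef TCP_ZEROCOPY_RECEIVE
#define TCP_ZEROCOPY_RECEIVE 35	/* value from this patch's uapi header */
#endif

/* Local mirror of struct tcp_zerocopy_receive from the patch. */
struct zc_request {
	uint64_t address;		/* in: address of mapping */
	uint32_t length;		/* in/out: bytes to map / mapped */
	uint32_t recv_skip_hint;	/* out: bytes to read the usual way */
};

/* Hypothetical helper: returns 0 on success or a negative errno value. */
int zc_receive(int fd, void *addr, uint32_t length, struct zc_request *zc)
{
	socklen_t zc_len = sizeof(*zc);

	memset(zc, 0, sizeof(*zc));
	zc->address = (uint64_t)(uintptr_t)addr;
	zc->length = length;

	/* optval is writable here, so the kernel can return results in it. */
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, zc, &zc_len) == -1)
		return -errno;
	return 0;
}
```

Because getsockopt() takes a non-const optval and an in/out length, the
const void * concern raised above does not apply to this form.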


* Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive
  2018-04-26  1:20 ` [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive Soheil Hassas Yeganeh
@ 2018-04-26 14:56   ` Eric Dumazet
  2018-04-26 21:16     ` Andy Lutomirski
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2018-04-26 14:56 UTC (permalink / raw)
  To: Soheil Hassas Yeganeh, Eric Dumazet
  Cc: David S . Miller, netdev, Andy Lutomirski, linux-kernel, linux-mm



On 04/25/2018 06:20 PM, Soheil Hassas Yeganeh wrote:
> 
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> 
>

Thanks Soheil for reviewing.

I have changed setsockopt() to getsockopt(), so I chose not to carry your Acked-by.

Please add it back if you agree, thanks !


* Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive
  2018-04-26 14:56   ` Eric Dumazet
@ 2018-04-26 21:16     ` Andy Lutomirski
  2018-04-26 21:40       ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Andy Lutomirski @ 2018-04-26 21:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Soheil Hassas Yeganeh, Eric Dumazet, David S. Miller,
	Network Development, Andrew Lutomirski, LKML, Linux-MM

At the risk of further muddying the waters, there's another minor tweak
that could improve performance on certain workloads.  Currently you mmap()
a range for a given socket and then getsockopt() to receive.  If you made
it so you could mmap() something once for any number of sockets (by
mmapping /dev/misc/tcp_zero_receive or whatever), then the performance of
the getsockopt() bit would be identical, but you could release the mapping
for many sockets at once with only a single flush.  For some use cases,
this could be a big win.

You could also add this later easily enough, too.


* Re: [PATCH v2 net-next 0/2] tcp: mmap: rework zerocopy receive
  2018-04-26 21:16     ` Andy Lutomirski
@ 2018-04-26 21:40       ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2018-04-26 21:40 UTC (permalink / raw)
  To: Andy Lutomirski, Eric Dumazet
  Cc: Soheil Hassas Yeganeh, Eric Dumazet, David S. Miller,
	Network Development, LKML, Linux-MM



On 04/26/2018 02:16 PM, Andy Lutomirski wrote:
> At the risk of further muddying the waters, there's another minor tweak
> that could improve performance on certain workloads.  Currently you mmap()
> a range for a given socket and then getsockopt() to receive.  If you made
> it so you could mmap() something once for any number of sockets (by
> mmapping /dev/misc/tcp_zero_receive or whatever), then the performance of
> the getsockopt() bit would be identical, but you could release the mapping
> for many sockets at once with only a single flush.  For some use cases,
> this could be a big win.
> 
> You could also add this later easily enough, too.
> 

I believe I implemented what you just described.

The getsockopt() call checks that the VMA was created by an mmap() on a TCP socket.

It does not check that the VMA was created by mmap() on the same socket,
because we do not really need this extra check.

So you presumably could use mmap() to grab 1GB of virtual space, then split it
as you wish for different sockets.
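
The splitting could be done with plain address arithmetic. The sketch
below is hypothetical: zc_slice() and its parameters are made-up names,
it assumes equal-sized slices and a power-of-two page size, and it only
computes addresses; it does not call mmap() or touch any socket.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: carve slice number idx out of one reserved
 * region so that every socket gets a page-aligned window.
 * Returns 0 on success, -1 if the arguments cannot produce a slice.
 */
int zc_slice(uint64_t base, uint64_t total, uint32_t nr_socks, uint32_t idx,
	     uint64_t page_size, uint64_t *addr, uint64_t *len)
{
	uint64_t per_sock;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* The base must be page aligned, as the kernel checks zc->address. */
	if (nr_socks == 0 || idx >= nr_socks || (base & (page_size - 1)))
		return -1;

	/* Round each share down to a whole number of pages. */
	per_sock = (total / nr_socks) & ~(page_size - 1);
	if (per_sock == 0)
		return -1;

	*addr = base + (uint64_t)idx * per_sock;
	*len = per_sock;
	return 0;
}
```

Each socket would then pass its own addr and len in the zerocopy
request, while the whole region can still be released with one munmap().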

Thanks.


* Re: [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  2018-04-25 21:43 ` [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for " Eric Dumazet
@ 2018-04-27  8:44     ` kbuild test robot
  2018-04-27  8:44     ` kbuild test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2018-04-27  8:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kbuild-all, David S . Miller, netdev, Andy Lutomirski,
	linux-kernel, linux-mm, Eric Dumazet, Eric Dumazet,
	Soheil Hassas Yeganeh

[-- Attachment #1: Type: text/plain, Size: 895 bytes --]

Hi Eric,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-add-TCP_ZEROCOPY_RECEIVE-support-for-zerocopy-receive/20180427-122234
config: sh-rsk7269_defconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sh 

All errors (new ones prefixed by >>):

   net/ipv4/tcp.o: In function `tcp_setsockopt':
>> tcp.c:(.text+0x3f80): undefined reference to `zap_page_range'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 10487 bytes --]


* Re: [PATCH v2 net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  2018-04-27  8:44     ` kbuild test robot
  (?)
@ 2018-04-27 13:03     ` Eric Dumazet
  -1 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2018-04-27 13:03 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, David Miller, netdev, Andy Lutomirski, LKML,
	linux-mm, Eric Dumazet, Soheil Hassas Yeganeh

On Fri, Apr 27, 2018 at 1:45 AM kbuild test robot <lkp@intel.com> wrote:

> Hi Eric,

> Thank you for the patch! Yet something to improve:

> [auto build test ERROR on net-next/master]

> url:    https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-add-TCP_ZEROCOPY_RECEIVE-support-for-zerocopy-receive/20180427-122234
> config: sh-rsk7269_defconfig (attached as .config)
> compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>          wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>          chmod +x ~/bin/make.cross
>          # save the attached .config to linux build tree
>          make.cross ARCH=sh

> All errors (new ones prefixed by >>):

>     net/ipv4/tcp.o: In function `tcp_setsockopt':
> >> tcp.c:(.text+0x3f80): undefined reference to `zap_page_range'

I guess this tcp zerocopy stuff depends on CONFIG_MMU

Thanks.

