* [PATCH net-next 0/4] net: af_unix: zerocopy stream bits
@ 2015-05-20 15:35 Hannes Frederic Sowa
  2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
  To: netdev

This series implements zerocopy support for AF_UNIX SOCK_STREAM sockets.

Hannes Frederic Sowa (4):
  net: skbuff: add skb_append_pagefrags and use it
  net: af_unix: implement stream sendpage support
  net: make skb_splice_bits more configureable
  net: af_unix: implement splice for stream af_unix sockets

 include/linux/skbuff.h |  14 ++-
 net/core/skbuff.c      |  63 +++++++++----
 net/ipv4/ip_output.c   |   8 +-
 net/ipv4/tcp.c         |   5 +-
 net/unix/af_unix.c     | 246 ++++++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 287 insertions(+), 49 deletions(-)

-- 
2.1.0


* [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it
  2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
  2015-05-20 17:51   ` Cong Wang
  2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
  To: netdev

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/skbuff.h |  3 +++
 net/core/skbuff.c      | 18 ++++++++++++++++++
 net/ipv4/ip_output.c   |  8 ++------
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 40960fe..b9d267b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -860,6 +860,9 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 					int len, int odd, struct sk_buff *skb),
 			    void *from, int length);
 
+int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
+			 int offset, size_t size);
+
 struct skb_seq_state {
 	__u32		lower_offset;
 	__u32		upper_offset;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3fe9bd..1d3f88a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2915,6 +2915,24 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(skb_append_datato_frags);
 
+int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
+			 int offset, size_t size)
+{
+	int i = skb_shinfo(skb)->nr_frags;
+
+	if (skb_can_coalesce(skb, i, page, offset)) {
+		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
+	} else if (i < MAX_SKB_FRAGS) {
+		get_page(page);
+		skb_fill_page_desc(skb, i, page, offset, size);
+	} else {
+		return -EMSGSIZE;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(skb_append_pagefrags);
+
 /**
  *	skb_pull_rcsum - pull skb and update receive checksum
  *	@skb: buffer to update
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8d91b92..35ff40f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1292,12 +1292,8 @@ ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
 		i = skb_shinfo(skb)->nr_frags;
 		if (len > size)
 			len = size;
-		if (skb_can_coalesce(skb, i, page, offset)) {
-			skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
-		} else if (i < MAX_SKB_FRAGS) {
-			get_page(page);
-			skb_fill_page_desc(skb, i, page, offset, len);
-		} else {
+
+		if (skb_append_pagefrags(skb, page, offset, len)) {
 			err = -EMSGSIZE;
 			goto error;
 		}
-- 
2.1.0
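
Editorial sketch, not part of the patch: the coalesce-or-append decision that skb_append_pagefrags() factors out can be modelled in plain userspace C. All names below (buf_model, frag_model, MODEL_MAX_FRAGS, append_pagefrag) are hypothetical stand-ins, and the model omits the page reference that the kernel helper takes with get_page().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_FRAGS 4 /* hypothetical small stand-in for MAX_SKB_FRAGS */

struct frag_model {
	const void *page; /* identity of the backing page */
	int offset;
	size_t size;
};

struct buf_model {
	int nr_frags;
	struct frag_model frags[MODEL_MAX_FRAGS];
};

/* New data can be merged if it continues the last fragment on the same page. */
static int can_coalesce(const struct buf_model *b, const void *page, int offset)
{
	const struct frag_model *last;

	if (b->nr_frags == 0)
		return 0;
	last = &b->frags[b->nr_frags - 1];
	return last->page == page && last->offset + (int)last->size == offset;
}

/* Returns 0 on success, -EMSGSIZE when no fragment slot is left. */
static int append_pagefrag(struct buf_model *b, const void *page,
			   int offset, size_t size)
{
	int i = b->nr_frags;

	assert(i >= 0 && i <= MODEL_MAX_FRAGS);
	if (can_coalesce(b, page, offset)) {
		/* Grow the last fragment instead of using a new slot. */
		b->frags[i - 1].size += size;
	} else if (i < MODEL_MAX_FRAGS) {
		b->frags[i].page = page;
		b->frags[i].offset = offset;
		b->frags[i].size = size;
		b->nr_frags = i + 1;
	} else {
		return -EMSGSIZE;
	}
	return 0;
}
```

Note that a full buffer can still accept data as long as it is contiguous with the last fragment, which is why the coalesce check comes before the slot-count check.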


* [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
  2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
  2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
  2015-05-20 18:40   ` Cong Wang
  2015-05-20 23:21   ` Eric Dumazet
  2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
  2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
  3 siblings, 2 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
  To: netdev

This patch implements sendpage support for AF_UNIX SOCK_STREAM
sockets. This is also required for a complete splice implementation.

The implementation is a bit tricky because we append to already existing
skbs and so have to hold unix_sk->readlock to protect the reading side
from dropping the tail of the sk_receive_queue.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/unix/af_unix.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 104 insertions(+), 1 deletion(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 941b3d2..9bb880a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -518,6 +518,8 @@ static int unix_ioctl(struct socket *, unsigned int, unsigned long);
 static int unix_shutdown(struct socket *, int);
 static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
 static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
+static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
+				    size_t size, int flags);
 static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
 static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
 static int unix_dgram_connect(struct socket *, struct sockaddr *,
@@ -558,7 +560,7 @@ static const struct proto_ops unix_stream_ops = {
 	.sendmsg =	unix_stream_sendmsg,
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
-	.sendpage =	sock_no_sendpage,
+	.sendpage =	unix_stream_sendpage,
 	.set_peek_off =	unix_set_peek_off,
 };
 
@@ -1720,6 +1722,107 @@ out_err:
 	return sent ? : err;
 }
 
+static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
+				    int offset, size_t size, int flags)
+{
+	int err;
+	bool send_sigpipe;
+	struct sock *sk, *other;
+	struct sk_buff *skb, *newskb, *tail;
+
+	err = 0;
+	tail = NULL;
+	newskb = NULL;
+	sk = socket->sk;
+	send_sigpipe = true;
+
+	if (flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	other = unix_peer(sk);
+	if (!other || sk->sk_state != TCP_ESTABLISHED)
+		return -ENOTCONN;
+
+	if (false) {
+alloc_skb:
+		unix_state_unlock(other);
+		mutex_unlock(&unix_sk(other)->readlock);
+		newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
+					      &err, 0);
+		if (!newskb)
+			return err;
+	}
+
+	/* we must acquire readlock as we modify already present
+	 * skbs in the sk_receive_queue and mess with skb->len
+	 */
+	err = mutex_lock_interruptible(&unix_sk(other)->readlock);
+	if (err) {
+		err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
+		send_sigpipe = false;
+		goto err;
+	}
+
+	if (sk->sk_shutdown & SEND_SHUTDOWN) {
+		err = -EPIPE;
+		goto err_unlock;
+	}
+
+	unix_state_lock(other);
+
+	if (sock_flag(other, SOCK_DEAD) ||
+	    other->sk_shutdown & RCV_SHUTDOWN) {
+		err = -EPIPE;
+		goto err_state_unlock;
+	}
+
+	skb = skb_peek_tail(&other->sk_receive_queue);
+	if (tail && tail == skb) {
+		skb = newskb;
+	} else if (!skb) {
+		if (newskb)
+			skb = newskb;
+		else
+			goto alloc_skb;
+	} else if (newskb) {
+		/* In the fast path newskb is NULL and this branch is
+		 * skipped. Calling consume_skb() with a NULL pointer
+		 * would do no harm, but the check avoids the call.
+		 */
+		consume_skb(newskb);
+	}
+
+	if (skb_append_pagefrags(skb, page, offset, size)) {
+		tail = skb;
+		goto alloc_skb;
+	}
+
+	skb->len += size;
+	skb->data_len += size;
+	skb->truesize += size;
+	atomic_add(size, &sk->sk_wmem_alloc);
+
+	if (newskb)
+		skb_queue_tail(&other->sk_receive_queue, newskb);
+
+	unix_state_unlock(other);
+	mutex_unlock(&unix_sk(other)->readlock);
+
+	other->sk_data_ready(other);
+
+	return size;
+
+err_state_unlock:
+	unix_state_unlock(other);
+err_unlock:
+	mutex_unlock(&unix_sk(other)->readlock);
+err:
+	kfree_skb(newskb);
+	if (send_sigpipe && !(flags & MSG_NOSIGNAL))
+		send_sig(SIGPIPE, current, 0);
+	return err;
+}
+
 static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
 				  size_t len)
 {
-- 
2.1.0
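
Editorial sketch, not part of the patch: the control flow of the sendpage path (append to the tail buffer when it has room, otherwise allocate a fresh buffer outside the lock and re-check the tail) can be modelled in single-threaded userspace C. Every name here (buf_model, queue_model, queue_send) is hypothetical, byte capacity stands in for fragment slots, and the locks appear only as comments. The sketch clears its spare pointer after releasing it so that the final queueing step can never see a stale buffer.

```c
#include <assert.h>
#include <stdlib.h>

struct buf_model {
	int used, cap;
	struct buf_model *next;
};

struct queue_model {
	struct buf_model *head, *tail;
	int nbufs;
};

static void queue_tail(struct queue_model *q, struct buf_model *b)
{
	if (q->tail)
		q->tail->next = b;
	else
		q->head = b;
	q->tail = b;
	q->nbufs++;
}

/* Returns size on success, -1 if allocation fails. */
static int queue_send(struct queue_model *q, int size, int cap)
{
	struct buf_model *b, *fresh = NULL, *full = NULL;

	assert(size > 0 && size <= cap);
	for (;;) {
		/* The real code takes the reader lock here. */
		b = q->tail;
		if (full && b == full) {
			b = fresh;	/* tail is still the full buffer */
		} else if (!b) {
			b = fresh;	/* empty queue; fresh may be NULL */
		} else if (fresh) {
			/* Tail changed while unlocked: drop the spare.
			 * Unreachable in this single-threaded model.
			 */
			free(fresh);
			fresh = NULL;
		}
		if (b && b->used + size <= b->cap)
			break;

		/* The real code drops its locks before allocating. */
		full = b;
		assert(!fresh);
		fresh = calloc(1, sizeof(*fresh));
		if (!fresh)
			return -1;
		fresh->cap = cap;
	}
	b->used += size;
	if (fresh)		/* here b == fresh: publish the new buffer */
		queue_tail(q, fresh);
	return size;
}

static void queue_free(struct queue_model *q)
{
	while (q->head) {
		struct buf_model *b = q->head;

		q->head = b->next;
		free(b);
		q->nbufs--;
	}
	q->tail = NULL;
}
```

The loop replaces the patch's backward goto into an `if (false)` block; the two forms are intended to be equivalent in this model, and the loop is only a readability choice.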


* [PATCH net-next 3/4] net: make skb_splice_bits more configureable
  2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
  2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
  2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
  2015-05-20 23:43   ` Eric Dumazet
  2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
  3 siblings, 1 reply; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
  To: netdev

Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.

AF_UNIX sockets don't use lock_sock/release_sock, so we have to
use a callback to make the locking and unlocking configurable.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/skbuff.h | 11 +++++++++--
 net/core/skbuff.c      | 45 ++++++++++++++++++++++++++++-----------------
 net/ipv4/tcp.c         |  5 +++--
 3 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b9d267b..895435c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -35,6 +35,7 @@
 #include <linux/netdev_features.h>
 #include <linux/sched.h>
 #include <net/flow_dissector.h>
+#include <linux/splice.h>
 
 /* A. Checksumming of received packets by device.
  *
@@ -2698,9 +2699,15 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
 int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
 			      int len, __wsum csum);
-int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
+ssize_t skb_socket_splice(struct sock *sk,
+			  struct pipe_inode_info *pipe,
+			  struct splice_pipe_desc *spd);
+int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int len,
-		    unsigned int flags);
+		    unsigned int flags,
+		    ssize_t (*splice_cb)(struct sock *,
+					 struct pipe_inode_info *,
+					 struct splice_pipe_desc *));
 void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
 int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1d3f88a..1fc76a9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1870,15 +1870,39 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
 	return false;
 }
 
+ssize_t skb_socket_splice(struct sock *sk,
+			  struct pipe_inode_info *pipe,
+			  struct splice_pipe_desc *spd)
+{
+	int ret;
+
+	/* Drop the socket lock, otherwise we have reverse
+	 * locking dependencies between sk_lock and i_mutex
+	 * here as compared to sendfile(). We enter here
+	 * with the socket lock held, and splice_to_pipe() will
+	 * grab the pipe inode lock. For sendfile() emulation,
+	 * we call into ->sendpage() with the i_mutex lock held
+	 * and networking will grab the socket lock.
+	 */
+	release_sock(sk);
+	ret = splice_to_pipe(pipe, spd);
+	lock_sock(sk);
+
+	return ret;
+}
+
 /*
  * Map data from the skb to a pipe. Should handle both the linear part,
  * the fragments, and the frag list. It does NOT handle frag lists within
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
-		    unsigned int flags)
+		    unsigned int flags,
+		    ssize_t (*splice_cb)(struct sock *,
+					 struct pipe_inode_info *,
+					 struct splice_pipe_desc *))
 {
 	struct partial_page partial[MAX_SKB_FRAGS];
 	struct page *pages[MAX_SKB_FRAGS];
@@ -1891,7 +1915,6 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		.spd_release = sock_spd_release,
 	};
 	struct sk_buff *frag_iter;
-	struct sock *sk = skb->sk;
 	int ret = 0;
 
 	/*
@@ -1914,20 +1937,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 	}
 
 done:
-	if (spd.nr_pages) {
-		/*
-		 * Drop the socket lock, otherwise we have reverse
-		 * locking dependencies between sk_lock and i_mutex
-		 * here as compared to sendfile(). We enter here
-		 * with the socket lock held, and splice_to_pipe() will
-		 * grab the pipe inode lock. For sendfile() emulation,
-		 * we call into ->sendpage() with the i_mutex lock held
-		 * and networking will grab the socket lock.
-		 */
-		release_sock(sk);
-		ret = splice_to_pipe(pipe, &spd);
-		lock_sock(sk);
-	}
+	if (spd.nr_pages)
+		ret = splice_cb(sk, pipe, &spd);
 
 	return ret;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bb9bb84..67f0a80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -694,8 +694,9 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
 	struct tcp_splice_state *tss = rd_desc->arg.data;
 	int ret;
 
-	ret = skb_splice_bits(skb, offset, tss->pipe, min(rd_desc->count, len),
-			      tss->flags);
+	ret = skb_splice_bits(skb, skb->sk, offset, tss->pipe,
+			      min(rd_desc->count, len), tss->flags,
+			      skb_socket_splice);
 	if (ret > 0)
 		rd_desc->count -= ret;
 	return ret;
-- 
2.1.0
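
Editorial sketch, not part of the patch: the idea of passing the "drop the lock, write to the pipe, retake the lock" step as a callback can be shown with a small userspace model. The types and functions below (sock_model, pipe_model, socket_lock_cb, readlock_cb, splice_bits_model) are hypothetical, and the locks are plain flags so the example stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sock_model {
	bool sk_lock_held;	/* stands in for lock_sock()/release_sock() */
	bool readlock_held;	/* stands in for the AF_UNIX readlock mutex */
};

struct pipe_model {
	char data[32];
	size_t len;
};

typedef long (*splice_cb_t)(struct sock_model *, struct pipe_model *,
			    const char *, size_t);

/* Stand-in for splice_to_pipe(): may block in the real kernel. */
static long pipe_write(struct pipe_model *p, const char *src, size_t n)
{
	size_t room = sizeof(p->data) - p->len;

	if (n > room)
		n = room;
	memcpy(p->data + p->len, src, n);
	p->len += n;
	return (long)n;
}

/* TCP-style callback: release the socket lock around the pipe write. */
static long socket_lock_cb(struct sock_model *sk, struct pipe_model *pipe,
			   const char *src, size_t n)
{
	long ret;

	assert(sk->sk_lock_held);
	sk->sk_lock_held = false;
	ret = pipe_write(pipe, src, n);
	sk->sk_lock_held = true;
	return ret;
}

/* AF_UNIX-style callback: release the readlock instead. */
static long readlock_cb(struct sock_model *sk, struct pipe_model *pipe,
			const char *src, size_t n)
{
	long ret;

	assert(sk->readlock_held);
	sk->readlock_held = false;
	ret = pipe_write(pipe, src, n);
	sk->readlock_held = true;
	return ret;
}

/* Generic part: knows nothing about which lock the caller holds. */
static long splice_bits_model(struct sock_model *sk, struct pipe_model *pipe,
			      const char *src, size_t n, splice_cb_t cb)
{
	if (n == 0)
		return 0;
	return cb(sk, pipe, src, n);
}
```

The generic function stays identical for both socket families; only the callback encodes the locking discipline.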


* [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
  2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
                   ` (2 preceding siblings ...)
  2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
  2015-05-20 20:59   ` Cong Wang
  2015-05-20 23:50   ` Eric Dumazet
  3 siblings, 2 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
  To: netdev

unix_stream_recvmsg is refactored into unix_stream_read_generic in this
patch and enhanced to deal with pipe splicing. The refactoring is
negligible; we mostly have to deal with the struct msghdr argument not
being present in the splice case.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/unix/af_unix.c | 141 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 120 insertions(+), 21 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9bb880a..d2d3ebf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -520,6 +520,9 @@ static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
 static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
 static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
 				    size_t size, int flags);
+static ssize_t unix_stream_splice_read(struct socket *,  loff_t *ppos,
+				       struct pipe_inode_info *, size_t size,
+				       unsigned int flags);
 static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
 static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
 static int unix_dgram_connect(struct socket *, struct sockaddr *,
@@ -561,6 +564,7 @@ static const struct proto_ops unix_stream_ops = {
 	.recvmsg =	unix_stream_recvmsg,
 	.mmap =		sock_no_mmap,
 	.sendpage =	unix_stream_sendpage,
+	.splice_read =	unix_stream_splice_read,
 	.set_peek_off =	unix_set_peek_off,
 };
 
@@ -1963,8 +1967,9 @@ out:
  *	Sleep until more data has arrived. But check for races..
  */
 static long unix_stream_data_wait(struct sock *sk, long timeo,
-				  struct sk_buff *last)
+				  struct sk_buff *last, unsigned int last_len)
 {
+	struct sk_buff *tail;
 	DEFINE_WAIT(wait);
 
 	unix_state_lock(sk);
@@ -1972,7 +1977,9 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 	for (;;) {
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
-		if (skb_peek_tail(&sk->sk_receive_queue) != last ||
+		tail = skb_peek_tail(&sk->sk_receive_queue);
+		if (tail != last ||
+		    (tail && tail->len != last_len) ||
 		    sk->sk_err ||
 		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
 		    signal_pending(current) ||
@@ -1996,38 +2003,51 @@ static unsigned int unix_skb_len(const struct sk_buff *skb)
 	return skb->len - UNIXCB(skb).consumed;
 }
 
-static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
-			       size_t size, int flags)
+struct unix_stream_read_state {
+	int (*recv_actor)(struct sk_buff *, int, int,
+			  struct unix_stream_read_state *);
+	struct socket *socket;
+	struct msghdr *msg;
+	struct pipe_inode_info *pipe;
+	size_t size;
+	int flags;
+	unsigned int splice_flags;
+};
+
+static __always_inline
+int unix_stream_read_generic(struct unix_stream_read_state *state)
 {
 	struct scm_cookie scm;
+	struct socket *sock = state->socket;
 	struct sock *sk = sock->sk;
 	struct unix_sock *u = unix_sk(sk);
-	DECLARE_SOCKADDR(struct sockaddr_un *, sunaddr, msg->msg_name);
 	int copied = 0;
+	int flags = state->flags;
 	int noblock = flags & MSG_DONTWAIT;
-	int check_creds = 0;
+	bool check_creds = false;
 	int target;
 	int err = 0;
 	long timeo;
 	int skip;
+	size_t size = state->size;
+	unsigned int last_len;
 
 	err = -EINVAL;
 	if (sk->sk_state != TCP_ESTABLISHED)
 		goto out;
 
 	err = -EOPNOTSUPP;
-	if (flags&MSG_OOB)
+	if (flags & MSG_OOB)
 		goto out;
 
-	target = sock_rcvlowat(sk, flags&MSG_WAITALL, size);
+	target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
 	timeo = sock_rcvtimeo(sk, noblock);
 
+	memset(&scm, 0, sizeof(scm));
+
 	/* Lock the socket to prevent queue disordering
 	 * while sleeps in memcpy_tomsg
 	 */
-
-	memset(&scm, 0, sizeof(scm));
-
 	err = mutex_lock_interruptible(&u->readlock);
 	if (unlikely(err)) {
 		/* recvmsg() in non blocking mode is supposed to return -EAGAIN
@@ -2043,6 +2063,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
 
 		unix_state_lock(sk);
 		last = skb = skb_peek(&sk->sk_receive_queue);
+		last_len = last ? last->len : 0;
 again:
 		if (skb == NULL) {
 			unix_sk(sk)->recursion_level = 0;
@@ -2065,16 +2086,17 @@ again:
 				break;
 			mutex_unlock(&u->readlock);
 
-			timeo = unix_stream_data_wait(sk, timeo, last);
+			timeo = unix_stream_data_wait(sk, timeo, last,
+						      last_len);
 
-			if (signal_pending(current)
-			    ||  mutex_lock_interruptible(&u->readlock)) {
+			if (signal_pending(current) ||
+			    mutex_lock_interruptible(&u->readlock)) {
 				err = sock_intr_errno(timeo);
 				goto out;
 			}
 
 			continue;
- unlock:
+unlock:
 			unix_state_unlock(sk);
 			break;
 		}
@@ -2083,6 +2105,7 @@ again:
 		while (skip >= unix_skb_len(skb)) {
 			skip -= unix_skb_len(skb);
 			last = skb;
+			last_len = skb->len;
 			skb = skb_peek_next(skb, &sk->sk_receive_queue);
 			if (!skb)
 				goto again;
@@ -2099,18 +2122,20 @@ again:
 		} else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
 			/* Copy credentials */
 			scm_set_cred(&scm, UNIXCB(skb).pid, UNIXCB(skb).uid, UNIXCB(skb).gid);
-			check_creds = 1;
+			check_creds = true;
 		}
 
 		/* Copy address just once */
-		if (sunaddr) {
-			unix_copy_addr(msg, skb->sk);
+		if (state->msg && state->msg->msg_name) {
+			DECLARE_SOCKADDR(struct sockaddr_un *, sunaddr,
+					 state->msg->msg_name);
+			unix_copy_addr(state->msg, skb->sk);
 			sunaddr = NULL;
 		}
 
 		chunk = min_t(unsigned int, unix_skb_len(skb) - skip, size);
-		if (skb_copy_datagram_msg(skb, UNIXCB(skb).consumed + skip,
-					  msg, chunk)) {
+		chunk = state->recv_actor(skb, skip, chunk, state);
+		if (chunk < 0) {
 			if (copied == 0)
 				copied = -EFAULT;
 			break;
@@ -2148,11 +2173,85 @@ again:
 	} while (size);
 
 	mutex_unlock(&u->readlock);
-	scm_recv(sock, msg, &scm, flags);
+	if (state->msg)
+		scm_recv(sock, state->msg, &scm, flags);
+	else
+		scm_destroy(&scm);
 out:
 	return copied ? : err;
 }
 
+static int unix_stream_read_actor(struct sk_buff *skb,
+				  int skip, int chunk,
+				  struct unix_stream_read_state *state)
+{
+	int ret;
+
+	ret = skb_copy_datagram_msg(skb, UNIXCB(skb).consumed + skip,
+				    state->msg, chunk);
+	return ret ?: chunk;
+}
+
+static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
+			       size_t size, int flags)
+{
+	struct unix_stream_read_state state = {
+		.recv_actor = unix_stream_read_actor,
+		.socket = sock,
+		.msg = msg,
+		.size = size,
+		.flags = flags
+	};
+
+	return unix_stream_read_generic(&state);
+}
+
+static ssize_t skb_unix_socket_splice(struct sock *sk,
+				      struct pipe_inode_info *pipe,
+				      struct splice_pipe_desc *spd)
+{
+	int ret;
+	struct unix_sock *u = unix_sk(sk);
+
+	mutex_unlock(&u->readlock);
+	ret = splice_to_pipe(pipe, spd);
+	mutex_lock(&u->readlock);
+
+	return ret;
+}
+
+static int unix_stream_splice_actor(struct sk_buff *skb,
+				    int skip, int chunk,
+				    struct unix_stream_read_state *state)
+{
+	return skb_splice_bits(skb, state->socket->sk,
+			       UNIXCB(skb).consumed + skip,
+			       state->pipe, chunk, state->splice_flags,
+			       skb_unix_socket_splice);
+}
+
+static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
+				       struct pipe_inode_info *pipe,
+				       size_t size, unsigned int flags)
+{
+	struct unix_stream_read_state state = {
+		.recv_actor = unix_stream_splice_actor,
+		.socket = sock,
+		.pipe = pipe,
+		.size = size,
+		.splice_flags = flags,
+	};
+
+	if (unlikely(*ppos))
+		return -ESPIPE;
+
+	if (sock->file->f_flags & O_NONBLOCK ||
+	    flags & SPLICE_F_NONBLOCK)
+		state.flags = MSG_DONTWAIT;
+
+	return unix_stream_read_generic(&state);
+}
+
 static int unix_shutdown(struct socket *sock, int mode)
 {
 	struct sock *sk = sock->sk;
-- 
2.1.0
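
Editorial sketch, not part of the patch: the refactoring keeps one generic read loop and lets a per-caller "actor" decide what to do with each chunk. The userspace model below uses hypothetical names (chunk_model, read_state, copy_actor, ref_actor) and a flat array instead of a receive queue; the reference-recording actor is only a loose analogue of splicing, since it hands out pointers instead of copying bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct chunk_model {
	const char *data;
	int len;
};

struct read_state {
	int (*recv_actor)(const struct chunk_model *, int skip, int n,
			  struct read_state *);
	char *dst;		/* used by copy_actor only */
	size_t dst_off;
	const char *refs[8];	/* used by ref_actor only */
	int nrefs;
	size_t size;		/* bytes still wanted */
};

/* Generic loop: walks the chunks and delegates the data handling. */
static int read_generic(struct read_state *st, const struct chunk_model *q,
			int nchunks)
{
	int copied = 0, i;

	assert(st->recv_actor);
	for (i = 0; i < nchunks && st->size > 0; i++) {
		int n = q[i].len;

		if ((size_t)n > st->size)
			n = (int)st->size;
		n = st->recv_actor(&q[i], 0, n, st);
		if (n < 0)
			return copied ? copied : n;
		copied += n;
		st->size -= (size_t)n;
	}
	return copied;
}

/* recvmsg-like actor: copy the bytes out. */
static int copy_actor(const struct chunk_model *c, int skip, int n,
		      struct read_state *st)
{
	memcpy(st->dst + st->dst_off, c->data + skip, (size_t)n);
	st->dst_off += (size_t)n;
	return n;
}

/* splice-like actor: record a reference, copy nothing. */
static int ref_actor(const struct chunk_model *c, int skip, int n,
		     struct read_state *st)
{
	if (st->nrefs >= 8)
		return -EMSGSIZE;
	st->refs[st->nrefs++] = c->data + skip;
	return n;
}

static int read_copy(const struct chunk_model *q, int nchunks,
		     char *dst, size_t size)
{
	struct read_state st = {
		.recv_actor = copy_actor, .dst = dst, .size = size,
	};

	return read_generic(&st, q, nchunks);
}

static int read_refs(const struct chunk_model *q, int nchunks,
		     size_t size, const char **first_ref)
{
	struct read_state st = { .recv_actor = ref_actor, .size = size };
	int ret = read_generic(&st, q, nchunks);

	*first_ref = st.nrefs ? st.refs[0] : NULL;
	return ret;
}
```

As in the patch, the state structure carries fields for both callers, and each actor touches only the ones it needs.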


* Re: [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it
  2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
@ 2015-05-20 17:51   ` Cong Wang
  0 siblings, 0 replies; 14+ messages in thread
From: Cong Wang @ 2015-05-20 17:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> +int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
> +                        int offset, size_t size)
> +{
> +       int i = skb_shinfo(skb)->nr_frags;
> +
> +       if (skb_can_coalesce(skb, i, page, offset)) {
> +               skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
> +       } else if (i < MAX_SKB_FRAGS) {
> +               get_page(page);
> +               skb_fill_page_desc(skb, i, page, offset, size);
> +       } else {
> +               return -EMSGSIZE;
> +       }
> +
> +       return 0;
> +}
> +EXPORT_SYMBOL(skb_append_pagefrags);
> +
>  /**
>   *     skb_pull_rcsum - pull skb and update receive checksum
>   *     @skb: buffer to update
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 8d91b92..35ff40f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1292,12 +1292,8 @@ ssize_t  ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
>                 i = skb_shinfo(skb)->nr_frags;
>                 if (len > size)
>                         len = size;
> -               if (skb_can_coalesce(skb, i, page, offset)) {
> -                       skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
> -               } else if (i < MAX_SKB_FRAGS) {
> -                       get_page(page);
> -                       skb_fill_page_desc(skb, i, page, offset, len);
> -               } else {
> +
> +               if (skb_append_pagefrags(skb, page, offset, len)) {
>                         err = -EMSGSIZE;
>                         goto error;
>                 }

The 'i' can be removed now.


* Re: [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
  2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
@ 2015-05-20 18:40   ` Cong Wang
  2015-05-20 23:21   ` Eric Dumazet
  1 sibling, 0 replies; 14+ messages in thread
From: Cong Wang @ 2015-05-20 18:40 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
>
> +static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
> +                                   int offset, size_t size, int flags)
> +{
> +       int err;
> +       bool send_sigpipe;
> +       struct sock *sk, *other;
> +       struct sk_buff *skb, *newskb, *tail;
> +
> +       err = 0;
> +       tail = NULL;
> +       newskb = NULL;
> +       sk = socket->sk;
> +       send_sigpipe = true;


Please fold these initializations into the declarations.


* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
  2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
@ 2015-05-20 20:59   ` Cong Wang
  2015-05-20 21:47     ` Hannes Frederic Sowa
  2015-05-20 23:50   ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Cong Wang @ 2015-05-20 20:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
>
> -static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
> -                              size_t size, int flags)
> +struct unix_stream_read_state {
> +       int (*recv_actor)(struct sk_buff *, int, int,
> +                         struct unix_stream_read_state *);
> +       struct socket *socket;
> +       struct msghdr *msg;
> +       struct pipe_inode_info *pipe;
> +       size_t size;
> +       int flags;
> +       unsigned int splice_flags;
> +};
> +
> +static __always_inline
> +int unix_stream_read_generic(struct unix_stream_read_state *state)


Why __always_inline here?


* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
  2015-05-20 20:59   ` Cong Wang
@ 2015-05-20 21:47     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 21:47 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev

Hi Cong,

On Wed, May 20, 2015, at 22:59, Cong Wang wrote:
> On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> >
> > -static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
> > -                              size_t size, int flags)
> > +struct unix_stream_read_state {
> > +       int (*recv_actor)(struct sk_buff *, int, int,
> > +                         struct unix_stream_read_state *);
> > +       struct socket *socket;
> > +       struct msghdr *msg;
> > +       struct pipe_inode_info *pipe;
> > +       size_t size;
> > +       int flags;
> > +       unsigned int splice_flags;
> > +};
> > +
> > +static __always_inline
> > +int unix_stream_read_generic(struct unix_stream_read_state *state)
> 
> 
> Why __always_inline here?

During benchmarking I discovered that the ordinary recvmsg case
lost a bit of performance because of the indirection. By marking
unix_stream_read_generic __always_inline I got it back to almost the
same numbers as without the change, so I decided to leave it
there.

Also, thank you for your other feedback. I will address it soon, after
giving the patches time to receive a bit more feedback.

Thanks,
Hannes
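
Editorial sketch, not part of the thread: the effect described above can be illustrated with a generic function that is forced inline into thin wrappers, each of which sets the actor to a compile-time constant. With GCC or Clang this gives the optimizer the chance to turn the indirect call into a direct one; whether it actually does so depends on the compiler and optimization level, so this is an assumption to verify by benchmarking. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang attribute; the kernel wraps this as __always_inline. */
#define model_always_inline static inline __attribute__((always_inline))

struct op_state {
	int (*actor)(int);
	int n;
};

/* Generic loop with an indirect call through the state structure. */
model_always_inline int run_generic(const struct op_state *st)
{
	int sum = 0, i;

	assert(st->actor != NULL);
	for (i = 0; i < st->n; i++)
		sum += st->actor(i);
	return sum;
}

static int square_actor(int x)
{
	return x * x;
}

static int double_actor(int x)
{
	return 2 * x;
}

/* After inlining, the actor is a known constant in each wrapper. */
static int sum_squares(int n)
{
	struct op_state st = { .actor = square_actor, .n = n };

	return run_generic(&st);
}

static int sum_doubles(int n)
{
	struct op_state st = { .actor = double_actor, .n = n };

	return run_generic(&st);
}
```

The cost of this approach is code size: the body of the generic function is duplicated in every wrapper.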


* Re: [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
  2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
  2015-05-20 18:40   ` Cong Wang
@ 2015-05-20 23:21   ` Eric Dumazet
  2015-05-20 23:47     ` Hannes Frederic Sowa
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-05-20 23:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> This patch implements sendpage support for AF_UNIX SOCK_STREAM
> sockets. This is also required for a complete splice implementation.
> 
> The implementation is a bit tricky because we append to already existing
> skbs and so have to hold unix_sk->readlock to protect the reading side
> from dropping the tail of the sk_receive_queue.
> 
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>  net/unix/af_unix.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 941b3d2..9bb880a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -518,6 +518,8 @@ static int unix_ioctl(struct socket *, unsigned int, unsigned long);
>  static int unix_shutdown(struct socket *, int);
>  static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
>  static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
> +static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
> +				    size_t size, int flags);
>  static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
>  static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
>  static int unix_dgram_connect(struct socket *, struct sockaddr *,
> @@ -558,7 +560,7 @@ static const struct proto_ops unix_stream_ops = {
>  	.sendmsg =	unix_stream_sendmsg,
>  	.recvmsg =	unix_stream_recvmsg,
>  	.mmap =		sock_no_mmap,
> -	.sendpage =	sock_no_sendpage,
> +	.sendpage =	unix_stream_sendpage,
>  	.set_peek_off =	unix_set_peek_off,
>  };
>  
> @@ -1720,6 +1722,107 @@ out_err:
>  	return sent ? : err;
>  }
>  
> +static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
> +				    int offset, size_t size, int flags)
> +{
> +	int err;
> +	bool send_sigpipe;
> +	struct sock *sk, *other;
> +	struct sk_buff *skb, *newskb, *tail;
> +
> +	err = 0;
> +	tail = NULL;
> +	newskb = NULL;
> +	sk = socket->sk;
> +	send_sigpipe = true;
> +
> +	if (flags & MSG_OOB)
> +		return -EOPNOTSUPP;
> +
> +	other = unix_peer(sk);
> +	if (!other || sk->sk_state != TCP_ESTABLISHED)
> +		return -ENOTCONN;
> +
> +	if (false) {
> +alloc_skb:
> +		unix_state_unlock(other);
> +		mutex_unlock(&unix_sk(other)->readlock);
> +		newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
> +					      &err, 0);
> +		if (!newskb)
> +			return err;
> +	}
> +
> +	/* we must acquire readlock as we modify already present
> +	 * skbs in the sk_receive_queue and mess with skb->len
> +	 */
> +	err = mutex_lock_interruptible(&unix_sk(other)->readlock);
> +	if (err) {
> +		err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
> +		send_sigpipe = false;
> +		goto err;
> +	}
> +
> +	if (sk->sk_shutdown & SEND_SHUTDOWN) {
> +		err = -EPIPE;
> +		goto err_unlock;
> +	}
> +
> +	unix_state_lock(other);
> +
> +	if (sock_flag(other, SOCK_DEAD) ||
> +	    other->sk_shutdown & RCV_SHUTDOWN) {
> +		err = -EPIPE;
> +		goto err_state_unlock;
> +	}
> +
> +	skb = skb_peek_tail(&other->sk_receive_queue);
> +	if (tail && tail == skb) {
> +		skb = newskb;
> +	} else if (!skb) {
> +		if (newskb)
> +			skb = newskb;
> +		else
> +			goto alloc_skb;
> +	} else if (newskb) {
> +		/* this is fast path, we don't necessarily need to
> +		 * call to kfree_skb even though with newskb == NULL
> +		 * this - does no harm
> +		 */
> +		consume_skb(newskb);
> +	}
> +
> +	if (skb_append_pagefrags(skb, page, offset, size)) {
> +		tail = skb;
> +		goto alloc_skb;
> +	}
> +
> +	skb->len += size;
> +	skb->data_len += size;
> +	skb->truesize += size;
> +	atomic_add(size, &sk->sk_wmem_alloc);
> +
> +	if (newskb)
> +		skb_queue_tail(&other->sk_receive_queue, newskb);

Are you sure we need the skb_queue_tail() here (taking spinlock) ?

This would tell us there might be a possible race.

A comment would be nice eventually.


* Re: [PATCH net-next 3/4] net: make skb_splice_bits more configureable
  2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
@ 2015-05-20 23:43   ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2015-05-20 23:43 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.
> 
> AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
> use a callback to make the locking and unlocking configurable.
> 
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>  include/linux/skbuff.h | 11 +++++++++--
>  net/core/skbuff.c      | 45 ++++++++++++++++++++++++++++-----------------
>  net/ipv4/tcp.c         |  5 +++--
>  3 files changed, 40 insertions(+), 21 deletions(-)

Acked-by: Eric Dumazet <edumazet@google.com>
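
The callback approach described in the quoted changelog can be illustrated
with a small userspace sketch. This is not the skb_splice_bits signature and
not code from the patch: struct conn, push_cb, splice_chunks,
push_dropping_lock and splice_demo are hypothetical names. The only idea taken
from the changelog is that the generic helper stays unaware of the caller's
locking scheme and delegates it to a callback.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not kernel APIs. */
struct conn {
	int lock_depth;		/* > 0 while the connection's lock is held */
	int unlocked_pushes;	/* pushes that ran with the lock dropped */
};

/* Callback type: push len bytes downstream and handle locking itself. */
typedef size_t (*push_cb)(struct conn *c, size_t len);

/* Generic helper: knows how to chunk data, not how the caller locks. */
static size_t splice_chunks(struct conn *c, size_t total, size_t chunk,
			    push_cb push)
{
	size_t done = 0;

	if (chunk == 0)
		return 0;

	while (done < total) {
		size_t n = total - done < chunk ? total - done : chunk;

		done += push(c, n);
	}
	return done;
}

/* One protocol's callback: drop its lock around the push, retake it after. */
static size_t push_dropping_lock(struct conn *c, size_t len)
{
	c->lock_depth--;	/* stands in for the protocol's unlock */
	if (c->lock_depth == 0)
		c->unlocked_pushes++;
	c->lock_depth++;	/* stands in for the protocol's lock */
	return len;
}

/* Enters with the lock "held" and moves 10 bytes in chunks of 4. */
size_t splice_demo(struct conn *c)
{
	c->lock_depth = 1;
	c->unlocked_pushes = 0;
	return splice_chunks(c, 10, 4, push_dropping_lock);
}
```

A second protocol with a different locking scheme would pass its own callback
to splice_chunks and leave the helper untouched.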


* Re: [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
  2015-05-20 23:21   ` Eric Dumazet
@ 2015-05-20 23:47     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 23:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 21, 2015, at 01:21, Eric Dumazet wrote:
> On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> > This patch implements sendpage support for AF_UNIX SOCK_STREAM
> > +
> > +	if (newskb)
> > +		skb_queue_tail(&other->sk_receive_queue, newskb);
> 
> Are you sure we need the skb_queue_tail() here (taking spinlock) ?
> 
> This would tell us there might be a possible race.
> 
> A comment would be nice eventually.

Hmm, at first sight, I think we can change this to __skb_queue_tail.
sendpage takes both state_lock and the readlock mutex and thus blocks out
both recvmsg and sendmsg. unix_stream_connect is also serialized by
state_lock.

I guess I used it because of unix_stream_sendmsg, where it is actually
necessary, as recvmsg unlinks skbs without state_lock and sendmsg
doesn't hold the readlock mutex.

Thanks for the hint!

Bye,
Hannes
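
The locking argument in the reply above can be illustrated outside the kernel.
The sketch below is a userspace analogy using pthreads, not kernel code:
struct queue, queue_tail_unlocked, queue_tail_locked and queue_demo are
hypothetical names that loosely mirror the __skb_queue_tail/skb_queue_tail
pair. It assumes the outer mutex plays the role of the locks the caller
already holds, so the queue's own lock adds nothing on that path.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct node {
	struct node *next;
	int value;
};

struct queue {
	pthread_mutex_t lock;	/* plays the role of the queue's own spinlock */
	struct node *head;
	struct node *tail;
	size_t len;
};

static void queue_init(struct queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = NULL;
	q->tail = NULL;
	q->len = 0;
}

/* Unlocked variant: the caller must already exclude every other queue user. */
static void queue_tail_unlocked(struct queue *q, struct node *n)
{
	n->next = NULL;
	if (q->tail)
		q->tail->next = n;
	else
		q->head = n;
	q->tail = n;
	q->len++;
}

/* Locked variant: needed when another path touches the queue without the
 * outer lock.
 */
static void queue_tail_locked(struct queue *q, struct node *n)
{
	pthread_mutex_lock(&q->lock);
	queue_tail_unlocked(q, n);
	pthread_mutex_unlock(&q->lock);
}

/* Appends one node under an outer lock and one via the locked helper,
 * then returns the queue length.
 */
size_t queue_demo(void)
{
	pthread_mutex_t outer = PTHREAD_MUTEX_INITIALIZER;
	struct queue q;
	struct node a = { .value = 1 };
	struct node b = { .value = 2 };

	queue_init(&q);

	/* Outer lock held: the unlocked helper is sufficient here. */
	pthread_mutex_lock(&outer);
	queue_tail_unlocked(&q, &a);
	pthread_mutex_unlock(&outer);

	/* No outer lock held: fall back to the self-locking helper. */
	queue_tail_locked(&q, &b);

	assert(q.head == &a && q.tail == &b && a.next == &b);
	pthread_mutex_destroy(&q.lock);
	pthread_mutex_destroy(&outer);
	return q.len;
}
```

The unlocked helper is only correct if every path that can reach the queue
takes the same outer lock; that is the property the reply argues holds for
sendpage but not for unix_stream_sendmsg.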


* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
  2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
  2015-05-20 20:59   ` Cong Wang
@ 2015-05-20 23:50   ` Eric Dumazet
  2015-05-20 23:57     ` Hannes Frederic Sowa
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-05-20 23:50 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev

On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:

> +
> +static int unix_stream_splice_actor(struct sk_buff *skb,
> +				    int skip, int chunk,
> +				    struct unix_stream_read_state *state)
> +{
> +	return skb_splice_bits(skb, state->socket->sk,
> +			       UNIXCB(skb).consumed + skip,
> +			       state->pipe, chunk, state->splice_flags,
> +			       skb_unix_socket_splice);
> +}

I am not sure you added EXPORT_SYMBOL(skb_splice_bits) in one of the
patches ?

Make sure CONFIG_UNIX=m still works.

Thanks.


* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
  2015-05-20 23:50   ` Eric Dumazet
@ 2015-05-20 23:57     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 23:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Thu, May 21, 2015, at 01:50, Eric Dumazet wrote:
> On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> 
> > +
> > +static int unix_stream_splice_actor(struct sk_buff *skb,
> > +				    int skip, int chunk,
> > +				    struct unix_stream_read_state *state)
> > +{
> > +	return skb_splice_bits(skb, state->socket->sk,
> > +			       UNIXCB(skb).consumed + skip,
> > +			       state->pipe, chunk, state->splice_flags,
> > +			       skb_unix_socket_splice);
> > +}
> 
> I am not sure you added EXPORT_SYMBOL(skb_splice_bits) in one of the
> patches ?
> 
> Make sure CONFIG_UNIX=m still works.

I didn't. Thanks, I will test that.

Bye,
Hannes


Thread overview: 14+ messages
2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
2015-05-20 17:51   ` Cong Wang
2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
2015-05-20 18:40   ` Cong Wang
2015-05-20 23:21   ` Eric Dumazet
2015-05-20 23:47     ` Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
2015-05-20 23:43   ` Eric Dumazet
2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
2015-05-20 20:59   ` Cong Wang
2015-05-20 21:47     ` Hannes Frederic Sowa
2015-05-20 23:50   ` Eric Dumazet
2015-05-20 23:57     ` Hannes Frederic Sowa
