* [PATCH net-next 0/4] net: af_unix: zerocopy stream bits
@ 2015-05-20 15:35 Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
To: netdev
This series implements zerocopy support for AF_UNIX SOCK_STREAM sockets.
Hannes Frederic Sowa (4):
net: skbuff: add skb_append_pagefrags and use it
net: af_unix: implement stream sendpage support
net: make skb_splice_bits more configureable
net: af_unix: implement splice for stream af_unix sockets
include/linux/skbuff.h | 14 ++-
net/core/skbuff.c | 63 +++++++++----
net/ipv4/ip_output.c | 8 +-
net/ipv4/tcp.c | 5 +-
net/unix/af_unix.c | 246 ++++++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 287 insertions(+), 49 deletions(-)
--
2.1.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it
2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
2015-05-20 17:51 ` Cong Wang
2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
To: netdev
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
include/linux/skbuff.h | 3 +++
net/core/skbuff.c | 18 ++++++++++++++++++
net/ipv4/ip_output.c | 8 ++------
3 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 40960fe..b9d267b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -860,6 +860,9 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
int len, int odd, struct sk_buff *skb),
void *from, int length);
+int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
+ int offset, size_t size);
+
struct skb_seq_state {
__u32 lower_offset;
__u32 upper_offset;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3fe9bd..1d3f88a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2915,6 +2915,24 @@ int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
}
EXPORT_SYMBOL(skb_append_datato_frags);
+int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
+ int offset, size_t size)
+{
+ int i = skb_shinfo(skb)->nr_frags;
+
+ if (skb_can_coalesce(skb, i, page, offset)) {
+ skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
+ } else if (i < MAX_SKB_FRAGS) {
+ get_page(page);
+ skb_fill_page_desc(skb, i, page, offset, size);
+ } else {
+ return -EMSGSIZE;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(skb_append_pagefrags);
+
/**
* skb_pull_rcsum - pull skb and update receive checksum
* @skb: buffer to update
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 8d91b92..35ff40f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1292,12 +1292,8 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
i = skb_shinfo(skb)->nr_frags;
if (len > size)
len = size;
- if (skb_can_coalesce(skb, i, page, offset)) {
- skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
- } else if (i < MAX_SKB_FRAGS) {
- get_page(page);
- skb_fill_page_desc(skb, i, page, offset, len);
- } else {
+
+ if (skb_append_pagefrags(skb, page, offset, len)) {
err = -EMSGSIZE;
goto error;
}
--
2.1.0
* [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
2015-05-20 18:40 ` Cong Wang
2015-05-20 23:21 ` Eric Dumazet
2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
3 siblings, 2 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
To: netdev
This patch implements sendpage support for AF_UNIX SOCK_STREAM
sockets. This is also required for a complete splice implementation.
The implementation is a bit tricky because we append to skbs that are
already queued, so we have to hold the peer's unix_sk(other)->readlock
to keep the reading side from dequeuing and freeing the tail of its
sk_receive_queue while we modify it.
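To make the tail-append path concrete, here is a small userspace toy model (plain C, not kernel code). The names toy_buf, toy_queue, toy_append_pagefrag, toy_sendpage and the tiny TOY_MAX_FRAGS limit are hypothetical. The model only mirrors the decision order of skb_append_pagefrags() (coalesce with the last fragment, else use a free slot, else -EMSGSIZE) and the allocate-on-failure fallback of unix_stream_sendpage(); locking is only noted in comments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Deliberately tiny; the kernel limit is MAX_SKB_FRAGS. */
#define TOY_MAX_FRAGS 2

struct toy_frag {
	const char *page;	/* backing "page" */
	size_t off, len;
};

struct toy_buf {
	struct toy_frag frags[TOY_MAX_FRAGS];
	int nr_frags;
	size_t len;		/* total payload, like skb->len */
	struct toy_buf *next;
};

struct toy_queue {
	struct toy_buf *head, *tail;
	int nbufs;
};

/* Same decision order as skb_append_pagefrags(). */
static int toy_append_pagefrag(struct toy_buf *b, const char *page,
			       size_t off, size_t len)
{
	int i = b->nr_frags;

	if (i > 0 && b->frags[i - 1].page == page &&
	    b->frags[i - 1].off + b->frags[i - 1].len == off) {
		b->frags[i - 1].len += len;	/* contiguous: coalesce */
	} else if (i < TOY_MAX_FRAGS) {
		b->frags[i] = (struct toy_frag){ page, off, len };
		b->nr_frags++;
	} else {
		return -EMSGSIZE;
	}
	b->len += len;
	return 0;
}

/* Try the already-queued tail first; allocate a new buffer only if the
 * queue is empty or the tail is full. In the kernel, the receiver's
 * readlock must be held here because the tail is already visible to it. */
static int toy_sendpage(struct toy_queue *q, const char *page,
			size_t off, size_t len)
{
	struct toy_buf *b;
	int err;

	if (q->tail && toy_append_pagefrag(q->tail, page, off, len) == 0)
		return 0;

	b = calloc(1, sizeof(*b));
	if (!b)
		return -ENOMEM;
	if (q->tail)
		q->tail->next = b;
	else
		q->head = b;
	q->tail = b;
	q->nbufs++;

	err = toy_append_pagefrag(b, page, off, len);
	assert(err == 0);	/* a fresh buffer always has a free slot */
	return err;
}
```

With TOY_MAX_FRAGS = 2, sending (page_a, 0, 100), (page_a, 100, 50), (page_b, 0, 10) and (page_a, 500, 10) leaves two buffers: the first holds 160 bytes in two fragments (the second send was coalesced), the second holds the last 10 bytes.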
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
net/unix/af_unix.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 104 insertions(+), 1 deletion(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 941b3d2..9bb880a 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -518,6 +518,8 @@ static int unix_ioctl(struct socket *, unsigned int, unsigned long);
static int unix_shutdown(struct socket *, int);
static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
+static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
+ size_t size, int flags);
static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
static int unix_dgram_connect(struct socket *, struct sockaddr *,
@@ -558,7 +560,7 @@ static const struct proto_ops unix_stream_ops = {
.sendmsg = unix_stream_sendmsg,
.recvmsg = unix_stream_recvmsg,
.mmap = sock_no_mmap,
- .sendpage = sock_no_sendpage,
+ .sendpage = unix_stream_sendpage,
.set_peek_off = unix_set_peek_off,
};
@@ -1720,6 +1722,107 @@ out_err:
return sent ? : err;
}
+static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
+ int offset, size_t size, int flags)
+{
+ int err;
+ bool send_sigpipe;
+ struct sock *sk, *other;
+ struct sk_buff *skb, *newskb, *tail;
+
+ err = 0;
+ tail = NULL;
+ newskb = NULL;
+ sk = socket->sk;
+ send_sigpipe = true;
+
+ if (flags & MSG_OOB)
+ return -EOPNOTSUPP;
+
+ other = unix_peer(sk);
+ if (!other || sk->sk_state != TCP_ESTABLISHED)
+ return -ENOTCONN;
+
+ if (false) {
+alloc_skb:
+ unix_state_unlock(other);
+ mutex_unlock(&unix_sk(other)->readlock);
+ newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
+ &err, 0);
+ if (!newskb)
+ return err;
+ }
+
+ /* we must acquire readlock as we modify already present
+ * skbs in the sk_receive_queue and mess with skb->len
+ */
+ err = mutex_lock_interruptible(&unix_sk(other)->readlock);
+ if (err) {
+ err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
+ send_sigpipe = false;
+ goto err;
+ }
+
+ if (sk->sk_shutdown & SEND_SHUTDOWN) {
+ err = -EPIPE;
+ goto err_unlock;
+ }
+
+ unix_state_lock(other);
+
+ if (sock_flag(other, SOCK_DEAD) ||
+ other->sk_shutdown & RCV_SHUTDOWN) {
+ err = -EPIPE;
+ goto err_state_unlock;
+ }
+
+ skb = skb_peek_tail(&other->sk_receive_queue);
+ if (tail && tail == skb) {
+ skb = newskb;
+ } else if (!skb) {
+ if (newskb)
+ skb = newskb;
+ else
+ goto alloc_skb;
+ } else if (newskb) {
+ /* this is fast path, we don't necessarily need to
+ * call to kfree_skb even though with newskb == NULL
+ * this - does no harm
+ */
+ consume_skb(newskb);
+ }
+
+ if (skb_append_pagefrags(skb, page, offset, size)) {
+ tail = skb;
+ goto alloc_skb;
+ }
+
+ skb->len += size;
+ skb->data_len += size;
+ skb->truesize += size;
+ atomic_add(size, &sk->sk_wmem_alloc);
+
+ if (newskb)
+ skb_queue_tail(&other->sk_receive_queue, newskb);
+
+ unix_state_unlock(other);
+ mutex_unlock(&unix_sk(other)->readlock);
+
+ other->sk_data_ready(other);
+
+ return size;
+
+err_state_unlock:
+ unix_state_unlock(other);
+err_unlock:
+ mutex_unlock(&unix_sk(other)->readlock);
+err:
+ kfree_skb(newskb);
+ if (send_sigpipe && !(flags & MSG_NOSIGNAL))
+ send_sig(SIGPIPE, current, 0);
+ return err;
+}
+
static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
size_t len)
{
--
2.1.0
* [PATCH net-next 3/4] net: make skb_splice_bits more configureable
2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
2015-05-20 23:43 ` Eric Dumazet
2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
3 siblings, 1 reply; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
To: netdev
Prepare skb_splice_bits to deal with AF_UNIX sockets.
AF_UNIX sockets don't use lock_sock/release_sock, so we have to
use a callback to make the locking and unlocking configurable.
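A toy model of the callback split, in plain userspace C. The toy_sock flags stand in for the TCP socket lock and the AF_UNIX readlock, and toy_splice_to_pipe() asserts that neither is held, which is the lock-ordering rule described in the comment of skb_socket_splice(). All toy_* names are hypothetical; the AF_UNIX flavour only loosely mirrors skb_unix_socket_splice() from patch 4/4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sock {
	bool sk_lock;	/* stands in for lock_sock()/release_sock() (TCP) */
	bool readlock;	/* stands in for unix_sk(sk)->readlock (AF_UNIX) */
};

struct toy_pipe {
	size_t bytes;
};

/* Stand-in for splice_to_pipe(): it would take the pipe lock, so no
 * socket-side lock may be held here, or the lock order would invert
 * relative to sendfile(). */
static long toy_splice_to_pipe(struct toy_sock *sk, struct toy_pipe *pipe,
			       size_t n)
{
	assert(!sk->sk_lock && !sk->readlock);
	pipe->bytes += n;
	return (long)n;
}

typedef long (*toy_splice_cb_t)(struct toy_sock *, struct toy_pipe *, size_t);

/* TCP flavour, mirroring skb_socket_splice(). */
static long toy_tcp_splice(struct toy_sock *sk, struct toy_pipe *pipe, size_t n)
{
	long ret;

	sk->sk_lock = false;		/* release_sock(sk) */
	ret = toy_splice_to_pipe(sk, pipe, n);
	sk->sk_lock = true;		/* lock_sock(sk) */
	return ret;
}

/* AF_UNIX flavour: drop and retake the readlock instead. */
static long toy_unix_splice(struct toy_sock *sk, struct toy_pipe *pipe, size_t n)
{
	long ret;

	sk->readlock = false;		/* mutex_unlock(&u->readlock) */
	ret = toy_splice_to_pipe(sk, pipe, n);
	sk->readlock = true;		/* mutex_lock(&u->readlock) */
	return ret;
}

/* Like the reworked skb_splice_bits(): the caller injects the locking policy. */
static long toy_splice_bits(struct toy_sock *sk, struct toy_pipe *pipe,
			    size_t n, toy_splice_cb_t splice_cb)
{
	if (n == 0)		/* analogous to "if (spd.nr_pages)" */
		return 0;
	return splice_cb(sk, pipe, n);
}
```

Passing the wrong callback (for example toy_tcp_splice while only the readlock is held) trips the assertion, which is the kind of mismatch the callback parameter is meant to rule out.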
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
include/linux/skbuff.h | 11 +++++++++--
net/core/skbuff.c | 45 ++++++++++++++++++++++++++++-----------------
net/ipv4/tcp.c | 5 +++--
3 files changed, 40 insertions(+), 21 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b9d267b..895435c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -35,6 +35,7 @@
#include <linux/netdev_features.h>
#include <linux/sched.h>
#include <net/flow_dissector.h>
+#include <linux/splice.h>
/* A. Checksumming of received packets by device.
*
@@ -2698,9 +2699,15 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
__wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
int len, __wsum csum);
-int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
+ssize_t skb_socket_splice(struct sock *sk,
+ struct pipe_inode_info *pipe,
+ struct splice_pipe_desc *spd);
+int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
struct pipe_inode_info *pipe, unsigned int len,
- unsigned int flags);
+ unsigned int flags,
+ ssize_t (*splice_cb)(struct sock *,
+ struct pipe_inode_info *,
+ struct splice_pipe_desc *));
void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
unsigned int skb_zerocopy_headlen(const struct sk_buff *from);
int skb_zerocopy(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1d3f88a..1fc76a9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1870,15 +1870,39 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe,
return false;
}
+ssize_t skb_socket_splice(struct sock *sk,
+ struct pipe_inode_info *pipe,
+ struct splice_pipe_desc *spd)
+{
+ int ret;
+
+ /* Drop the socket lock, otherwise we have reverse
+ * locking dependencies between sk_lock and i_mutex
+ * here as compared to sendfile(). We enter here
+ * with the socket lock held, and splice_to_pipe() will
+ * grab the pipe inode lock. For sendfile() emulation,
+ * we call into ->sendpage() with the i_mutex lock held
+ * and networking will grab the socket lock.
+ */
+ release_sock(sk);
+ ret = splice_to_pipe(pipe, spd);
+ lock_sock(sk);
+
+ return ret;
+}
+
/*
* Map data from the skb to a pipe. Should handle both the linear part,
* the fragments, and the frag list. It does NOT handle frag lists within
* the frag list, if such a thing exists. We'd probably need to recurse to
* handle that cleanly.
*/
-int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
struct pipe_inode_info *pipe, unsigned int tlen,
- unsigned int flags)
+ unsigned int flags,
+ ssize_t (*splice_cb)(struct sock *,
+ struct pipe_inode_info *,
+ struct splice_pipe_desc *))
{
struct partial_page partial[MAX_SKB_FRAGS];
struct page *pages[MAX_SKB_FRAGS];
@@ -1891,7 +1915,6 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
.spd_release = sock_spd_release,
};
struct sk_buff *frag_iter;
- struct sock *sk = skb->sk;
int ret = 0;
/*
@@ -1914,20 +1937,8 @@ int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
}
done:
- if (spd.nr_pages) {
- /*
- * Drop the socket lock, otherwise we have reverse
- * locking dependencies between sk_lock and i_mutex
- * here as compared to sendfile(). We enter here
- * with the socket lock held, and splice_to_pipe() will
- * grab the pipe inode lock. For sendfile() emulation,
- * we call into ->sendpage() with the i_mutex lock held
- * and networking will grab the socket lock.
- */
- release_sock(sk);
- ret = splice_to_pipe(pipe, &spd);
- lock_sock(sk);
- }
+ if (spd.nr_pages)
+ ret = splice_cb(sk, pipe, &spd);
return ret;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bb9bb84..67f0a80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -694,8 +694,9 @@ static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
struct tcp_splice_state *tss = rd_desc->arg.data;
int ret;
- ret = skb_splice_bits(skb, offset, tss->pipe, min(rd_desc->count, len),
- tss->flags);
+ ret = skb_splice_bits(skb, skb->sk, offset, tss->pipe,
+ min(rd_desc->count, len), tss->flags,
+ skb_socket_splice);
if (ret > 0)
rd_desc->count -= ret;
return ret;
--
2.1.0
* [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
` (2 preceding siblings ...)
2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
@ 2015-05-20 15:35 ` Hannes Frederic Sowa
2015-05-20 20:59 ` Cong Wang
2015-05-20 23:50 ` Eric Dumazet
3 siblings, 2 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 15:35 UTC (permalink / raw)
To: netdev
In this patch, unix_stream_recvmsg is refactored into
unix_stream_read_generic and extended to handle splicing into a pipe.
The refactoring is minor; mostly we have to cope with the struct msghdr
argument not existing in the splice case.
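The actor split can be modelled in a few lines of userspace C. This is a sketch under simplifying assumptions (in-memory chunks instead of skbs, no peek offset, no credentials, no waiting). toy_read_state, toy_read_generic and the two actors are hypothetical names that only loosely mirror unix_stream_read_state, unix_stream_read_generic, unix_stream_read_actor and unix_stream_splice_actor. The copy actor duplicates bytes, while the splice actor only records references, which is the zero-copy point of the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_REFS 8

struct toy_read_state;
typedef int (*toy_actor_t)(const char *data, int chunk,
			   struct toy_read_state *st);

struct toy_read_state {
	toy_actor_t recv_actor;
	char *msg_buf;			/* recvmsg path; NULL when splicing */
	size_t msg_off;
	const char *refs[TOY_MAX_REFS];	/* splice path: references only */
	int nr_refs;
	size_t size;			/* bytes requested */
};

/* recvmsg-style actor: copies the bytes out. */
static int toy_copy_actor(const char *data, int chunk,
			  struct toy_read_state *st)
{
	assert(st->msg_buf != NULL);
	memcpy(st->msg_buf + st->msg_off, data, (size_t)chunk);
	st->msg_off += (size_t)chunk;
	return chunk;
}

/* splice-style actor: records where the data lives, copies nothing. */
static int toy_splice_actor(const char *data, int chunk,
			    struct toy_read_state *st)
{
	if (st->nr_refs == TOY_MAX_REFS)
		return -EFAULT;
	st->refs[st->nr_refs++] = data;
	return chunk;
}

/* Shared read loop; only the actor differs between the two paths.
 * On an actor error it returns the bytes already consumed, if any,
 * otherwise the (negative) error. */
static int toy_read_generic(const char *const *bufs, const int *lens,
			    int nbufs, struct toy_read_state *st)
{
	int copied = 0;
	size_t size = st->size;

	for (int i = 0; i < nbufs && size; i++) {
		int chunk = (size_t)lens[i] < size ? lens[i] : (int)size;

		chunk = st->recv_actor(bufs[i], chunk, st);
		if (chunk < 0)
			return copied ? copied : chunk;
		copied += chunk;
		size -= (size_t)chunk;
	}
	return copied;
}
```

Reading 7 bytes from the chunks "hello" and "world" with the copy actor fills the buffer with "hellowo"; reading 10 bytes with the splice actor returns 10 and records two references without copying anything.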
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
net/unix/af_unix.c | 141 +++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 120 insertions(+), 21 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 9bb880a..d2d3ebf 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -520,6 +520,9 @@ static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
size_t size, int flags);
+static ssize_t unix_stream_splice_read(struct socket *, loff_t *ppos,
+ struct pipe_inode_info *, size_t size,
+ unsigned int flags);
static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
static int unix_dgram_connect(struct socket *, struct sockaddr *,
@@ -561,6 +564,7 @@ static const struct proto_ops unix_stream_ops = {
.recvmsg = unix_stream_recvmsg,
.mmap = sock_no_mmap,
.sendpage = unix_stream_sendpage,
+ .splice_read = unix_stream_splice_read,
.set_peek_off = unix_set_peek_off,
};
@@ -1963,8 +1967,9 @@ out:
* Sleep until more data has arrived. But check for races..
*/
static long unix_stream_data_wait(struct sock *sk, long timeo,
- struct sk_buff *last)
+ struct sk_buff *last, unsigned int last_len)
{
+ struct sk_buff *tail;
DEFINE_WAIT(wait);
unix_state_lock(sk);
@@ -1972,7 +1977,9 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
for (;;) {
prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
- if (skb_peek_tail(&sk->sk_receive_queue) != last ||
+ tail = skb_peek_tail(&sk->sk_receive_queue);
+ if (tail != last ||
+ (tail && tail->len != last_len) ||
sk->sk_err ||
(sk->sk_shutdown & RCV_SHUTDOWN) ||
signal_pending(current) ||
@@ -1996,38 +2003,51 @@ static unsigned int unix_skb_len(const struct sk_buff *skb)
return skb->len - UNIXCB(skb).consumed;
}
-static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
- size_t size, int flags)
+struct unix_stream_read_state {
+ int (*recv_actor)(struct sk_buff *, int, int,
+ struct unix_stream_read_state *);
+ struct socket *socket;
+ struct msghdr *msg;
+ struct pipe_inode_info *pipe;
+ size_t size;
+ int flags;
+ unsigned int splice_flags;
+};
+
+static __always_inline
+int unix_stream_read_generic(struct unix_stream_read_state *state)
{
struct scm_cookie scm;
+ struct socket *sock = state->socket;
struct sock *sk = sock->sk;
struct unix_sock *u = unix_sk(sk);
- DECLARE_SOCKADDR(struct sockaddr_un *, sunaddr, msg->msg_name);
int copied = 0;
+ int flags = state->flags;
int noblock = flags & MSG_DONTWAIT;
- int check_creds = 0;
+ bool check_creds = false;
int target;
int err = 0;
long timeo;
int skip;
+ size_t size = state->size;
+ unsigned int last_len;
err = -EINVAL;
if (sk->sk_state != TCP_ESTABLISHED)
goto out;
err = -EOPNOTSUPP;
- if (flags&MSG_OOB)
+ if (flags & MSG_OOB)
goto out;
- target = sock_rcvlowat(sk, flags&MSG_WAITALL, size);
+ target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
timeo = sock_rcvtimeo(sk, noblock);
+ memset(&scm, 0, sizeof(scm));
+
/* Lock the socket to prevent queue disordering
* while sleeps in memcpy_tomsg
*/
-
- memset(&scm, 0, sizeof(scm));
-
err = mutex_lock_interruptible(&u->readlock);
if (unlikely(err)) {
/* recvmsg() in non blocking mode is supposed to return -EAGAIN
@@ -2043,6 +2063,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
unix_state_lock(sk);
last = skb = skb_peek(&sk->sk_receive_queue);
+ last_len = last ? last->len : 0;
again:
if (skb == NULL) {
unix_sk(sk)->recursion_level = 0;
@@ -2065,16 +2086,17 @@ again:
break;
mutex_unlock(&u->readlock);
- timeo = unix_stream_data_wait(sk, timeo, last);
+ timeo = unix_stream_data_wait(sk, timeo, last,
+ last_len);
- if (signal_pending(current)
- || mutex_lock_interruptible(&u->readlock)) {
+ if (signal_pending(current) ||
+ mutex_lock_interruptible(&u->readlock)) {
err = sock_intr_errno(timeo);
goto out;
}
continue;
- unlock:
+unlock:
unix_state_unlock(sk);
break;
}
@@ -2083,6 +2105,7 @@ again:
while (skip >= unix_skb_len(skb)) {
skip -= unix_skb_len(skb);
last = skb;
+ last_len = skb->len;
skb = skb_peek_next(skb, &sk->sk_receive_queue);
if (!skb)
goto again;
@@ -2099,18 +2122,20 @@ again:
} else if (test_bit(SOCK_PASSCRED, &sock->flags)) {
/* Copy credentials */
scm_set_cred(&scm, UNIXCB(skb).pid, UNIXCB(skb).uid, UNIXCB(skb).gid);
- check_creds = 1;
+ check_creds = true;
}
/* Copy address just once */
- if (sunaddr) {
- unix_copy_addr(msg, skb->sk);
+ if (state->msg && state->msg->msg_name) {
+ DECLARE_SOCKADDR(struct sockaddr_un *, sunaddr,
+ state->msg->msg_name);
+ unix_copy_addr(state->msg, skb->sk);
sunaddr = NULL;
}
chunk = min_t(unsigned int, unix_skb_len(skb) - skip, size);
- if (skb_copy_datagram_msg(skb, UNIXCB(skb).consumed + skip,
- msg, chunk)) {
+ chunk = state->recv_actor(skb, skip, chunk, state);
+ if (chunk < 0) {
if (copied == 0)
copied = -EFAULT;
break;
@@ -2148,11 +2173,85 @@ again:
} while (size);
mutex_unlock(&u->readlock);
- scm_recv(sock, msg, &scm, flags);
+ if (state->msg)
+ scm_recv(sock, state->msg, &scm, flags);
+ else
+ scm_destroy(&scm);
out:
return copied ? : err;
}
+static int unix_stream_read_actor(struct sk_buff *skb,
+ int skip, int chunk,
+ struct unix_stream_read_state *state)
+{
+ int ret;
+
+ ret = skb_copy_datagram_msg(skb, UNIXCB(skb).consumed + skip,
+ state->msg, chunk);
+ return ret ?: chunk;
+}
+
+static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
+ size_t size, int flags)
+{
+ struct unix_stream_read_state state = {
+ .recv_actor = unix_stream_read_actor,
+ .socket = sock,
+ .msg = msg,
+ .size = size,
+ .flags = flags
+ };
+
+ return unix_stream_read_generic(&state);
+}
+
+static ssize_t skb_unix_socket_splice(struct sock *sk,
+ struct pipe_inode_info *pipe,
+ struct splice_pipe_desc *spd)
+{
+ int ret;
+ struct unix_sock *u = unix_sk(sk);
+
+ mutex_unlock(&u->readlock);
+ ret = splice_to_pipe(pipe, spd);
+ mutex_lock(&u->readlock);
+
+ return ret;
+}
+
+static int unix_stream_splice_actor(struct sk_buff *skb,
+ int skip, int chunk,
+ struct unix_stream_read_state *state)
+{
+ return skb_splice_bits(skb, state->socket->sk,
+ UNIXCB(skb).consumed + skip,
+ state->pipe, chunk, state->splice_flags,
+ skb_unix_socket_splice);
+}
+
+static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos,
+ struct pipe_inode_info *pipe,
+ size_t size, unsigned int flags)
+{
+ struct unix_stream_read_state state = {
+ .recv_actor = unix_stream_splice_actor,
+ .socket = sock,
+ .pipe = pipe,
+ .size = size,
+ .splice_flags = flags,
+ };
+
+ if (unlikely(*ppos))
+ return -ESPIPE;
+
+ if (sock->file->f_flags & O_NONBLOCK ||
+ flags & SPLICE_F_NONBLOCK)
+ state.flags = MSG_DONTWAIT;
+
+ return unix_stream_read_generic(&state);
+}
+
static int unix_shutdown(struct socket *sock, int mode)
{
struct sock *sk = sock->sk;
--
2.1.0
* Re: [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it
2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
@ 2015-05-20 17:51 ` Cong Wang
0 siblings, 0 replies; 14+ messages in thread
From: Cong Wang @ 2015-05-20 17:51 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> +int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
> + int offset, size_t size)
> +{
> + int i = skb_shinfo(skb)->nr_frags;
> +
> + if (skb_can_coalesce(skb, i, page, offset)) {
> + skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
> + } else if (i < MAX_SKB_FRAGS) {
> + get_page(page);
> + skb_fill_page_desc(skb, i, page, offset, size);
> + } else {
> + return -EMSGSIZE;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(skb_append_pagefrags);
> +
> /**
> * skb_pull_rcsum - pull skb and update receive checksum
> * @skb: buffer to update
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 8d91b92..35ff40f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1292,12 +1292,8 @@ ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
> i = skb_shinfo(skb)->nr_frags;
> if (len > size)
> len = size;
> - if (skb_can_coalesce(skb, i, page, offset)) {
> - skb_frag_size_add(&skb_shinfo(skb)->frags[i-1], len);
> - } else if (i < MAX_SKB_FRAGS) {
> - get_page(page);
> - skb_fill_page_desc(skb, i, page, offset, len);
> - } else {
> +
> + if (skb_append_pagefrags(skb, page, offset, len)) {
> err = -EMSGSIZE;
> goto error;
> }
The 'i' can be removed now.
* Re: [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
@ 2015-05-20 18:40 ` Cong Wang
2015-05-20 23:21 ` Eric Dumazet
1 sibling, 0 replies; 14+ messages in thread
From: Cong Wang @ 2015-05-20 18:40 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
>
> +static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
> + int offset, size_t size, int flags)
> +{
> + int err;
> + bool send_sigpipe;
> + struct sock *sk, *other;
> + struct sk_buff *skb, *newskb, *tail;
> +
> + err = 0;
> + tail = NULL;
> + newskb = NULL;
> + sk = socket->sk;
> + send_sigpipe = true;
Please fold these initializations into the declarations.
* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
@ 2015-05-20 20:59 ` Cong Wang
2015-05-20 21:47 ` Hannes Frederic Sowa
2015-05-20 23:50 ` Eric Dumazet
1 sibling, 1 reply; 14+ messages in thread
From: Cong Wang @ 2015-05-20 20:59 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
>
> -static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
> - size_t size, int flags)
> +struct unix_stream_read_state {
> + int (*recv_actor)(struct sk_buff *, int, int,
> + struct unix_stream_read_state *);
> + struct socket *socket;
> + struct msghdr *msg;
> + struct pipe_inode_info *pipe;
> + size_t size;
> + int flags;
> + unsigned int splice_flags;
> +};
> +
> +static __always_inline
> +int unix_stream_read_generic(struct unix_stream_read_state *state)
Why __always_inline here?
* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
2015-05-20 20:59 ` Cong Wang
@ 2015-05-20 21:47 ` Hannes Frederic Sowa
0 siblings, 0 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 21:47 UTC (permalink / raw)
To: Cong Wang; +Cc: netdev
Hi Cong,
On Wed, May 20, 2015, at 22:59, Cong Wang wrote:
> On Wed, May 20, 2015 at 8:35 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> >
> > -static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
> > - size_t size, int flags)
> > +struct unix_stream_read_state {
> > + int (*recv_actor)(struct sk_buff *, int, int,
> > + struct unix_stream_read_state *);
> > + struct socket *socket;
> > + struct msghdr *msg;
> > + struct pipe_inode_info *pipe;
> > + size_t size;
> > + int flags;
> > + unsigned int splice_flags;
> > +};
> > +
> > +static __always_inline
> > +int unix_stream_read_generic(struct unix_stream_read_state *state)
>
>
> Why __always_inline here?
During benchmarking I discovered that the plain recvmsg case lost a
bit of performance because of the indirection. By marking
unix_stream_read_generic __always_inline I got almost the same numbers
again as without the change, so I decided to keep it.
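A generic sketch of why forcing inlining can help here: once the generic function is inlined into a wrapper that passes a constant function pointer, the compiler can usually turn the indirect actor call into a direct (or even inlined) call. This is not the af_unix code; it uses the GCC/Clang always_inline attribute (which the kernel's __always_inline macro expands to, together with inline), and the toy_* names are hypothetical. Whether the call is actually devirtualized depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*toy_actor_t)(int);

static int toy_double(int x) { return 2 * x; }
static int toy_negate(int x) { return -x; }

/* Generic loop with an indirect call through 'actor'. */
static inline __attribute__((always_inline))
int toy_apply_sum(const int *v, size_t n, toy_actor_t actor)
{
	int sum = 0;

	assert(actor != NULL);
	for (size_t i = 0; i < n; i++)
		sum += actor(v[i]);
	return sum;
}

/* Thin wrappers, analogous to recvmsg vs splice_read: after inlining,
 * each sees a constant actor that the optimizer can call directly. */
static int toy_sum_doubled(const int *v, size_t n)
{
	return toy_apply_sum(v, n, toy_double);
}

static int toy_sum_negated(const int *v, size_t n)
{
	return toy_apply_sum(v, n, toy_negate);
}
```

Comparing the generated assembly (for example with -O2 -S) with and without the attribute is one way to check whether the indirect call survives.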
Also, thank you for your other feedback. I will address it soon, after
the patches have received a bit more feedback.
Thanks,
Hannes
* Re: [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
2015-05-20 18:40 ` Cong Wang
@ 2015-05-20 23:21 ` Eric Dumazet
2015-05-20 23:47 ` Hannes Frederic Sowa
1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-05-20 23:21 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> This patch implements sendpage support for AF_UNIX SOCK_STREAM
> sockets. This is also required for a complete splice implementation.
>
> The implementation is a bit tricky because we append to already existing
> skbs and so have to hold unix_sk->readlock to protect the reading side
> from dropping the tail of the sk_receive_queue.
>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
> net/unix/af_unix.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 941b3d2..9bb880a 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -518,6 +518,8 @@ static int unix_ioctl(struct socket *, unsigned int, unsigned long);
> static int unix_shutdown(struct socket *, int);
> static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
> static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
> +static ssize_t unix_stream_sendpage(struct socket *, struct page *, int offset,
> + size_t size, int flags);
> static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
> static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
> static int unix_dgram_connect(struct socket *, struct sockaddr *,
> @@ -558,7 +560,7 @@ static const struct proto_ops unix_stream_ops = {
> .sendmsg = unix_stream_sendmsg,
> .recvmsg = unix_stream_recvmsg,
> .mmap = sock_no_mmap,
> - .sendpage = sock_no_sendpage,
> + .sendpage = unix_stream_sendpage,
> .set_peek_off = unix_set_peek_off,
> };
>
> @@ -1720,6 +1722,107 @@ out_err:
> return sent ? : err;
> }
>
> +static ssize_t unix_stream_sendpage(struct socket *socket, struct page *page,
> + int offset, size_t size, int flags)
> +{
> + int err;
> + bool send_sigpipe;
> + struct sock *sk, *other;
> + struct sk_buff *skb, *newskb, *tail;
> +
> + err = 0;
> + tail = NULL;
> + newskb = NULL;
> + sk = socket->sk;
> + send_sigpipe = true;
> +
> + if (flags & MSG_OOB)
> + return -EOPNOTSUPP;
> +
> + other = unix_peer(sk);
> + if (!other || sk->sk_state != TCP_ESTABLISHED)
> + return -ENOTCONN;
> +
> + if (false) {
> +alloc_skb:
> + unix_state_unlock(other);
> + mutex_unlock(&unix_sk(other)->readlock);
> + newskb = sock_alloc_send_pskb(sk, 0, 0, flags & MSG_DONTWAIT,
> + &err, 0);
> + if (!newskb)
> + return err;
> + }
> +
> + /* we must acquire readlock as we modify already present
> + * skbs in the sk_receive_queue and mess with skb->len
> + */
> + err = mutex_lock_interruptible(&unix_sk(other)->readlock);
> + if (err) {
> + err = flags & MSG_DONTWAIT ? -EAGAIN : -ERESTARTSYS;
> + send_sigpipe = false;
> + goto err;
> + }
> +
> + if (sk->sk_shutdown & SEND_SHUTDOWN) {
> + err = -EPIPE;
> + goto err_unlock;
> + }
> +
> + unix_state_lock(other);
> +
> + if (sock_flag(other, SOCK_DEAD) ||
> + other->sk_shutdown & RCV_SHUTDOWN) {
> + err = -EPIPE;
> + goto err_state_unlock;
> + }
> +
> + skb = skb_peek_tail(&other->sk_receive_queue);
> + if (tail && tail == skb) {
> + skb = newskb;
> + } else if (!skb) {
> + if (newskb)
> + skb = newskb;
> + else
> + goto alloc_skb;
> + } else if (newskb) {
> + /* this is fast path, we don't necessarily need to
> + * call to kfree_skb even though with newskb == NULL
> + * this - does no harm
> + */
> + consume_skb(newskb);
> + }
> +
> + if (skb_append_pagefrags(skb, page, offset, size)) {
> + tail = skb;
> + goto alloc_skb;
> + }
> +
> + skb->len += size;
> + skb->data_len += size;
> + skb->truesize += size;
> + atomic_add(size, &sk->sk_wmem_alloc);
> +
> + if (newskb)
> + skb_queue_tail(&other->sk_receive_queue, newskb);
Are you sure we need the skb_queue_tail() here (taking spinlock) ?
This would tell us there might be a possible race.
A comment would be nice eventually.
* Re: [PATCH net-next 3/4] net: make skb_splice_bits more configureable
2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
@ 2015-05-20 23:43 ` Eric Dumazet
0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2015-05-20 23:43 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> Prepare skb_splice_bits to be able to deal with AF_UNIX sockets.
>
> AF_UNIX sockets don't use lock_sock/release_sock and thus we have to
> use a callback to make the locking and unlocking configureable.
>
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
> include/linux/skbuff.h | 11 +++++++++--
> net/core/skbuff.c | 45 ++++++++++++++++++++++++++++-----------------
> net/ipv4/tcp.c | 5 +++--
> 3 files changed, 40 insertions(+), 21 deletions(-)
Acked-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH net-next 2/4] net: af_unix: implement stream sendpage support
2015-05-20 23:21 ` Eric Dumazet
@ 2015-05-20 23:47 ` Hannes Frederic Sowa
0 siblings, 0 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 23:47 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Thu, May 21, 2015, at 01:21, Eric Dumazet wrote:
> On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> > This patch implements sendpage support for AF_UNIX SOCK_STREAM
> > +
> > + if (newskb)
> > + skb_queue_tail(&other->sk_receive_queue, newskb);
>
> Are you sure we need skb_queue_tail() here (which takes the queue spinlock)?
>
> Using it suggests there might be a possible race.
>
> If so, a comment would be nice.
Hmm, at first sight, I think we can change this to __skb_queue_tail.
sendpage takes both the state_lock and the readlock mutex and thus blocks
out both recvmsg and sendmsg. unix_stream_connect is also serialized by
the state_lock.
I guess I used it because of unix_stream_sendmsg, where it is actually
necessary, as recvmsg unlinks skbs without holding the state_lock and
sendmsg doesn't hold the reader mutex.
Thanks for the hint!
Bye,
Hannes
* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
2015-05-20 20:59 ` Cong Wang
@ 2015-05-20 23:50 ` Eric Dumazet
2015-05-20 23:57 ` Hannes Frederic Sowa
1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2015-05-20 23:50 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: netdev
On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
> +
> +static int unix_stream_splice_actor(struct sk_buff *skb,
> + int skip, int chunk,
> + struct unix_stream_read_state *state)
> +{
> + return skb_splice_bits(skb, state->socket->sk,
> + UNIXCB(skb).consumed + skip,
> + state->pipe, chunk, state->splice_flags,
> + skb_unix_socket_splice);
> +}
I am not sure you added EXPORT_SYMBOL(skb_splice_bits) in any of the
patches?
Make sure CONFIG_UNIX=m still works.
Thanks.
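The actor quoted above takes `(skb, skip, chunk, state)` and reads from `UNIXCB(skb).consumed + skip`, which suggests one generic stream read loop serving several callers: a recvmsg-style actor copies the bytes, while a splice-style actor hands the same range to skb_splice_bits(). The model below is a sketch under that assumption only. `struct qbuf`, `struct read_state`, `read_actor_t`, `copy_actor()` and `read_generic()` are hypothetical, and treating `skip` as a peek offset that leaves `consumed` untouched is an assumption, not a description of af_unix.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical queued buffer with a `consumed` cursor, loosely modeled on
 * UNIXCB(skb).consumed. */
struct qbuf { const char *data; size_t len; size_t consumed; };

/* Hypothetical per-call state, loosely modeled on unix_stream_read_state. */
struct read_state { char *dst; size_t copied; };

/* Actor: deliver `chunk` bytes starting `skip` bytes past `consumed`. */
typedef size_t (*read_actor_t)(struct qbuf *b, size_t skip, size_t chunk,
                               struct read_state *st);

/* recvmsg-style actor: copy into a flat buffer. A splice-style actor would
 * pass the same byte range to a pipe instead. */
static size_t copy_actor(struct qbuf *b, size_t skip, size_t chunk,
                         struct read_state *st)
{
    assert(chunk <= b->len - b->consumed - skip);
    memcpy(st->dst + st->copied, b->data + b->consumed + skip, chunk);
    return chunk;
}

/* Generic loop: walks the queue, lets the actor move the bytes, and only
 * advances `consumed` when not peeking. `skip` starts at the caller's peek
 * offset and is carried across buffers until it is used up. */
static size_t read_generic(struct qbuf *q, size_t n, size_t want, size_t skip,
                           int peek, read_actor_t actor, struct read_state *st)
{
    for (size_t i = 0; i < n && st->copied < want; i++) {
        struct qbuf *b = &q[i];
        size_t avail = b->len - b->consumed;

        if (skip >= avail) {        /* offset lies past this buffer */
            skip -= avail;
            continue;
        }
        size_t chunk = avail - skip;
        if (chunk > want - st->copied)
            chunk = want - st->copied;
        size_t done = actor(b, skip, chunk, st);
        st->copied += done;
        if (!peek)
            b->consumed += done;
        skip = 0;
    }
    return st->copied;
}
```

Keeping the queue walk and cursor bookkeeping in one loop means only the actor differs between recvmsg and splice.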
* Re: [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets
2015-05-20 23:50 ` Eric Dumazet
@ 2015-05-20 23:57 ` Hannes Frederic Sowa
0 siblings, 0 replies; 14+ messages in thread
From: Hannes Frederic Sowa @ 2015-05-20 23:57 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On Thu, May 21, 2015, at 01:50, Eric Dumazet wrote:
> On Wed, 2015-05-20 at 17:35 +0200, Hannes Frederic Sowa wrote:
>
> > +
> > +static int unix_stream_splice_actor(struct sk_buff *skb,
> > + int skip, int chunk,
> > + struct unix_stream_read_state *state)
> > +{
> > + return skb_splice_bits(skb, state->socket->sk,
> > + UNIXCB(skb).consumed + skip,
> > + state->pipe, chunk, state->splice_flags,
> > + skb_unix_socket_splice);
> > +}
>
> I am not sure you added EXPORT_SYMBOL(skb_splice_bits) in any of the
> patches?
>
> Make sure CONFIG_UNIX=m still works.
I didn't. Thanks, I will test that.
Bye,
Hannes
Thread overview: 14+ messages
2015-05-20 15:35 [PATCH net-next 0/4] net: af_unix: zerocopy stream bits Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 1/4] net: skbuff: add skb_append_pagefrags and use it Hannes Frederic Sowa
2015-05-20 17:51 ` Cong Wang
2015-05-20 15:35 ` [PATCH net-next 2/4] net: af_unix: implement stream sendpage support Hannes Frederic Sowa
2015-05-20 18:40 ` Cong Wang
2015-05-20 23:21 ` Eric Dumazet
2015-05-20 23:47 ` Hannes Frederic Sowa
2015-05-20 15:35 ` [PATCH net-next 3/4] net: make skb_splice_bits more configureable Hannes Frederic Sowa
2015-05-20 23:43 ` Eric Dumazet
2015-05-20 15:35 ` [PATCH net-next 4/4] net: af_unix: implement splice for stream af_unix sockets Hannes Frederic Sowa
2015-05-20 20:59 ` Cong Wang
2015-05-20 21:47 ` Hannes Frederic Sowa
2015-05-20 23:50 ` Eric Dumazet
2015-05-20 23:57 ` Hannes Frederic Sowa