All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/2] udp msg_zerocopy
@ 2018-11-24 19:24 Willem de Bruijn
  2018-11-24 19:24 ` [PATCH net-next 1/2] udp: msg_zerocopy Willem de Bruijn
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Willem de Bruijn @ 2018-11-24 19:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Enable MSG_ZEROCOPY for udp sockets

Patch 1/2 is the main patch, a rebase from RFC patch
  http://patchwork.ozlabs.org/patch/899630/
  more details in the patch commit message

Patch 2/2 runs the already existing udp zerocopy tests
  as part of kselftest

See also recent Linux Plumbers presentation
  https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf


Willem de Bruijn (2):
  udp: msg_zerocopy
  tools/testing: extend zerocopy tests to udp

 include/linux/skbuff.h                      |  1 +
 net/core/skbuff.c                           | 15 +++++++++++++-
 net/core/sock.c                             |  5 ++++-
 net/ipv4/ip_output.c                        | 22 ++++++++++++++++++++-
 net/ipv6/ip6_output.c                       | 22 ++++++++++++++++++++-
 tools/testing/selftests/net/msg_zerocopy.c  |  3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh |  2 ++
 tools/testing/selftests/net/udpgso_bench.sh |  3 +++
 8 files changed, 68 insertions(+), 5 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH net-next 1/2] udp: msg_zerocopy
  2018-11-24 19:24 [PATCH net-next 0/2] udp msg_zerocopy Willem de Bruijn
@ 2018-11-24 19:24 ` Willem de Bruijn
  2018-11-25  1:58   ` David Miller
  2018-11-24 19:24 ` [PATCH net-next 2/2] tools/testing: extend zerocopy tests to udp Willem de Bruijn
  2018-11-25  1:57 ` [PATCH net-next 0/2] udp msg_zerocopy David Miller
  2 siblings, 1 reply; 6+ messages in thread
From: Willem de Bruijn @ 2018-11-24 19:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweights the cost of
generating a completion notification.

The datagram implementation has initial refcnt of 0, instead of 1 for
TCP. The tcp_sendmsg_locked function has to hold a reference on
uarg independent those held by the skbs it generates, because the
skbs can be sent and freed in the function main loop. This is not
needed for other sockets.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

    ipv4 udp -t 1
    tx=191312 (11938 MB) txc=0 zc=n
    rx=191312 (11938 MB)
    ipv4 udp -z -t 1
    tx=304507 (19002 MB) txc=304507 zc=y
    rx=304507 (19002 MB)
    ok
    ipv6 udp -t 1
    tx=174485 (10888 MB) txc=0 zc=n
    rx=174485 (10888 MB)
    ipv6 udp -z -t 1
    tx=294801 (18396 MB) txc=294801 zc=y
    rx=294801 (18396 MB)
    ok

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c      | 15 ++++++++++++++-
 net/core/sock.c        |  5 ++++-
 net/ipv4/ip_output.c   | 22 +++++++++++++++++++++-
 net/ipv6/ip6_output.c  | 22 +++++++++++++++++++++-
 5 files changed, 61 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a2e8297a5b00..7c1a649d229b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -485,6 +485,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9a8a72cefe9b..b7f8894fbcb4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -957,7 +957,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	refcount_set(&uarg->refcnt, 1);
+	refcount_set(&uarg->refcnt, sk->sk_type == SOCK_STREAM ? 1 : 0);
 	sock_hold(sk);
 
 	return uarg;
@@ -1097,6 +1097,13 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
 		atomic_dec(&sk->sk_zckey);
 		uarg->len--;
 
+		/* Stream socks hold a ref for the syscall, as skbs can be sent
+		 * and freed inside the loop, dropping refcnt to 0 inbetween.
+		 * Datagrams do not need this, but sock_zerocopy_put expects it.
+		 */
+		if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
+			refcount_set(&uarg->refcnt, 1);
+
 		sock_zerocopy_put(uarg);
 	}
 }
@@ -1105,6 +1112,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 				   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 			     struct msghdr *msg, int len,
 			     struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..f5bb89785e47 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1018,7 +1018,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 
 	case SO_ZEROCOPY:
 		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-			if (sk->sk_protocol != IPPROTO_TCP)
+			if (!((sk->sk_type == SOCK_STREAM &&
+			       sk->sk_protocol == IPPROTO_TCP) ||
+			      (sk->sk_type == SOCK_DGRAM &&
+			       sk->sk_protocol == IPPROTO_UDP)))
 				ret = -ENOTSUPP;
 		} else if (sk->sk_family != PF_RDS) {
 			ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c09219e7f230..2d1e4b67a1e8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -868,6 +868,7 @@ static int __ip_append_data(struct sock *sk,
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sk_buff *skb;
+	struct ubuf_info *uarg = NULL;
 
 	struct ip_options *opt = cork->opt;
 	int hh_len;
@@ -916,6 +917,19 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+		if (rt->dst.dev->features & NETIF_F_SG &&
+		    csummode == CHECKSUM_PARTIAL) {
+			paged = true;
+		} else {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	cork->length += length;
 
 	/* So, what's going on in the loop below?
@@ -1005,6 +1019,7 @@ static int __ip_append_data(struct sock *sk,
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
 			tskey = 0;
+			skb_zcopy_set(skb, uarg);
 
 			/*
 			 *	Find where to start putting bytes.
@@ -1067,7 +1082,7 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1097,6 +1112,10 @@ static int __ip_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			wmem_alloc_delta += copy;
+		} else {
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
+			if (err < 0)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1109,6 +1128,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 89e0d5118afe..8a49ff985ed8 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1245,6 +1245,7 @@ static int __ip6_append_data(struct sock *sk,
 {
 	struct sk_buff *skb, *skb_prev = NULL;
 	unsigned int maxfraglen, fragheaderlen, mtu, orig_mtu, pmtu;
+	struct ubuf_info *uarg = NULL;
 	int exthdrlen = 0;
 	int dst_exthdrlen = 0;
 	int hh_len;
@@ -1322,6 +1323,19 @@ static int __ip6_append_data(struct sock *sk,
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;
 
+	if (flags & MSG_ZEROCOPY && length) {
+		uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+		if (!uarg)
+			return -ENOBUFS;
+		if (rt->dst.dev->features & NETIF_F_SG &&
+		    csummode == CHECKSUM_PARTIAL) {
+			paged = true;
+		} else {
+			uarg->zerocopy = 0;
+			skb_zcopy_set(skb, uarg);
+		}
+	}
+
 	/*
 	 * Let's try using as much space as possible.
 	 * Use MTU if total length of the message fits into the MTU.
@@ -1443,6 +1457,7 @@ static int __ip6_append_data(struct sock *sk,
 			skb_shinfo(skb)->tx_flags = cork->tx_flags;
 			cork->tx_flags = 0;
 			skb_shinfo(skb)->tskey = tskey;
+			skb_zcopy_set(skb, uarg);
 			tskey = 0;
 
 			/*
@@ -1505,7 +1520,7 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else {
+		} else if (!uarg || !uarg->zerocopy) {
 			int i = skb_shinfo(skb)->nr_frags;
 
 			err = -ENOMEM;
@@ -1535,6 +1550,10 @@ static int __ip6_append_data(struct sock *sk,
 			skb->data_len += copy;
 			skb->truesize += copy;
 			wmem_alloc_delta += copy;
+		} else {
+			err = skb_zerocopy_iter_dgram(skb, from, copy);
+			if (err < 0)
+				goto error;
 		}
 		offset += copy;
 		length -= copy;
@@ -1547,6 +1566,7 @@ static int __ip6_append_data(struct sock *sk,
 error_efault:
 	err = -EFAULT;
 error:
+	sock_zerocopy_put_abort(uarg);
 	cork->length -= length;
 	IP6_INC_STATS(sock_net(sk), rt->rt6i_idev, IPSTATS_MIB_OUTDISCARDS);
 	refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH net-next 2/2] tools/testing: extend zerocopy tests to udp
  2018-11-24 19:24 [PATCH net-next 0/2] udp msg_zerocopy Willem de Bruijn
  2018-11-24 19:24 ` [PATCH net-next 1/2] udp: msg_zerocopy Willem de Bruijn
@ 2018-11-24 19:24 ` Willem de Bruijn
  2018-11-25  1:57 ` [PATCH net-next 0/2] udp msg_zerocopy David Miller
  2 siblings, 0 replies; 6+ messages in thread
From: Willem de Bruijn @ 2018-11-24 19:24 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn

From: Willem de Bruijn <willemb@google.com>

Both msg_zerocopy and udpgso_bench have udp zerocopy variants.
Exercise these as part of the standard kselftest run.

With udp, msg_zerocopy has no control channel. Ensure that the
receiver exits after the sender by accounting for the initial
delay in starting them (in msg_zerocopy.sh).

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
 tools/testing/selftests/net/msg_zerocopy.c  | 3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh | 2 ++
 tools/testing/selftests/net/udpgso_bench.sh | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 406cc70c571d..4b02933cab8a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -651,12 +651,13 @@ static void do_flush_datagram(int fd, int type)
 
 static void do_rx(int domain, int type, int protocol)
 {
+	const int cfg_receiver_wait_ms = 400;
 	uint64_t tstop;
 	int fd;
 
 	fd = do_setup_rx(domain, type, protocol);
 
-	tstop = gettimeofday_ms() + cfg_runtime_ms;
+	tstop = gettimeofday_ms() + cfg_runtime_ms + cfg_receiver_wait_ms;
 	do {
 		if (type == SOCK_STREAM)
 			do_flush_tcp(fd);
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh
index c43c6debda06..825ffec85cea 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -25,6 +25,8 @@ readonly path_sysctl_mem="net.core.optmem_max"
 if [[ "$#" -eq "0" ]]; then
 	$0 4 tcp -t 1
 	$0 6 tcp -t 1
+	$0 4 udp -t 1
+	$0 6 udp -t 1
 	echo "OK. All tests passed"
 	exit 0
 fi
diff --git a/tools/testing/selftests/net/udpgso_bench.sh b/tools/testing/selftests/net/udpgso_bench.sh
index 0f0628613f81..5670a9ffd8eb 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -35,6 +35,9 @@ run_udp() {
 
 	echo "udp gso"
 	run_in_netns ${args} -S 0
+
+	echo "udp gso zerocopy"
+	run_in_netns ${args} -S 0 -z
 }
 
 run_tcp() {
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next 0/2] udp msg_zerocopy
  2018-11-24 19:24 [PATCH net-next 0/2] udp msg_zerocopy Willem de Bruijn
  2018-11-24 19:24 ` [PATCH net-next 1/2] udp: msg_zerocopy Willem de Bruijn
  2018-11-24 19:24 ` [PATCH net-next 2/2] tools/testing: extend zerocopy tests to udp Willem de Bruijn
@ 2018-11-25  1:57 ` David Miller
  2 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2018-11-25  1:57 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Sat, 24 Nov 2018 14:24:29 -0500

> From: Willem de Bruijn <willemb@google.com>
> 
> Enable MSG_ZEROCOPY for udp sockets
> 
> Patch 1/2 is the main patch, a rebase from RFC patch
>   http://patchwork.ozlabs.org/patch/899630/
>   more details in the patch commit message
> 
> Patch 2/2 runs the already existing udp zerocopy tests
>   as part of kselftest
> 
> See also recent Linux Plumbers presentation
>   https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf

Series looks good, just one coding style to fix up, I'll reply to patch #1.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next 1/2] udp: msg_zerocopy
  2018-11-24 19:24 ` [PATCH net-next 1/2] udp: msg_zerocopy Willem de Bruijn
@ 2018-11-25  1:58   ` David Miller
  2018-11-25  2:13     ` Willem de Bruijn
  0 siblings, 1 reply; 6+ messages in thread
From: David Miller @ 2018-11-25  1:58 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: netdev, willemb

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Sat, 24 Nov 2018 14:24:30 -0500

> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index c09219e7f230..2d1e4b67a1e8 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -868,6 +868,7 @@ static int __ip_append_data(struct sock *sk,
>  {
>  	struct inet_sock *inet = inet_sk(sk);
>  	struct sk_buff *skb;
> +	struct ubuf_info *uarg = NULL;

Reverse christmas tree, please.

Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH net-next 1/2] udp: msg_zerocopy
  2018-11-25  1:58   ` David Miller
@ 2018-11-25  2:13     ` Willem de Bruijn
  0 siblings, 0 replies; 6+ messages in thread
From: Willem de Bruijn @ 2018-11-25  2:13 UTC (permalink / raw)
  To: David Miller; +Cc: Network Development, Willem de Bruijn

On Sat, Nov 24, 2018 at 8:58 PM David Miller <davem@davemloft.net> wrote:
>
> From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Date: Sat, 24 Nov 2018 14:24:30 -0500
>
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index c09219e7f230..2d1e4b67a1e8 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -868,6 +868,7 @@ static int __ip_append_data(struct sock *sk,
> >  {
> >       struct inet_sock *inet = inet_sk(sk);
> >       struct sk_buff *skb;
> > +     struct ubuf_info *uarg = NULL;
>
> Reverse christmas tree, please.

Oh right, sorry.

Will wait with v2 for a day or two, in case anyone else has comments.
Thanks for the quick review!

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2018-11-25 13:03 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-24 19:24 [PATCH net-next 0/2] udp msg_zerocopy Willem de Bruijn
2018-11-24 19:24 ` [PATCH net-next 1/2] udp: msg_zerocopy Willem de Bruijn
2018-11-25  1:58   ` David Miller
2018-11-25  2:13     ` Willem de Bruijn
2018-11-24 19:24 ` [PATCH net-next 2/2] tools/testing: extend zerocopy tests to udp Willem de Bruijn
2018-11-25  1:57 ` [PATCH net-next 0/2] udp msg_zerocopy David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.