* [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit
@ 2020-09-14 21:52 Soheil Hassas Yeganeh
  2020-09-14 21:52 ` [PATCH net-next 2/2] tcp: schedule EPOLLOUT after a partial sendmsg Soheil Hassas Yeganeh
  2020-09-14 23:58 ` [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit David Miller
  0 siblings, 2 replies; 4+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-09-14 21:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: edumazet, Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

If there is any event available on the TCP socket, tcp_poll() is
called to retrieve all of them.  In tcp_poll(), we call
sk_stream_is_writeable(), which returns true as long as we are at least
one byte below notsent_lowat.  This results in quite a few spurious
EPOLLOUT wakeups and frequent tiny sendmsg() calls.

Similar to sk_stream_write_space(), use __sk_stream_is_writeable()
with a wake value of 1, so that EPOLLOUT is set only once at least
half of the space is available for write.
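
For context, the wake argument works roughly as follows (a condensed
sketch of tcp_stream_memory_free() from include/net/tcp.h around this
kernel version; the exact code may differ):

	/* Condensed sketch: with wake == 1, the not-yet-sent bytes are
	 * doubled before being compared against notsent_lowat, so the
	 * socket only counts as writeable once notsent_bytes has dropped
	 * below half of the limit.
	 */
	static inline bool tcp_stream_memory_free(const struct sock *sk, int wake)
	{
		const struct tcp_sock *tp = tcp_sk(sk);
		u32 notsent_bytes = READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_nxt);

		return (notsent_bytes << wake) < tcp_notsent_lowat(tp);
	}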

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d3781b6087cb..48c351804efc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -564,7 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 			mask |= EPOLLIN | EPOLLRDNORM;
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-			if (sk_stream_is_writeable(sk)) {
+			if (__sk_stream_is_writeable(sk, 1)) {
 				mask |= EPOLLOUT | EPOLLWRNORM;
 			} else {  /* send SIGIO later */
 				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -576,7 +576,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 				 * pairs with the input side.
 				 */
 				smp_mb__after_atomic();
-				if (sk_stream_is_writeable(sk))
+				if (__sk_stream_is_writeable(sk, 1))
 					mask |= EPOLLOUT | EPOLLWRNORM;
 			}
 		} else
-- 
2.28.0.618.gf4bc123cb7-goog



* [PATCH net-next 2/2] tcp: schedule EPOLLOUT after a partial sendmsg
  2020-09-14 21:52 [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit Soheil Hassas Yeganeh
@ 2020-09-14 21:52 ` Soheil Hassas Yeganeh
  2020-09-14 23:58   ` David Miller
  2020-09-14 23:58 ` [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit David Miller
  1 sibling, 1 reply; 4+ messages in thread
From: Soheil Hassas Yeganeh @ 2020-09-14 21:52 UTC (permalink / raw)
  To: davem, netdev; +Cc: edumazet, Soheil Hassas Yeganeh

From: Soheil Hassas Yeganeh <soheil@google.com>

For EPOLLET, applications must call sendmsg() until they get EAGAIN.
Otherwise, there is no guarantee that EPOLLOUT will be raised if the
send was cut short by a memory allocation failure.

As a result, on high-speed NICs, userspace observes multiple small
sendmsg() calls after a partial sendmsg() until EAGAIN, since TCP can
send 1-2 TSOs in between two sendmsg() syscalls:

// One large partial send due to memory allocation failure.
sendmsg(20MB)   = 2MB
// Many small sends until EAGAIN.
sendmsg(18MB)   = 64KB
sendmsg(17.9MB) = 128KB
sendmsg(17.8MB) = 64KB
...
sendmsg(...)    = EAGAIN
// At this point, userspace can assume an EPOLLOUT.
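
The EPOLLET contract above implies a write loop along these lines (a
hypothetical userspace sketch, not part of this patch; the helper name
and arguments are illustrative):

	#include <errno.h>
	#include <sys/types.h>
	#include <sys/socket.h>

	/* Hypothetical helper: queue as much of buf as possible on a
	 * non-blocking EPOLLET socket.  Returns the bytes accepted; on
	 * EAGAIN the caller must wait for the next EPOLLOUT before
	 * retrying the remainder.
	 */
	static size_t send_until_eagain(int fd, const char *buf, size_t len)
	{
		size_t off = 0;

		while (off < len) {
			ssize_t n = send(fd, buf + off, len - off, MSG_DONTWAIT);

			if (n > 0) {
				off += n;	/* partial send: keep calling */
				continue;
			}
			if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
				break;		/* safe to wait for EPOLLOUT */
			break;			/* real error: caller checks errno */
		}
		return off;
	}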

To fix this, set the SOCK_NOSPACE flag in all partial sendmsg
scenarios to guarantee that an EPOLLOUT is sent after a partial
sendmsg().

After this commit, userspace can assume that it will receive an
EPOLLOUT after the first partial sendmsg. This EPOLLOUT benefits from
the sk_stream_write_space() logic, which delays the EPOLLOUT until
significant space is available in the write queue.
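
For reference, a condensed sketch of that wakeup path (based on
sk_stream_write_space() in net/core/stream.c; fasync and shutdown
handling omitted, details may differ):

	/* Condensed sketch: clear SOCK_NOSPACE and raise EPOLLOUT only
	 * once at least half of the write space is available again
	 * (wake == 1 in __sk_stream_is_writeable()).
	 */
	void sk_stream_write_space(struct sock *sk)
	{
		struct socket *sock = sk->sk_socket;
		struct socket_wq *wq;

		if (__sk_stream_is_writeable(sk, 1) && sock) {
			clear_bit(SOCK_NOSPACE, &sock->flags);

			rcu_read_lock();
			wq = rcu_dereference(sk->sk_wq);
			if (skwq_has_sleeper(wq))
				wake_up_interruptible_poll(&wq->wait, EPOLLOUT |
							   EPOLLWRNORM | EPOLLWRBAND);
			rcu_read_unlock();
		}
	}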

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 48c351804efc..65057744fac8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1004,12 +1004,12 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		    !tcp_skb_can_collapse_to(skb)) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
-				goto wait_for_sndbuf;
+				goto wait_for_space;
 
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 					tcp_rtx_and_write_queues_empty(sk));
 			if (!skb)
-				goto wait_for_memory;
+				goto wait_for_space;
 
 #ifdef CONFIG_TLS_DEVICE
 			skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
@@ -1028,7 +1028,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			goto new_segment;
 		}
 		if (!sk_wmem_schedule(sk, copy))
-			goto wait_for_memory;
+			goto wait_for_space;
 
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
@@ -1069,9 +1069,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			tcp_push_one(sk, mss_now);
 		continue;
 
-wait_for_sndbuf:
+wait_for_space:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
 		tcp_push(sk, flags & ~MSG_MORE, mss_now,
 			 TCP_NAGLE_PUSH, size_goal);
 
@@ -1282,7 +1281,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 
 new_segment:
 			if (!sk_stream_memory_free(sk))
-				goto wait_for_sndbuf;
+				goto wait_for_space;
 
 			if (unlikely(process_backlog >= 16)) {
 				process_backlog = 0;
@@ -1293,7 +1292,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  first_skb);
 			if (!skb)
-				goto wait_for_memory;
+				goto wait_for_space;
 
 			process_backlog++;
 			skb->ip_summed = CHECKSUM_PARTIAL;
@@ -1326,7 +1325,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			struct page_frag *pfrag = sk_page_frag(sk);
 
 			if (!sk_page_frag_refill(sk, pfrag))
-				goto wait_for_memory;
+				goto wait_for_space;
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
@@ -1340,7 +1339,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
 			if (!sk_wmem_schedule(sk, copy))
-				goto wait_for_memory;
+				goto wait_for_space;
 
 			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
 						       pfrag->page,
@@ -1393,9 +1392,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			tcp_push_one(sk, mss_now);
 		continue;
 
-wait_for_sndbuf:
+wait_for_space:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
 		if (copied)
 			tcp_push(sk, flags & ~MSG_MORE, mss_now,
 				 TCP_NAGLE_PUSH, size_goal);
-- 
2.28.0.618.gf4bc123cb7-goog



* Re: [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit
  2020-09-14 21:52 [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit Soheil Hassas Yeganeh
  2020-09-14 21:52 ` [PATCH net-next 2/2] tcp: schedule EPOLLOUT after a partial sendmsg Soheil Hassas Yeganeh
@ 2020-09-14 23:58 ` David Miller
  1 sibling, 0 replies; 4+ messages in thread
From: David Miller @ 2020-09-14 23:58 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, edumazet, soheil

From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Mon, 14 Sep 2020 17:52:09 -0400

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> If there is any event available on the TCP socket, tcp_poll() is
> called to retrieve all of them.  In tcp_poll(), we call
> sk_stream_is_writeable(), which returns true as long as we are at least
> one byte below notsent_lowat.  This results in quite a few spurious
> EPOLLOUT wakeups and frequent tiny sendmsg() calls.
> 
> Similar to sk_stream_write_space(), use __sk_stream_is_writeable()
> with a wake value of 1, so that EPOLLOUT is set only once at least
> half of the space is available for write.
> 
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.


* Re: [PATCH net-next 2/2] tcp: schedule EPOLLOUT after a partial sendmsg
  2020-09-14 21:52 ` [PATCH net-next 2/2] tcp: schedule EPOLLOUT after a partial sendmsg Soheil Hassas Yeganeh
@ 2020-09-14 23:58   ` David Miller
  0 siblings, 0 replies; 4+ messages in thread
From: David Miller @ 2020-09-14 23:58 UTC (permalink / raw)
  To: soheil.kdev; +Cc: netdev, edumazet, soheil

From: Soheil Hassas Yeganeh <soheil.kdev@gmail.com>
Date: Mon, 14 Sep 2020 17:52:10 -0400

> From: Soheil Hassas Yeganeh <soheil@google.com>
> 
> For EPOLLET, applications must call sendmsg() until they get EAGAIN.
> Otherwise, there is no guarantee that EPOLLOUT will be raised if the
> send was cut short by a memory allocation failure.
> 
> As a result, on high-speed NICs, userspace observes multiple small
> sendmsg() calls after a partial sendmsg() until EAGAIN, since TCP can
> send 1-2 TSOs in between two sendmsg() syscalls:
> 
> // One large partial send due to memory allocation failure.
> sendmsg(20MB)   = 2MB
> // Many small sends until EAGAIN.
> sendmsg(18MB)   = 64KB
> sendmsg(17.9MB) = 128KB
> sendmsg(17.8MB) = 64KB
> ...
> sendmsg(...)    = EAGAIN
> // At this point, userspace can assume an EPOLLOUT.
> 
> To fix this, set the SOCK_NOSPACE flag in all partial sendmsg
> scenarios to guarantee that an EPOLLOUT is sent after a partial
> sendmsg().
> 
> After this commit, userspace can assume that it will receive an
> EPOLLOUT after the first partial sendmsg. This EPOLLOUT benefits from
> the sk_stream_write_space() logic, which delays the EPOLLOUT until
> significant space is available in the write queue.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Applied.

