* [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set
From: Jason Baron @ 2016-06-20 21:23 UTC
  To: eric.dumazet, davem; +Cc: netdev

From: Jason Baron <jbaron@akamai.com>

When SO_SNDBUF is set and we are under tcp memory pressure, the effective
write buffer space can be much lower than what was set using SO_SNDBUF. For
example, we may have set the buffer to 100KB, but we may only be able to
write 10KB. In this scenario poll()/select()/epoll() will continuously
return POLLOUT, followed by -EAGAIN from write(), resulting in a tight
loop. Note that epoll in edge-triggered mode does not have this issue,
since it only triggers once data has been ack'd. There is no issue here
when SO_SNDBUF is not set, since the tcp layer will auto-tune sk->sndbuf.
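
For example, a non-blocking sender ends up spinning like this
(illustrative fragment only; fd, buf, and len are assumed to be set up
elsewhere):

/* needs <sys/socket.h>, <poll.h>, <unistd.h>, <errno.h> */
struct pollfd pfd = { .fd = fd, .events = POLLOUT };
int sndbuf = 100 * 1024;

setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
while (len > 0) {
	poll(&pfd, 1, -1);		/* immediately reports POLLOUT */
	ssize_t n = write(fd, buf, len);
	if (n < 0) {
		if (errno == EAGAIN)
			continue;	/* spins right back into poll() */
		break;			/* real error */
	}
	buf += n;
	len -= n;
}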

Introduce a new socket flag, SOCK_SHORT_WRITE, to mark the socket when we
have a short write due to memory pressure. By testing for SOCK_SHORT_WRITE
in tcp_poll(), we then hold off POLLOUT until a non-zero amount of data has
been ack'd. In a previous approach:
http://marc.info/?l=linux-netdev&m=143930393211782&w=2, I had introduced a
new field in 'struct sock' to solve this issue, but it's undesirable to add
bloat to 'struct sock'. We could also address this issue by waiting for the
buffer to become completely empty, but that may reduce throughput, since
the write buffer would sit empty while waiting for subsequent writes. This
change brings poll()/select() and level-triggered epoll in line with the
current epoll edge-trigger behavior.

We guarantee that SOCK_SHORT_WRITE will eventually be cleared, since
whenever we set it we make sure that sk_wmem_queued is non-zero and that
SOCK_NOSPACE is set as well (the latter in sk_stream_wait_memory()).

I tested this patch using 10,000 sockets, with SO_SNDBUF set on the server
side to induce tcp memory pressure. A single server thread reduced its cpu
usage from 100% to 19%, while maintaining the same level of throughput.

Signed-off-by: Jason Baron <jbaron@akamai.com>
---
 include/net/sock.h |  6 ++++++
 net/core/stream.c  |  1 +
 net/ipv4/tcp.c     | 26 +++++++++++++++++++-------
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 649d2a8c17fc..616e8e1a5d5d 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -741,6 +741,7 @@ enum sock_flags {
 	SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
 	SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
 	SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
+	SOCK_SHORT_WRITE, /* Couldn't fill sndbuf due to memory pressure */
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
@@ -1114,6 +1115,11 @@ static inline bool sk_stream_is_writeable(const struct sock *sk)
 	       sk_stream_memory_free(sk);
 }
 
+static inline void sk_set_short_write(struct sock *sk)
+{
+	if (sk->sk_wmem_queued > 0 && sk_stream_is_writeable(sk))
+		sock_set_flag(sk, SOCK_SHORT_WRITE);
+}
 
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
diff --git a/net/core/stream.c b/net/core/stream.c
index 159516a11b7e..ead768b1e95d 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,6 +32,7 @@ void sk_stream_write_space(struct sock *sk)
 
 	if (sk_stream_is_writeable(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
+		sock_reset_flag(sk, SOCK_SHORT_WRITE);
 
 		rcu_read_lock();
 		wq = rcu_dereference(sk->sk_wq);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5c7ed147449c..254c7bb6d3d5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -517,7 +517,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 			mask |= POLLIN | POLLRDNORM;
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-			if (sk_stream_is_writeable(sk)) {
+			if (!sock_flag(sk, SOCK_SHORT_WRITE) &&
+			    sk_stream_is_writeable(sk)) {
 				mask |= POLLOUT | POLLWRNORM;
 			} else {  /* send SIGIO later */
 				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -529,7 +530,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 				 * pairs with the input side.
 				 */
 				smp_mb__after_atomic();
-				if (sk_stream_is_writeable(sk))
+				if (!sock_flag(sk, SOCK_SHORT_WRITE) &&
+				    sk_stream_is_writeable(sk))
 					mask |= POLLOUT | POLLWRNORM;
 			}
 		} else
@@ -917,8 +919,10 @@ new_segment:
 
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
-			if (!skb)
+			if (!skb) {
+				sk_set_short_write(sk);
 				goto wait_for_memory;
+			}
 
 			skb_entail(sk, skb);
 			copy = size_goal;
@@ -933,8 +937,10 @@ new_segment:
 			tcp_mark_push(tp, skb);
 			goto new_segment;
 		}
-		if (!sk_wmem_schedule(sk, copy))
+		if (!sk_wmem_schedule(sk, copy)) {
+			sk_set_short_write(sk);
 			goto wait_for_memory;
+		}
 
 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
@@ -1176,8 +1182,10 @@ new_segment:
 						  select_size(sk, sg),
 						  sk->sk_allocation,
 						  skb_queue_empty(&sk->sk_write_queue));
-			if (!skb)
+			if (!skb) {
+				sk_set_short_write(sk);
 				goto wait_for_memory;
+			}
 
 			process_backlog = true;
 			/*
@@ -1214,8 +1222,10 @@ new_segment:
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
 
-			if (!sk_page_frag_refill(sk, pfrag))
+			if (!sk_page_frag_refill(sk, pfrag)) {
+				sk_set_short_write(sk);
 				goto wait_for_memory;
+			}
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
@@ -1228,8 +1238,10 @@ new_segment:
 
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);
 
-			if (!sk_wmem_schedule(sk, copy))
+			if (!sk_wmem_schedule(sk, copy)) {
+				sk_set_short_write(sk);
 				goto wait_for_memory;
+			}
 
 			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
 						       pfrag->page,
-- 
2.6.1


* Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set
From: Eric Dumazet @ 2016-06-20 22:29 UTC
  To: Jason Baron; +Cc: davem, netdev

On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote:
> [...]

When was this bug added, and by which commit?

This looks serious, but your patch looks way too complex.

Constantly clearing this SOCK_SHORT_WRITE bit in sk_stream_write_space()
is expensive, because it dirties sk_flags, which is not supposed to be
written often. This field is located in a read-mostly cache line.

I believe my suggestion was not to add a per-socket bit, but to use a
global state like tcp_memory_pressure (or reuse it).

I.e., when under global memory pressure, tcp_poll() should apply a
strategy of giving POLLOUT only on sockets having less than 4K
(tcp_wmem[0]) of memory consumed in their write queue.
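
Something like this, maybe (untested sketch; tcp_stream_memory_ok() is a
hypothetical helper, while tcp_memory_pressure and sysctl_tcp_wmem already
exist in net/ipv4/tcp.c):

/* hypothetical: under global pressure, only treat the socket as
 * writeable once its write queue drains below tcp_wmem[0] (4K by
 * default).
 */
static bool tcp_stream_memory_ok(const struct sock *sk)
{
	return !tcp_memory_pressure ||
	       sk->sk_wmem_queued < sysctl_tcp_wmem[0];
}

and then in tcp_poll():

	if (sk_stream_is_writeable(sk) && tcp_stream_memory_ok(sk))
		mask |= POLLOUT | POLLWRNORM;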


* Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set
From: Jason Baron @ 2016-06-21 15:24 UTC
  To: Eric Dumazet; +Cc: davem, netdev



On 06/20/2016 06:29 PM, Eric Dumazet wrote:
> On Mon, 2016-06-20 at 17:23 -0400, Jason Baron wrote:
>> [...]
>
> When was this bug added, and by which commit?
> 

I think we have always had this behavior; I've seen it at least back
to 3.10. It requires SO_SNDBUF, lots of sockets, and memory pressure...
That said, it's quite easy to reproduce.

> This looks serious, but your patch looks way too complex.
> 
> Constantly clearing this SOCK_SHORT_WRITE bit in sk_stream_write_space()
> is expensive, because it dirties sk_flags, which is not supposed to be
> written often. This field is located in a read-mostly cache line.
> 
> I believe my suggestion was not to add a per-socket bit, but to use a
> global state like tcp_memory_pressure (or reuse it).
> 
> I.e., when under global memory pressure, tcp_poll() should apply a
> strategy of giving POLLOUT only on sockets having less than 4K
> (tcp_wmem[0]) of memory consumed in their write queue.

OK, I think you're saying to just add:

if (sk_under_memory_pressure(sk) && sk->sk_wmem_queued > tcp_wmem[0])
	return false;

to the beginning of sk_stream_is_writeable().

My concern is that it then allows the sndbuf to drain almost completely
before we are triggered to write again. The patch I presented attempts to
write again as soon as any new space is available for the flow.

I also didn't necessarily want to affect all flows. If SO_SNDBUF is not
set, we generally don't need to change the wakeup behavior, because the
sndbuf shrinks to the point where the normal write wakeup does not cause
these tight poll()/write() loops in which we make no progress. We also
don't need to change the behavior if we are under memory pressure but have
written enough (filled more than 2/3 of the buffer) that the normal wakeups
would again work fine. So we could incorporate some of this logic into the
above check, but it becomes more complex.

In my patch, I could replace the:

sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);

in tcp_check_space() with something like:

sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) | (1UL << SOCK_SHORT_WRITE));

Since we are already writing to sk_flags there, this should have very
minimal overhead. We could then remove the clear in sk_stream_write_space().
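
That is, roughly (untested; this is the current tcp_check_space() from
net/ipv4/tcp_input.c with the single-store clear substituted in):

static void tcp_check_space(struct sock *sk)
{
	if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
		/* clear both flags with one store, since we dirty
		 * sk_flags here anyway
		 */
		sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) |
				  (1UL << SOCK_SHORT_WRITE));
		/* pairs with tcp_poll() */
		smp_mb__after_atomic();
		if (sk->sk_socket &&
		    test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
			tcp_new_space(sk);
	}
}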

Thanks,

-Jason


* Re: [PATCH net-next] tcp: reduce cpu usage when SO_SNDBUF is set
From: Eric Dumazet @ 2016-06-21 23:48 UTC
  To: Jason Baron; +Cc: davem, netdev

On Tue, 2016-06-21 at 11:24 -0400, Jason Baron wrote:

> in tcp_check_space() with something like:
> 
> sk->sk_flags &= ~((1UL << SOCK_QUEUE_SHRUNK) | (1UL << SOCK_SHORT_WRITE));
> 
> Since we are already writing to sk_flags there, this should have very
> minimal overhead. We could then remove the clear in sk_stream_write_space().

Interesting. In 3c7151275c0c9 you added an smp_mb__after_atomic() in
tcp_check_space(), but there is no atomic operation to begin with ;)
