* [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
@ 2013-07-23  3:27 Eric Dumazet
  2013-07-23  3:52 ` Hannes Frederic Sowa
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-07-23  3:27 UTC (permalink / raw)
  To: David Miller
  Cc: Rick Jones, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

From: Eric Dumazet <edumazet@google.com>

The idea of this patch is to add an optional limit on the number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

The TCP receiver might announce a big window, and TCP sender autotuning
might allow a large number of bytes in the write queue, but this brings
little performance benefit if a large part of this buffering is wasted:

The write queue needs to be large only to deal with a large BDP, not
necessarily to cope with scheduling delays (incoming ACKs make room
for the application to queue more bytes).

For most workloads, using a value of 128 KB or less is enough to give
applications time to react to POLLOUT events
(or to be woken up in a blocking sendmsg()).

This patch adds two ways to set the limit:

1) The per-socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using the TCP_NOTSENT_LOWAT socket option (or setting a zero value).
The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no effect.


This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat.

Note this might increase the number of sendmsg()/sendfile() calls
when using non-blocking sockets,
and increase the number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as:
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128 KB has no bad effect on the throughput or CPU usage
of a single flow, although there is an increase in context switches.

A bonus is that we hold the socket lock for a shorter amount
of time, which should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                             %     Method %      Method                          
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB  

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches                                            

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                             %     Method %      Method                          
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB  

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches                                            

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
v3: use the sk_stream_is_writeable() helper and fix the too many wakeup issue
v2: title/changelog fix (TCP_NOSENT_LOWAT -> TCP_NOTSENT_LOWAT)

 Documentation/networking/ip-sysctl.txt |   13 +++++++++++++
 include/linux/tcp.h                    |    1 +
 include/net/sock.h                     |   19 +++++++++++++------
 include/net/tcp.h                      |   14 ++++++++++++++
 include/uapi/linux/tcp.h               |    1 +
 net/ipv4/sysctl_net_ipv4.c             |    7 +++++++
 net/ipv4/tcp.c                         |    7 +++++++
 net/ipv4/tcp_ipv4.c                    |    1 +
 net/ipv4/tcp_output.c                  |    3 +++
 net/ipv6/tcp_ipv6.c                    |    1 +
 10 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 1074290..53cea9b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -516,6 +516,19 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max
 	this value is ignored.
 	Default: between 64K and 4MB, depending on RAM size.
 
+tcp_notsent_lowat - UNSIGNED INTEGER
+	A TCP socket can control the amount of unsent bytes in its write queue,
+	thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
+	reports POLLOUT events if the amount of unsent bytes is below a per
+	socket value, and if the write queue is not full. sendmsg() will
+	also not add new buffers if the limit is hit.
+
+	This global variable controls the amount of unsent data for
+	sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
+	to the global variable has immediate effect.
+
+	Default: UINT_MAX (0xFFFFFFFF)
+
 tcp_workaround_signed_windows - BOOLEAN
 	If set, assume no receipt of a window scaling option means the
 	remote TCP is broken and treats the window as a signed quantity.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 472120b..9640803 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -238,6 +238,7 @@ struct tcp_sock {
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
+	u32	notsent_lowat;	/* TCP_NOTSENT_LOWAT */
 	u32	pushed_seq;	/* Last pushed seq, required to talk to windows */
 	u32	lost_out;	/* Lost packets			*/
 	u32	sacked_out;	/* SACK'd packets			*/
diff --git a/include/net/sock.h b/include/net/sock.h
index d0b5fde..b9f2b09 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -746,11 +746,6 @@ static inline int sk_stream_wspace(const struct sock *sk)
 
 extern void sk_stream_write_space(struct sock *sk);
 
-static inline bool sk_stream_memory_free(const struct sock *sk)
-{
-	return sk->sk_wmem_queued < sk->sk_sndbuf;
-}
-
 /* OOB backlog add */
 static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
@@ -950,6 +945,7 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
+	bool			(*stream_memory_free)(const struct sock *sk);
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
 	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
@@ -1088,11 +1084,22 @@ static inline struct cg_proto *parent_cg_proto(struct proto *proto,
 }
 #endif
 
+static inline bool sk_stream_memory_free(const struct sock *sk)
+{
+	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
+		return false;
+
+	return sk->sk_prot->stream_memory_free ?
+		sk->sk_prot->stream_memory_free(sk) : true;
+}
+
 static inline bool sk_stream_is_writeable(const struct sock *sk)
 {
-	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
+	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
+	       sk_stream_memory_free(sk);
 }
 
+
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
 	return sk->sk_prot->memory_pressure != NULL;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c586847..18fc999 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -284,6 +284,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern unsigned int sysctl_tcp_notsent_lowat;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -1539,6 +1540,19 @@ extern int tcp_gro_complete(struct sk_buff *skb);
 extern void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr,
 				__be32 daddr);
 
+static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp)
+{
+	return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat;
+}
+
+static inline bool tcp_stream_memory_free(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u32 notsent_bytes = tp->write_seq - tp->snd_nxt;
+
+	return notsent_bytes < tcp_notsent_lowat(tp);
+}
+
 #ifdef CONFIG_PROC_FS
 extern int tcp4_proc_init(void);
 extern void tcp4_proc_exit(void);
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 8d776eb..377f1e5 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -111,6 +111,7 @@ enum {
 #define TCP_REPAIR_OPTIONS	22
 #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
 #define TCP_TIMESTAMP		24
+#define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b2c123c..69ed203 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -555,6 +555,13 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &one,
 	},
 	{
+		.procname	= "tcp_notsent_lowat",
+		.data		= &sysctl_tcp_notsent_lowat,
+		.maxlen		= sizeof(sysctl_tcp_notsent_lowat),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "tcp_rmem",
 		.data		= &sysctl_tcp_rmem,
 		.maxlen		= sizeof(sysctl_tcp_rmem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5eca906..c27e813 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2631,6 +2631,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		else
 			tp->tsoffset = val - tcp_time_stamp;
 		break;
+	case TCP_NOTSENT_LOWAT:
+		tp->notsent_lowat = val;
+		sk->sk_write_space(sk);
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2847,6 +2851,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_TIMESTAMP:
 		val = tcp_time_stamp + tp->tsoffset;
 		break;
+	case TCP_NOTSENT_LOWAT:
+		val = tp->notsent_lowat;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 2e3f129..2a5d5c4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2800,6 +2800,7 @@ struct proto tcp_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
+	.stream_memory_free	= tcp_stream_memory_free,
 	.sockets_allocated	= &tcp_sockets_allocated,
 	.orphan_count		= &tcp_orphan_count,
 	.memory_allocated	= &tcp_memory_allocated,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 92fde8d..884efff 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			   int push_one, gfp_t gfp);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 80fe69e..b792e87 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1924,6 +1924,7 @@ struct proto tcpv6_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
+	.stream_memory_free	= tcp_stream_memory_free,
 	.sockets_allocated	= &tcp_sockets_allocated,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23  3:27 [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option Eric Dumazet
@ 2013-07-23  3:52 ` Hannes Frederic Sowa
  2013-07-23 15:26 ` Rick Jones
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-23  3:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Rick Jones, netdev, Yuchung Cheng, Neal Cardwell,
	Michael Kerrisk

On Mon, Jul 22, 2013 at 08:27:07PM -0700, Eric Dumazet wrote:
> Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
> defined as :
>  Specify the minimum number of bytes in the buffer until
>  the socket layer will pass the data to the protocol)

Oh, I had another understanding of SO_SNDLOWAT in my head: The minimum amount
of free write space in the socket buffer so that select/poll reports POLLOUT.

In my previous mail I was specifically referring to the optimization in
sk_stream_write_space() and not to the whole TCP_NOTSENT_LOWAT knob.

Thanks,

  Hannes


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23  3:27 [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option Eric Dumazet
  2013-07-23  3:52 ` Hannes Frederic Sowa
@ 2013-07-23 15:26 ` Rick Jones
  2013-07-23 15:44   ` Eric Dumazet
  2013-07-23 18:24 ` Yuchung Cheng
  2013-07-25  0:55 ` David Miller
  3 siblings, 1 reply; 11+ messages in thread
From: Rick Jones @ 2013-07-23 15:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On 07/22/2013 08:27 PM, Eric Dumazet wrote:
> [...]
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
>
>   Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
>
>             412,514 context-switches
>
>       200.034645535 seconds time elapsed
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
>
>   Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
>
>           2,675,818 context-switches
>
>       200.029651391 seconds time elapsed


I see that now the service demand increase is more like 8%, though there 
is no longer a throughput increase.  Whether an 8% increase is not a bad 
effect on the CPU usage of a single flow is probably in the eye of the 
beholder.

Anyway, on a more "how to use netperf" theme, while the final confidence 
interval width wasn't reported, given the combination of -l 20, -i 10,3 
and perf stat reporting an elapsed time of 200 seconds, we can conclude 
that the test went the full 10 iterations and so probably didn't 
actually hit the desired confidence interval of 5% wide at 99% probability.

17321.16 Mbit/s is ~132150 16 KB sends per second.  There were roughly 
13,379 context switches per second, so not quite 10 sends per context 
switch, or something like 161831 bytes per context switch.  Does that 
then imply you could have achieved nearly the same performance with 
test-specific -s 160K -S 160K -m 16K? (perhaps a bit more than that 
socket buffer size for contingencies and/or what was "stored"/sent in 
the pipe?)  Or, given that the SO_SNDBUF grew to 1593240 bytes, was 
there really a need for ~1593240 - 131072 or ~1462168 sent bytes in 
flight most of the time?

happy benchmarking,

rick jones


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23 15:26 ` Rick Jones
@ 2013-07-23 15:44   ` Eric Dumazet
  2013-07-23 16:20     ` Rick Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-07-23 15:44 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Tue, 2013-07-23 at 08:26 -0700, Rick Jones wrote:

> I see that now the service demand increase is more like 8%, though there 
> is no longer a throughput increase.  Whether an 8% increase is not a bad 
> effect on the CPU usage of a single flow is probably in the eye of the 
> beholder.

Again, it seems you didn't understand the goal of this patch.

It's not trying to get lower cpu usage, but lower memory usage, _and_
proper logical splitting of the write queue.

> 
> Anyway, on a more "how to use netperf" theme, while the final confidence 
> interval width wasn't reported, given the combination of -l 20, -i 10,3 
> and perf stat reporting an elapsed time of 200 seconds, we can conclude 
> that the test went the full 10 iterations and so probably didn't 
> actually hit the desired confidence interval of 5% wide at 99% probability.
> 
> 17321.16 Mbit/s is ~132150 16 KB sends per second.  There were roughly 
> 13,379 context switches per second, so not quite 10 sends per context 
> switch, or something like 161831 bytes per context switch.  Does that 
> then imply you could have achieved nearly the same performance with 
> test-specific -s 160K -S 160K -m 16K? (perhaps a bit more than that 
> socket buffer size for contingencies and/or what was "stored"/sent in 
> the pipe?)  Or, given that the SO_SNDBUF grew to 1593240 bytes, was 
> there really a need for ~1593240 - 131072 or ~1462168 sent bytes in 
> flight most of the time?
> 
> 

Heh, you are trying the old crap again ;)

Why should we care about setting buffer sizes at all, when we have
autotuning ;)

RTT can vary from 50us to 200ms, rate can vary dynamically as well, some
AQM can trigger with whatever policy, and you can have sudden reorders
because some router chose to apply per-packet load balancing:

- You do not want to hard-code buffer sizes, but instead let the TCP stack
tune them properly.

Sure, I can probably find out what the optimal settings are for a
given workload and given network to get minimal cpu usage.

But the point is having the stack find this automatically.

Further tweaks can be done to avoid a context switch per TSO packet, for
example. If we allow 10 notsent packets, we can probably wait until we
have 5 packets before doing a wakeup.

 


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23 15:44   ` Eric Dumazet
@ 2013-07-23 16:20     ` Rick Jones
  2013-07-23 16:48       ` Eric Dumazet
  2013-07-23 17:18       ` Eric Dumazet
  0 siblings, 2 replies; 11+ messages in thread
From: Rick Jones @ 2013-07-23 16:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On 07/23/2013 08:44 AM, Eric Dumazet wrote:
> On Tue, 2013-07-23 at 08:26 -0700, Rick Jones wrote:
>
>> I see that now the service demand increase is more like 8%, though there
>> is no longer a throughput increase.  Whether an 8% increase is not a bad
>> effect on the CPU usage of a single flow is probably in the eye of the
>> beholder.
>
> Again, it seems you didn't understand the goal of this patch.
>
> It's not trying to get lower cpu usage, but lower memory usage, _and_
> proper logical splitting of the write queue.

Right - I am questioning whether it is worth the CPU increase.

> Heh, you are trying the old crap again ;)

Yes - why do you seem to be resisting?-)

> Why should we care about setting buffer sizes at all, when we have
> autotuning ;)

Because it keeps growing the buffer too large?-)

> RTT can vary from 50us to 200ms, rate can vary dynamically as well, some
> AQM can trigger with whatever policy, and you can have sudden reorders
> because some router chose to apply per-packet load balancing:
>
> - You do not want to hard-code buffer sizes, but instead let the TCP stack
> tune them properly.

I agree that is far nicer if it can be counted upon to work well.

> Sure, I can probably find out what the optimal settings are for a
> given workload and given network to get minimal cpu usage.
>
> But the point is having the stack find this automatically.
>
> Further tweaks can be done to avoid a context switch per TSO packet, for
> example. If we allow 10 notsent packets, we can probably wait until we
> have 5 packets before doing a wakeup.

Isn't this change really just trying to paper over the autotuning's 
over-growing of the socket buffers?  Or are you considering it an 
extension of the auto-tuning heuristics?

If your 20Gbit test setup needed only 256KB socket buffers (figure 
pulled from the ether) to get to 17 Gbit/s, isn't the autotuning's 
growing them to several MB a bug in the autotuning?

rick


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23 16:20     ` Rick Jones
@ 2013-07-23 16:48       ` Eric Dumazet
  2013-07-23 17:18       ` Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-07-23 16:48 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Tue, 2013-07-23 at 09:20 -0700, Rick Jones wrote:

> Right - I am questioning whether it is worth the CPU increase.

There is no cpu increase for common workloads, and hosts can save GBs of
precious memory thanks to this patch.

There is a cpu increase only for 'netperf'-style programs, relying on
blocking sendmsg() and using one thread per socket, _if_ and only _if_
they set a crazy notsent_lowat value.

Remember, I forced nobody to do that. It's like forcing SO_SNDBUF to one
byte, and SO_RCVBUF to one byte, and expecting good line rate
performance!

This patch changes the threshold to get the 'socket is writeable'
POLLOUT event, and avoids filling socket write queues with too many
packets.

Like all thresholds, it has to be used properly.
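
The 'socket is writeable' threshold described here is what a non-blocking
sender's event loop keys off. A sketch of that pattern (illustrative only:
the epoll structure and the 128 KB value are this sketch's assumptions, not
part of the patch, and the setsockopt() failure is deliberately ignored so
the loop also runs on sockets or kernels without TCP_NOTSENT_LOWAT):

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25
#endif

/* Push a buffer through a non-blocking socket, sleeping in epoll_wait()
 * until EPOLLOUT; with TCP_NOTSENT_LOWAT set, that wakeup happens only
 * once the not-yet-sent backlog drops below the low-water mark. */
static ssize_t send_all(int fd, const char *buf, size_t len)
{
	unsigned int lowat = 128 * 1024;	/* assumed value for the sketch */
	struct epoll_event ev = { .events = EPOLLOUT };
	size_t off = 0;
	int ep = epoll_create1(0);

	if (ep < 0)
		return -1;
	/* Ignore failure so the loop also works where the option is absent. */
	setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
	ev.data.fd = fd;
	epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

	while (off < len) {
		ssize_t n = send(fd, buf + off, len - off, MSG_DONTWAIT);

		if (n > 0)
			off += n;
		else if (n < 0 && errno == EAGAIN)
			epoll_wait(ep, &ev, 1, -1);	/* wait until writeable */
		else
			break;
	}
	close(ep);
	return (ssize_t)off;
}
```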


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23 16:20     ` Rick Jones
  2013-07-23 16:48       ` Eric Dumazet
@ 2013-07-23 17:18       ` Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-07-23 17:18 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Tue, 2013-07-23 at 09:20 -0700, Rick Jones wrote:

> Isn't this change really just trying to paper over the autotuning's 
> over-growing of the socket buffers?  Or are you considering it an 
> extension of the auto-tuning heuristics?
> 
> If your 20Gbit test setup needed only 256KB socket buffers (figure 
> pulled from the ether) to get to 17 Gbit/s, isn't the autotuning's 
> growing them to several MB a bug in the autotuning?


As long as we limit the number of unsent bytes, there is no longer an
over-provisioning problem.

The TCP stack will be able to use the large windows if _needed_ by current
network conditions and the receiver's (in)ability to drain the data, and if
allowed by congestion control constraints.

If you are now complaining that TCP congestion controls are bad, that's a
completely different story, and this patch does not claim to solve it.


* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23  3:27 [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option Eric Dumazet
  2013-07-23  3:52 ` Hannes Frederic Sowa
  2013-07-23 15:26 ` Rick Jones
@ 2013-07-23 18:24 ` Yuchung Cheng
  2013-07-25  0:55 ` David Miller
  3 siblings, 0 replies; 11+ messages in thread
From: Yuchung Cheng @ 2013-07-23 18:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Rick Jones, netdev, Neal Cardwell, Michael Kerrisk

On Mon, Jul 22, 2013 at 8:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Idea of this patch is to add optional limitation of number of
> unsent bytes in TCP sockets, to reduce usage of kernel memory.
>
> TCP receiver might announce a big window, and TCP sender autotuning
> might allow a large amount of bytes in write queue, but this has little
> performance impact if a large part of this buffering is wasted :
>
> Write queue needs to be large only to deal with large BDP, not
> necessarily to cope with scheduling delays (incoming ACKS make room
> for the application to queue more bytes)
>
> For most workloads, using a value of 128 KB or less is OK to give
> applications enough time to react to POLLOUT events in time
> (or being awaken in a blocking sendmsg())
>
> This patch adds two ways to set the limit :
>
> 1) Per socket option TCP_NOTSENT_LOWAT
>
> 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
> not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
> Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
>
>
> This changes poll()/select()/epoll() to report POLLOUT
> only if number of unsent bytes is below tp->nosent_lowat
>
> Note this might increase number of sendmsg()/sendfile() calls
> when using non blocking sockets,
> and increase number of context switches for blocking sockets.
>
> Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
> defined as :
>  Specify the minimum number of bytes in the buffer until
>  the socket layer will pass the data to the protocol)
>
> Tested:
>
> netperf sessions, and watching /proc/net/protocols "memory" column for TCP
>
> With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
> used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
> TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
> TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
> TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
> TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
>
> Using 128KB has no adverse effect on the throughput or cpu usage
> of a single flow, although there is an increase in context switches.
>
> A bonus is that we hold the socket lock for a shorter amount
> of time, which should improve latencies of ACK processing.
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
>
>  Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
>
>            412,514 context-switches
>
>      200.034645535 seconds time elapsed
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
>
>  Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
>
>          2,675,818 context-switches
>
>      200.029651391 seconds time elapsed
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>

Sorry I acked the wrong (v2) patch previously.

> ---
> v3: use the sk_stream_is_writeable() helper and fix the too many wakeup issue
> v2: title/changelog fix (TCP_NOSENT_LOWAT -> TCP_NOTSENT_LOWAT)
>
>  Documentation/networking/ip-sysctl.txt |   13 +++++++++++++
>  include/linux/tcp.h                    |    1 +
>  include/net/sock.h                     |   19 +++++++++++++------
>  include/net/tcp.h                      |   14 ++++++++++++++
>  include/uapi/linux/tcp.h               |    1 +
>  net/ipv4/sysctl_net_ipv4.c             |    7 +++++++
>  net/ipv4/tcp.c                         |    7 +++++++
>  net/ipv4/tcp_ipv4.c                    |    1 +
>  net/ipv4/tcp_output.c                  |    3 +++
>  net/ipv6/tcp_ipv6.c                    |    1 +
>  10 files changed, 61 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 1074290..53cea9b 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -516,6 +516,19 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max
>         this value is ignored.
>         Default: between 64K and 4MB, depending on RAM size.
>
> +tcp_notsent_lowat - UNSIGNED INTEGER
> +       A TCP socket can control the amount of unsent bytes in its write queue,
> +       thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
> +       reports POLLOUT events if the amount of unsent bytes is below a per
> +       socket value, and if the write queue is not full. sendmsg() will
> +       also not add new buffers if the limit is hit.
> +
> +       This global variable controls the amount of unsent data for
> +       sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
> +       to the global variable has immediate effect.
> +
> +       Default: UINT_MAX (0xFFFFFFFF)
> +
>  tcp_workaround_signed_windows - BOOLEAN
>         If set, assume no receipt of a window scaling option means the
>         remote TCP is broken and treats the window as a signed quantity.
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 472120b..9640803 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -238,6 +238,7 @@ struct tcp_sock {
>
>         u32     rcv_wnd;        /* Current receiver window              */
>         u32     write_seq;      /* Tail(+1) of data held in tcp send buffer */
> +       u32     notsent_lowat;  /* TCP_NOTSENT_LOWAT */
>         u32     pushed_seq;     /* Last pushed seq, required to talk to windows */
>         u32     lost_out;       /* Lost packets                 */
>         u32     sacked_out;     /* SACK'd packets                       */
> diff --git a/include/net/sock.h b/include/net/sock.h
> index d0b5fde..b9f2b09 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -746,11 +746,6 @@ static inline int sk_stream_wspace(const struct sock *sk)
>
>  extern void sk_stream_write_space(struct sock *sk);
>
> -static inline bool sk_stream_memory_free(const struct sock *sk)
> -{
> -       return sk->sk_wmem_queued < sk->sk_sndbuf;
> -}
> -
>  /* OOB backlog add */
>  static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
> @@ -950,6 +945,7 @@ struct proto {
>         unsigned int            inuse_idx;
>  #endif
>
> +       bool                    (*stream_memory_free)(const struct sock *sk);
>         /* Memory pressure */
>         void                    (*enter_memory_pressure)(struct sock *sk);
>         atomic_long_t           *memory_allocated;      /* Current allocated memory. */
> @@ -1088,11 +1084,22 @@ static inline struct cg_proto *parent_cg_proto(struct proto *proto,
>  }
>  #endif
>
> +static inline bool sk_stream_memory_free(const struct sock *sk)
> +{
> +       if (sk->sk_wmem_queued >= sk->sk_sndbuf)
> +               return false;
> +
> +       return sk->sk_prot->stream_memory_free ?
> +               sk->sk_prot->stream_memory_free(sk) : true;
> +}
> +
>  static inline bool sk_stream_is_writeable(const struct sock *sk)
>  {
> -       return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
> +       return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
> +              sk_stream_memory_free(sk);
>  }
>
> +
>  static inline bool sk_has_memory_pressure(const struct sock *sk)
>  {
>         return sk->sk_prot->memory_pressure != NULL;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c586847..18fc999 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -284,6 +284,7 @@ extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
> +extern unsigned int sysctl_tcp_notsent_lowat;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> @@ -1539,6 +1540,19 @@ extern int tcp_gro_complete(struct sk_buff *skb);
>  extern void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr,
>                                 __be32 daddr);
>
> +static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp)
> +{
> +       return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat;
> +}
> +
> +static inline bool tcp_stream_memory_free(const struct sock *sk)
> +{
> +       const struct tcp_sock *tp = tcp_sk(sk);
> +       u32 notsent_bytes = tp->write_seq - tp->snd_nxt;
> +
> +       return notsent_bytes < tcp_notsent_lowat(tp);
> +}
> +
>  #ifdef CONFIG_PROC_FS
>  extern int tcp4_proc_init(void);
>  extern void tcp4_proc_exit(void);
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 8d776eb..377f1e5 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -111,6 +111,7 @@ enum {
>  #define TCP_REPAIR_OPTIONS     22
>  #define TCP_FASTOPEN           23      /* Enable FastOpen on listeners */
>  #define TCP_TIMESTAMP          24
> +#define TCP_NOTSENT_LOWAT      25      /* limit number of unsent bytes in write queue */
>
>  struct tcp_repair_opt {
>         __u32   opt_code;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index b2c123c..69ed203 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -555,6 +555,13 @@ static struct ctl_table ipv4_table[] = {
>                 .extra1         = &one,
>         },
>         {
> +               .procname       = "tcp_notsent_lowat",
> +               .data           = &sysctl_tcp_notsent_lowat,
> +               .maxlen         = sizeof(sysctl_tcp_notsent_lowat),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec,
> +       },
> +       {
>                 .procname       = "tcp_rmem",
>                 .data           = &sysctl_tcp_rmem,
>                 .maxlen         = sizeof(sysctl_tcp_rmem),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 5eca906..c27e813 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2631,6 +2631,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>                 else
>                         tp->tsoffset = val - tcp_time_stamp;
>                 break;
> +       case TCP_NOTSENT_LOWAT:
> +               tp->notsent_lowat = val;
> +               sk->sk_write_space(sk);
> +               break;
>         default:
>                 err = -ENOPROTOOPT;
>                 break;
> @@ -2847,6 +2851,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>         case TCP_TIMESTAMP:
>                 val = tcp_time_stamp + tp->tsoffset;
>                 break;
> +       case TCP_NOTSENT_LOWAT:
> +               val = tp->notsent_lowat;
> +               break;
>         default:
>                 return -ENOPROTOOPT;
>         }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 2e3f129..2a5d5c4 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2800,6 +2800,7 @@ struct proto tcp_prot = {
>         .unhash                 = inet_unhash,
>         .get_port               = inet_csk_get_port,
>         .enter_memory_pressure  = tcp_enter_memory_pressure,
> +       .stream_memory_free     = tcp_stream_memory_free,
>         .sockets_allocated      = &tcp_sockets_allocated,
>         .orphan_count           = &tcp_orphan_count,
>         .memory_allocated       = &tcp_memory_allocated,
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 92fde8d..884efff 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
>  /* By default, RFC2861 behavior.  */
>  int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
>
> +unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
> +EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
> +
>  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                            int push_one, gfp_t gfp);
>
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 80fe69e..b792e87 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1924,6 +1924,7 @@ struct proto tcpv6_prot = {
>         .unhash                 = inet_unhash,
>         .get_port               = inet_csk_get_port,
>         .enter_memory_pressure  = tcp_enter_memory_pressure,
> +       .stream_memory_free     = tcp_stream_memory_free,
>         .sockets_allocated      = &tcp_sockets_allocated,
>         .memory_allocated       = &tcp_memory_allocated,
>         .memory_pressure        = &tcp_memory_pressure,
>
>
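[Editor's note: the check added in tcp_stream_memory_free() above is a plain unsigned 32-bit subtraction, which keeps the unsent-byte count correct even when write_seq wraps past 0xFFFFFFFF. A userspace sketch of the same logic, for illustration only, not kernel code:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Editor's illustration of tcp_stream_memory_free(): the amount of
 * queued-but-unsent data is write_seq - snd_nxt, computed in u32
 * arithmetic so it remains correct across sequence-number wraparound. */
static bool stream_memory_free(uint32_t write_seq, uint32_t snd_nxt,
			       uint32_t notsent_lowat)
{
	uint32_t notsent_bytes = write_seq - snd_nxt;

	return notsent_bytes < notsent_lowat;
}
```

With the default watermark of UINT_MAX the predicate is effectively always true, which is how the patch keeps existing behavior unchanged for sockets that never set the option.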

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
  2013-07-23  3:27 [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option Eric Dumazet
                   ` (2 preceding siblings ...)
  2013-07-23 18:24 ` Yuchung Cheng
@ 2013-07-25  0:55 ` David Miller
  3 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2013-07-25  0:55 UTC (permalink / raw)
  To: eric.dumazet; +Cc: rick.jones2, netdev, ycheng, ncardwell, mtk.manpages

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 22 Jul 2013 20:27:07 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Idea of this patch is to add optional limitation of number of
> unsent bytes in TCP sockets, to reduce usage of kernel memory.
> 
> TCP receiver might announce a big window, and TCP sender autotuning
> might allow a large amount of bytes in write queue, but this has little
> performance impact if a large part of this buffering is wasted :
> 
> Write queue needs to be large only to deal with large BDP, not
> necessarily to cope with scheduling delays (incoming ACKS make room
> for the application to queue more bytes)
> 
> For most workloads, using a value of 128 KB or less is OK to give
> applications enough time to react to POLLOUT events in time
> (or being awaken in a blocking sendmsg())
> 
> This patch adds two ways to set the limit :
> 
> 1) Per socket option TCP_NOTSENT_LOWAT
> 
> 2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
> not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
> Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
> 
> 
> This changes poll()/select()/epoll() to report POLLOUT 
> only if the number of unsent bytes is below tp->notsent_lowat
> 
> Note this might increase number of sendmsg()/sendfile() calls
> when using non blocking sockets,
> and increase number of context switches for blocking sockets.
> 
> Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
> defined as :
>  Specify the minimum number of bytes in the buffer until
>  the socket layer will pass the data to the protocol)
> 
> Tested:
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
@ 2013-07-23 19:28 Neal Cardwell
  0 siblings, 0 replies; 11+ messages in thread
From: Neal Cardwell @ 2013-07-23 19:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Rick Jones, netdev, Yuchung Cheng, Michael Kerrisk

On Mon, Jul 22, 2013 at 11:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Idea of this patch is to add optional limitation of number of
> unsent bytes in TCP sockets, to reduce usage of kernel memory.
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>


Acked-by: Neal Cardwell <ncardwell@google.com>


[apologies for the dup, patchworks didn't seem to like my last email,
I think perhaps due to a missing final newline...]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option
@ 2013-07-23 19:19 Neal Cardwell
  0 siblings, 0 replies; 11+ messages in thread
From: Neal Cardwell @ 2013-07-23 19:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Rick Jones, netdev, Yuchung Cheng, Michael Kerrisk

On Mon, Jul 22, 2013 at 11:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Idea of this patch is to add optional limitation of number of
> unsent bytes in TCP sockets, to reduce usage of kernel memory.
 ...
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-07-25  0:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-23  3:27 [PATCH v3 net-next 2/2] tcp: TCP_NOTSENT_LOWAT socket option Eric Dumazet
2013-07-23  3:52 ` Hannes Frederic Sowa
2013-07-23 15:26 ` Rick Jones
2013-07-23 15:44   ` Eric Dumazet
2013-07-23 16:20     ` Rick Jones
2013-07-23 16:48       ` Eric Dumazet
2013-07-23 17:18       ` Eric Dumazet
2013-07-23 18:24 ` Yuchung Cheng
2013-07-25  0:55 ` David Miller
2013-07-23 19:19 Neal Cardwell
2013-07-23 19:28 Neal Cardwell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).