netdev.vger.kernel.org archive mirror
* [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
@ 2013-07-22 19:13 Eric Dumazet
  2013-07-22 19:28 ` Eric Dumazet
  2013-07-22 20:43 ` Rick Jones
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-22 19:13 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

From: Eric Dumazet <edumazet@google.com>

The idea of this patch is to add an optional limit on the number of
unsent bytes in TCP sockets, to reduce kernel memory usage.

A TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in the write queue, but this has
little performance impact if a large part of this buffering is wasted:

the write queue needs to be large only to deal with a large BDP, not
necessarily to cope with scheduling delays (incoming ACKs make room
for the application to queue more bytes)

For most workloads, a value of 128 KB or less is enough to give
applications time to react to POLLOUT events
(or to be woken up from a blocking sendmsg())

This patch adds two ways to set the limit:

1) A per-socket option, TCP_NOSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using the TCP_NOSENT_LOWAT socket option (or setting a zero value).
The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no effect.


This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat

Note this might increase the number of sendmsg() calls when using
non-blocking sockets, and the number of context switches for blocking
sockets.

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

Even in the absence of shallow queues, we get a benefit.

With 200 concurrent netperf -t TCP_STREAM sessions, the amount of kernel
memory used by TCP buffers shrinks by ~55% (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput of a single flow, although
there is an increase in cpu time as sendmsg() calls trigger more
context switches. A bonus is that we hold the socket lock for a shorter
amount of time, which should improve latencies.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H lpq84 -t omni -l 20 -Cc
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84 () port 0 AF_INET
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
2097152     6000000     16384  20.00   16509.68   10^6bits/s  3.05  S      4.50   S      0.363   0.536   usec/KB

 Performance counter stats for './netperf -H lpq84 -t omni -l 20 -Cc':

            30,141 context-switches

      20.006308407 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H lpq84 -t omni -l 20 -Cc
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84 () port 0 AF_INET
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1911888     6000000     16384  20.00   17412.51   10^6bits/s  3.94  S      4.39   S      0.444   0.496   usec/KB

 Performance counter stats for './netperf -H lpq84 -t omni -l 20 -Cc':

           284,669 context-switches

      20.005294656 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 Documentation/networking/ip-sysctl.txt |   13 +++++++++++++
 include/linux/tcp.h                    |    1 +
 include/net/sock.h                     |   15 ++++++++++-----
 include/net/tcp.h                      |   14 ++++++++++++++
 include/uapi/linux/tcp.h               |    1 +
 net/ipv4/sysctl_net_ipv4.c             |    7 +++++++
 net/ipv4/tcp.c                         |   12 ++++++++++--
 net/ipv4/tcp_ipv4.c                    |    1 +
 net/ipv4/tcp_output.c                  |    3 +++
 net/ipv6/tcp_ipv6.c                    |    1 +
 10 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 1074290..53cea9b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -516,6 +516,19 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max
 	this value is ignored.
 	Default: between 64K and 4MB, depending on RAM size.
 
+tcp_notsent_lowat - UNSIGNED INTEGER
+	A TCP socket can control the amount of unsent bytes in its write queue,
+	thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
+	reports POLLOUT events if the amount of unsent bytes is below a per
+	socket value, and if the write queue is not full. sendmsg() will
+	also not add new buffers if the limit is hit.
+
+	This global variable controls the amount of unsent data for
+	sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
+	to the global variable has immediate effect.
+
+	Default: UINT_MAX (0xFFFFFFFF)
+
 tcp_workaround_signed_windows - BOOLEAN
 	If set, assume no receipt of a window scaling option means the
 	remote TCP is broken and treats the window as a signed quantity.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 472120b..9640803 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -238,6 +238,7 @@ struct tcp_sock {
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
+	u32	notsent_lowat;	/* TCP_NOTSENT_LOWAT */
 	u32	pushed_seq;	/* Last pushed seq, required to talk to windows */
 	u32	lost_out;	/* Lost packets			*/
 	u32	sacked_out;	/* SACK'd packets			*/
diff --git a/include/net/sock.h b/include/net/sock.h
index 95a5a2c..7be0b22 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -746,11 +746,6 @@ static inline int sk_stream_wspace(const struct sock *sk)
 
 extern void sk_stream_write_space(struct sock *sk);
 
-static inline bool sk_stream_memory_free(const struct sock *sk)
-{
-	return sk->sk_wmem_queued < sk->sk_sndbuf;
-}
-
 /* OOB backlog add */
 static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
@@ -950,6 +945,7 @@ struct proto {
 	unsigned int		inuse_idx;
 #endif
 
+	bool			(*stream_memory_free)(const struct sock *sk);
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
 	atomic_long_t		*memory_allocated;	/* Current allocated memory. */
@@ -1089,6 +1085,15 @@ static inline struct cg_proto *parent_cg_proto(struct proto *proto,
 #endif
 
 
+static inline bool sk_stream_memory_free(const struct sock *sk)
+{
+	if (sk->sk_wmem_queued >= sk->sk_sndbuf)
+		return false;
+
+	return sk->sk_prot->stream_memory_free ?
+		sk->sk_prot->stream_memory_free(sk) : true;
+}
+
 static inline bool sk_has_memory_pressure(const struct sock *sk)
 {
 	return sk->sk_prot->memory_pressure != NULL;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d198005..ff58714 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -284,6 +284,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern unsigned int sysctl_tcp_notsent_lowat;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -1549,6 +1550,19 @@ extern int tcp_gro_complete(struct sk_buff *skb);
 extern void __tcp_v4_send_check(struct sk_buff *skb, __be32 saddr,
 				__be32 daddr);
 
+static inline u32 tcp_notsent_lowat(const struct tcp_sock *tp)
+{
+	return tp->notsent_lowat ?: sysctl_tcp_notsent_lowat;
+}
+
+static inline bool tcp_stream_memory_free(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u32 notsent_bytes = tp->write_seq - tp->snd_nxt;
+
+	return notsent_bytes < tcp_notsent_lowat(tp);
+}
+
 #ifdef CONFIG_PROC_FS
 extern int tcp4_proc_init(void);
 extern void tcp4_proc_exit(void);
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 8d776eb..377f1e5 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -111,6 +111,7 @@ enum {
 #define TCP_REPAIR_OPTIONS	22
 #define TCP_FASTOPEN		23	/* Enable FastOpen on listeners */
 #define TCP_TIMESTAMP		24
+#define TCP_NOTSENT_LOWAT	25	/* limit number of unsent bytes in write queue */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b2c123c..69ed203 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -555,6 +555,13 @@ static struct ctl_table ipv4_table[] = {
 		.extra1		= &one,
 	},
 	{
+		.procname	= "tcp_notsent_lowat",
+		.data		= &sysctl_tcp_notsent_lowat,
+		.maxlen		= sizeof(sysctl_tcp_notsent_lowat),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "tcp_rmem",
 		.data		= &sysctl_tcp_rmem,
 		.maxlen		= sizeof(sysctl_tcp_rmem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5423223..5792302 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -499,7 +499,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 			mask |= POLLIN | POLLRDNORM;
 
 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-			if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)) {
+			if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
+			    tcp_stream_memory_free(sk)) {
 				mask |= POLLOUT | POLLWRNORM;
 			} else {  /* send SIGIO later */
 				set_bit(SOCK_ASYNC_NOSPACE,
@@ -510,7 +511,8 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 				 * wspace test but before the flags are set,
 				 * IO signal will be lost.
 				 */
-				if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
+				if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
+				    tcp_stream_memory_free(sk))
 					mask |= POLLOUT | POLLWRNORM;
 			}
 		} else
@@ -2631,6 +2633,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		else
 			tp->tsoffset = val - tcp_time_stamp;
 		break;
+	case TCP_NOTSENT_LOWAT:
+		tp->notsent_lowat = val;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2847,6 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_TIMESTAMP:
 		val = tcp_time_stamp + tp->tsoffset;
 		break;
+	case TCP_NOTSENT_LOWAT:
+		val = tp->notsent_lowat;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b74628e..8390bff 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2806,6 +2806,7 @@ struct proto tcp_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
+	.stream_memory_free	= tcp_stream_memory_free,
 	.sockets_allocated	= &tcp_sockets_allocated,
 	.orphan_count		= &tcp_orphan_count,
 	.memory_allocated	= &tcp_memory_allocated,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 92fde8d..884efff 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -65,6 +65,9 @@ int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+unsigned int sysctl_tcp_notsent_lowat __read_mostly = UINT_MAX;
+EXPORT_SYMBOL(sysctl_tcp_notsent_lowat);
+
 static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			   int push_one, gfp_t gfp);
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f0d6363..0030cfd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1927,6 +1927,7 @@ struct proto tcpv6_prot = {
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
+	.stream_memory_free	= tcp_stream_memory_free,
 	.sockets_allocated	= &tcp_sockets_allocated,
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-22 19:13 [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option Eric Dumazet
@ 2013-07-22 19:28 ` Eric Dumazet
  2013-07-22 20:43 ` Rick Jones
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-22 19:28 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Mon, 2013-07-22 at 12:13 -0700, Eric Dumazet wrote:
> TCP_NOSENT_LOWAT

For an unknown reason, this was spelled incorrectly.

I'll send a V2 with TCP_NOTSENT_LOWAT.


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-22 19:13 [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option Eric Dumazet
  2013-07-22 19:28 ` Eric Dumazet
@ 2013-07-22 20:43 ` Rick Jones
  2013-07-22 22:44   ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Rick Jones @ 2013-07-22 20:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On 07/22/2013 12:13 PM, Eric Dumazet wrote:

>
> Tested:
>
> netperf sessions, and watching /proc/net/protocols "memory" column for TCP
>
> Even in the absence of shallow queues, we get a benefit.
>
> With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
> used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
> TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
> TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
> TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
> TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
>
> Using 128KB has no bad effect on the throughput of a single flow, although
> there is an increase of cpu time as sendmsg() calls trigger more
> context switches. A bonus is that we hold socket lock for a shorter amount
> of time and should improve latencies.
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H lpq84 -t omni -l 20 -Cc
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84 () port 0 AF_INET
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 2097152     6000000     16384  20.00   16509.68   10^6bits/s  3.05  S      4.50   S      0.363   0.536   usec/KB
>
>   Performance counter stats for './netperf -H lpq84 -t omni -l 20 -Cc':
>
>              30,141 context-switches
>
>        20.006308407 seconds time elapsed
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat -e context-switches ./netperf -H lpq84 -t omni -l 20 -Cc
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84 () port 0 AF_INET
> Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                             %     Method %      Method
> 1911888     6000000     16384  20.00   17412.51   10^6bits/s  3.94  S      4.39   S      0.444   0.496   usec/KB
>
>   Performance counter stats for './netperf -H lpq84 -t omni -l 20 -Cc':
>
>             284,669 context-switches
>
>        20.005294656 seconds time elapsed

Netperf is perhaps a "best case" for this as it has no think time and
will not itself build up a queue of data internally.

The 18% increase in service demand is troubling.

It would be good to hit that with the confidence intervals (eg -i 30,3
and perhaps -i 99,<something other than the default of 5>) or do many
separate runs to get an idea of the variation.  Presumably remote
service demand is not of interest, so for the confidence intervals bit
you might drop the -C and keep only the -c, in which case netperf will
not try to hit the confidence interval for remote CPU utilization
along with local CPU and throughput.

Why are there more context switches with the lowat set to 128KB?  Is the
SO_SNDBUF growth in the first case the reason? Otherwise I would have
thought that netperf would have been context switching back and forth at
"socket full" just as often as at "128KB." You might then also
compare before and after with a fixed socket buffer size.

Anything interesting happen when the send size is larger than the lowat?

rick jones


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-22 20:43 ` Rick Jones
@ 2013-07-22 22:44   ` Eric Dumazet
  2013-07-22 23:08     ` Rick Jones
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2013-07-22 22:44 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

Hi Rick

> Netperf is perhaps a "best case" for this as it has no think time and 
> will not itself build up a queue of data internally.
> 
> The 18% increase in service demand is troubling.

It's not troubling at such high speeds. (Note also that I had better
throughput in my (single) test.)

Process scheduler cost is abysmal (or, more exactly, the cost when the
cpu enters idle mode, I presume).

Adding a context switch for every TSO packet is obviously not something
you want if you want to pump 20Gbps over a single tcp socket. I guess
that a real application would not use 16KB send()s either.

I chose extreme parameters to show that the patch had acceptable impact.
(128KB are only 2 TSO packets)

The main targets of this patch are servers handling hundreds to millions
of sockets, or any machine with RAM constraints. This would also permit
better autotuning in the future. Our current 4MB limit is a bit small in
some cases.

Allowing the socket write queue to queue more bytes is better for
throughput/cpu cycles, as long as you have enough RAM.


> 
> It would be good to hit that with the confidence intervals (eg -i 30,3 
> and perhaps -i 99,<somthing other than the default of 5>) or do many 
> separate runs to get an idea of the variation.  Presumably remote 
> service demand is not of interest, so for the confidence intervals bit 
> you might drop the -C and keep only the -c in which case, netperf will 
> not be trying to hit the confidence interval remote CPU utilization 
> along with local CPU and throughput
> 

Well, I am sure a lot of netperf tests can be done, thanks for the
input! I am removing the -C ;)

The -i30,3 runs are usually very very very slow :(

> Why are there more context switches with the lowat set to 128KB?  Is the 
> SO_SNDBUF growth in the first case the reason? Otherwise I would have 
> thought that netperf would have been context switching back and forth at 
> at "socket full" just as often as "at 128KB." You might then also 
> compare before and after with a fixed socket buffer size

It seems to me normal to get one context switch per TSO packet, instead
of _no_ context switches when the cpu is so busy it never has to put the
netperf thread to sleep: softirq handling removes packets from the write
queue at the same speed as the application can add new ones ;)

> 
> Anything interesting happen when the send size is larger than the lowat?

Let's see ;)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat ./netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local       Remote      Local   Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Recv Socket Send    Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size    (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                              %     Method %      Method                          
3056328     6291456     262144  20.00   16311.69   10^6bits/s  2.97  S      -1.00  U      0.359   -1.000  usec/KB  

 Performance counter stats for './netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K':

      89301.211847 task-clock                #    0.446 CPUs utilized          
           349,509 context-switches          #    0.004 M/sec                  
               179 CPU-migrations            #    0.002 K/sec                  
               453 page-faults               #    0.005 K/sec                  
   242,819,453,514 cycles                    #    2.719 GHz                     [81.82%]
   199,273,454,019 stalled-cycles-frontend   #   82.07% frontend cycles idle    [84.27%]
    50,268,984,648 stalled-cycles-backend    #   20.70% backend  cycles idle    [67.76%]
    53,781,450,212 instructions              #    0.22  insns per cycle        
                                             #    3.71  stalled cycles per insn [83.77%]
     8,738,372,177 branches                  #   97.853 M/sec                   [82.99%]
       119,158,960 branch-misses             #    1.36% of all branches         [83.17%]

     200.032331409 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat ./netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf. 
Local       Remote      Local   Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service  
Send Socket Recv Socket Send    Time               Units       CPU   CPU    CPU    CPU    Service Service Demand   
Size        Size        Size    (sec)                          Util  Util   Util   Util   Demand  Demand  Units    
Final       Final                                              %     Method %      Method                          
1862520     6291456     262144  20.00   17464.08   10^6bits/s  3.98  S      -1.00  U      0.448   -1.000  usec/KB  

 Performance counter stats for './netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K':

     111290.768845 task-clock                #    0.556 CPUs utilized          
         2,818,205 context-switches          #    0.025 M/sec                  
               201 CPU-migrations            #    0.002 K/sec                  
               453 page-faults               #    0.004 K/sec                  
   297,763,550,604 cycles                    #    2.676 GHz                     [83.35%]
   246,839,427,685 stalled-cycles-frontend   #   82.90% frontend cycles idle    [83.25%]
    75,450,669,370 stalled-cycles-backend    #   25.34% backend  cycles idle    [66.69%]
    63,464,955,178 instructions              #    0.21  insns per cycle        
                                             #    3.89  stalled cycles per insn [83.38%]
    10,564,139,626 branches                  #   94.924 M/sec                   [83.39%]
       248,015,797 branch-misses             #    2.35% of all branches         [83.32%]

     200.028775802 seconds time elapsed


14,091 context switches per second...

Interesting how it actually increases throughput!


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-22 22:44   ` Eric Dumazet
@ 2013-07-22 23:08     ` Rick Jones
  2013-07-23  0:13       ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Rick Jones @ 2013-07-22 23:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On 07/22/2013 03:44 PM, Eric Dumazet wrote:
> Hi Rick
>
>> Netperf is perhaps a "best case" for this as it has no think time and
>> will not itself build up a queue of data internally.
>>
>> The 18% increase in service demand is troubling.
>
> It's not troubling at such high speeds. (Note also that I had better
> throughput in my (single) test.)

Yes, you did, but that was only 5.4%, and it may be in an area where 
there is non-trivial run to run variation.

I would think an increase in service demand is even more troubling at 
high speeds than low speeds.  Particularly when I'm still not at link-rate.

In theory anyway, the service demand is independent of the transfer 
rate.  Of course, practice dictates that different algorithms have 
different behaviours at different speeds, but, with some slightly sweeping 
handwaving, if the service demand went up 18%, that cuts your maximum 
aggregate throughput for the "infinitely fast link" or collection of 
finitely fast links in the system by 18%.

I suppose that brings up the question of what the aggregate throughput 
and CPU utilization were for your 200 concurrent netperf TCP_STREAM sessions.

> Process scheduler cost is abysmal (Or more exactly when cpu enters idle
> mode I presume).
>
> Adding a context switch for every TSO packet is obviously not something
> you want if you want to pump 20Gbps on a single tcp socket.

You wouldn't want it if you were pumping 20 Gbit/s down multiple TCP 
sockets either I'd think.

> I guess that real application would not use 16KB send()s either.

You can use a larger send in netperf - the 16 KB is only because that is 
the default initial SO_SNDBUF size under Linux :)

> I chose extreme parameters to show that the patch had acceptable impact.
> (128KB are only 2 TSO packets)
>
> The main targets of this patch are servers handling hundred to million
> of sockets, or any machine with RAM constraints. This would also permit
> better autotuning in the future. Our current 4MB limit is a bit small in
> some cases.
>
> Allowing the socket write queue to queue more bytes is better for
> throughput/cpu cycles, as long as you have enough RAM.

So, netperf doesn't queue internally - what happens when the application 
does queue internally?  Admittedly, it will be user-space memory (I 
assume) rather than kernel memory, which I suppose is better since it 
can be paged and whatnot.  But if we drop the qualifiers, it is still 
the same quantity of memory overall, right?

By the way, does this affect sendfile() or splice()?

>> It would be good to hit that with the confidence intervals (eg -i 30,3
>> and perhaps -i 99,<something other than the default of 5>) or do many
>> separate runs to get an idea of the variation.  Presumably remote
>> service demand is not of interest, so for the confidence intervals bit
>> you might drop the -C and keep only the -c in which case, netperf will
>> not be trying to hit the confidence interval remote CPU utilization
>> along with local CPU and throughput
>>
>
> Well, I am sure a lot of netperf tests can be done, thanks for the
> input ! I am removing the -C ;)
>
> The -i30,3 runs are usually very very very slow :(

Well, systems aren't as consistent as they once were.

Some of the additional strategies I employ with varying degrees of 
success in getting a single stream -i 30,3 (DON'T use that with 
aggregates) to complete closer to the 3 than the 30:

*) Bind all the IRQs of the NIC to a single CPU, which then makes it 
possible to:

*) Bind netperf (and/or netserver) to that CPU with the -T option or 
taskset.  Or you may want to bind to a peer CPU associated with the same 
L3 data cache if you have a NIC that needs more than a single CPU's 
worth of "oomph" to get (near to) link rate.

*) There may also be some value in setting the system into a 
fixed-frequency mode.
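On the command line, the three strategies above might look like the
following sketch (the IRQ numbers, cpu ids and the lpq84 host are
placeholders taken for illustration; the /proc and /sys writes need root):

```shell
# 1) Steer all of the NIC's IRQs to CPU 0 (IRQ numbers are examples;
#    find the real ones with: grep eth0 /proc/interrupts)
for irq in 41 42 43 44; do
    echo 1 > /proc/irq/$irq/smp_affinity        # bitmask: CPU 0
done

# 2) Pin netperf to the same CPU (or a peer sharing its L3 cache),
#    either with netperf's own -T option ...
netperf -H lpq84 -t omni -l 20 -c -i 30,3 -T 0,0
#    ... or with taskset:
taskset -c 0 netperf -H lpq84 -t omni -l 20 -c -i 30,3

# 3) Lock the CPUs at a fixed frequency so run-to-run variation drops
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $gov
done
```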

>
>> Why are there more context switches with the lowat set to 128KB?  Is the
>> SO_SNDBUF growth in the first case the reason? Otherwise I would have
>> thought that netperf would have been context switching back and forth
>> at "socket full" just as often as at "128KB." You might then also
>> compare before and after with a fixed socket buffer size
>
> It seems to me normal to get one context switch per TSO packet, instead
> of _no_ context switches when the cpu is so busy it never has to put the
> netperf thread to sleep. softirq handling is removing packets from write
> queue at the same speed than application can add new ones ;)
>
>>
>> Anything interesting happen when the send size is larger than the lowat?
>
> Let's see ;)
>
> lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat ./netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local   Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send    Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size    (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                              %     Method %      Method
> 3056328     6291456     262144  20.00   16311.69   10^6bits/s  2.97  S      -1.00  U      0.359   -1.000  usec/KB
>
>   Performance counter stats for './netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K':
>
>        89301.211847 task-clock                #    0.446 CPUs utilized
>             349,509 context-switches          #    0.004 M/sec
>                 179 CPU-migrations            #    0.002 K/sec
>                 453 page-faults               #    0.005 K/sec
>     242,819,453,514 cycles                    #    2.719 GHz                     [81.82%]
>     199,273,454,019 stalled-cycles-frontend   #   82.07% frontend cycles idle    [84.27%]
>      50,268,984,648 stalled-cycles-backend    #   20.70% backend  cycles idle    [67.76%]
>      53,781,450,212 instructions              #    0.22  insns per cycle
>                                               #    3.71  stalled cycles per insn [83.77%]
>       8,738,372,177 branches                  #   97.853 M/sec                   [82.99%]
>         119,158,960 branch-misses             #    1.36% of all branches         [83.17%]
>
>       200.032331409 seconds time elapsed
>
> lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
> lpq83:~# perf stat ./netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K
> OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
> Local       Remote      Local   Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
> Send Socket Recv Socket Send    Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
> Size        Size        Size    (sec)                          Util  Util   Util   Util   Demand  Demand  Units
> Final       Final                                              %     Method %      Method
> 1862520     6291456     262144  20.00   17464.08   10^6bits/s  3.98  S      -1.00  U      0.448   -1.000  usec/KB
>
>   Performance counter stats for './netperf -t omni -l 20 -H 7.7.7.84 -c -i 10,3 -- -m 256K':
>
>       111290.768845 task-clock                #    0.556 CPUs utilized
>           2,818,205 context-switches          #    0.025 M/sec
>                 201 CPU-migrations            #    0.002 K/sec
>                 453 page-faults               #    0.004 K/sec
>     297,763,550,604 cycles                    #    2.676 GHz                     [83.35%]
>     246,839,427,685 stalled-cycles-frontend   #   82.90% frontend cycles idle    [83.25%]
>      75,450,669,370 stalled-cycles-backend    #   25.34% backend  cycles idle    [66.69%]
>      63,464,955,178 instructions              #    0.21  insns per cycle
>                                               #    3.89  stalled cycles per insn [83.38%]
>      10,564,139,626 branches                  #   94.924 M/sec                   [83.39%]
>         248,015,797 branch-misses             #    2.35% of all branches         [83.32%]
>
>       200.028775802 seconds time elapsed
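
A little arithmetic on the two runs quoted above (counts copied from the perf
stat output; the ~200 s elapsed time is the 20 s test repeated by the
-i 10,3 confidence logic) shows the size of the wakeup-rate change:

```python
# Context-switch rates for the two omni runs quoted above
# (tcp_notsent_lowat disabled vs. set to 131072).
runs = {
    "lowat=-1":     {"ctx_switches": 349_509,   "elapsed_s": 200.032331409},
    "lowat=131072": {"ctx_switches": 2_818_205, "elapsed_s": 200.028775802},
}

for name, r in runs.items():
    rate = r["ctx_switches"] / r["elapsed_s"]
    print(f"{name}: {rate:,.0f} context switches/sec")
```

Roughly 1.7 k/sec without the limit versus ~14 k/sec with it, matching the
per-second figure quoted later in the thread (which divides by the nominal
200 s).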

Side warning about the omni test path - it does not emit the "You didn't 
hit the confidence interval" warnings like the classic/migrated path 
did/does.  To see the actual width of the confidence interval you need 
to use the omni output selectors:

$ netperf -- -O ? | grep CONFID
CONFIDENCE_LEVEL
CONFIDENCE_INTERVAL
CONFIDENCE_ITERATION
THROUGHPUT_CONFID
LOCAL_CPU_CONFID
REMOTE_CPU_CONFID

You may want to see CONFIDENCE_ITERATION (how many times did it repeat 
the test) and then THROUGHPUT_CONFID and LOCAL_CPU_CONFID.  You may also 
find:

$ netperf -- -O ? | grep PEAK
LOCAL_CPU_PEAK_UTIL
LOCAL_CPU_PEAK_ID
REMOTE_CPU_PEAK_UTIL
REMOTE_CPU_PEAK_ID

interesting - those will be the utilizations and IDs of the most 
utilized CPUs on the system.

>
> 14,091 context switches per second...
>
> Interesting how it actually increases throughput !

And the service demand went up almost 20% this time :) (19.8)  That it 
has happened again lends credence to it being a real difference.

If it causes smaller-on-average TSO sends, perhaps it is triggering 
greater parallelism between the NIC(s) and the host(s)?

happy benchmarking,

rick

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-22 23:08     ` Rick Jones
@ 2013-07-23  0:13       ` Eric Dumazet
  2013-07-23  0:40         ` Eric Dumazet
  2013-07-23 15:25         ` Rick Jones
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-23  0:13 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Mon, 2013-07-22 at 16:08 -0700, Rick Jones wrote:
> On 07/22/2013 03:44 PM, Eric Dumazet wrote:
> > Hi Rick
> >
> >> Netperf is perhaps a "best case" for this as it has no think time and
> >> will not itself build-up a queue of data internally.
> >>
> >> The 18% increase in service demand is troubling.
> >
> > Its not troubling at such high speed. (Note also I had better throughput
> > in my (single) test)
> 
> Yes, you did, but that was only 5.4%, and it may be in an area where 
> there is non-trivial run to run variation.
> 
> I would think an increase in service demand is even more troubling at 
> high speeds than low speeds.  Particularly when I'm still not at link-rate.
> 

If I wanted link-rate, I would use TCP_SENDFILE, and unfortunately be
slowed down by the receiver ;)

> In theory anyway, the service demand is independent of the transfer 
> rate.  Of course, practice dictates that different algorithms have 
> different behaviours at different speeds, but in slightly sweeping 
> handwaving, if the service demand went up 18% that cut your maximum 
> aggregate throughput for the "infinitely fast link" or collection of 
> finitely fast links in the system by 18%.
> 
> I suppose that brings up the question of what the aggregate throughput 
> and CPU utilization was for your 200 concurrent netperf TCP_STREAM sessions.

I am not sure I want to add 1000 lines of detailed netperf results to
the changelog. Even then, they would only be meaningful for my lab machines.


> 
> > Process scheduler cost is abysmal (Or more exactly when cpu enters idle
> > mode I presume).
> >
> > Adding a context switch for every TSO packet is obviously not something
> > you want if you want to pump 20Gbps on a single tcp socket.
> 
> You wouldn't want it if you were pumping 20 Gbit/s down multiple TCP 
> sockets either I'd think.

No difference as a matter of fact, as each netperf _will_ schedule
anyway once a queue builds in the Qdisc layer.



> 
> > I guess that real application would not use 16KB send()s either.
> 
> You can use a larger send in netperf - the 16 KB is only because that is 
> the default initial SO_SNDBUF size under Linux :)
> 
> > I chose extreme parameters to show that the patch had acceptable impact.
> > (128KB are only 2 TSO packets)
> >
> > The main targets of this patch are servers handling hundred to million
> > of sockets, or any machine with RAM constraints. This would also permit
> > better autotuning in the future. Our current 4MB limit is a bit small in
> > some cases.
> >
> > Allowing the socket write queue to queue more bytes is better for
> > throughput/cpu cycles, as long as you have enough RAM.
> 
> So, netperf doesn't queue internally - what happens when the application 
> does queue internally?  Admittedly, it will be user-space memory (I 
> assume) rather than kernel memory, which I suppose is better since it 
> can be paged and whatnot.  But if we drop the qualifiers, it is still 
> the same quantity of memory overall right?
> 
> By the way, does this affect sendfile() or splice()?

Sure: the patch intercepts sk_stream_memory_free() for all its callers.

10Gb link experiment with sendfile():

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.17.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    20.00      9372.56   1.69     -1.00    0.355   -1.000 

 Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':

            16,188 context-switches                                            

      20.006998098 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.17.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    20.00      9408.33   1.75     -1.00    0.366   -1.000 

 Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':

           714,395 context-switches                                            

      20.004409659 seconds time elapsed
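
For completeness, the per-socket knob discussed in this thread can be
exercised from user space roughly as follows (a sketch: option number 25 is
assumed for TCP_NOTSENT_LOWAT, the value the merged patch uses, and recent
Python versions expose it as socket.TCP_NOTSENT_LOWAT):

```python
import socket

# Sketch: set the not-sent low-water mark on a single TCP socket.
# Fall back to option number 25 if this Python lacks the constant.
TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 128 * 1024)
limit = s.getsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT)
print(limit)  # 131072 on a kernel carrying this patch
s.close()
```

Kernels without the patch reject the setsockopt() call, so applications
should be prepared to fall back to the sysctl default.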


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-23  0:13       ` Eric Dumazet
@ 2013-07-23  0:40         ` Eric Dumazet
  2013-07-23  1:20           ` Hannes Frederic Sowa
  2013-07-23  2:32           ` Eric Dumazet
  2013-07-23 15:25         ` Rick Jones
  1 sibling, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-23  0:40 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Mon, 2013-07-22 at 17:13 -0700, Eric Dumazet wrote:

> 
>  Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':
> 
>            714,395 context-switches                                            

Hmm, actually I need to send a v3, because sk_stream_write_space() is
waking sockets too often.


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-23  0:40         ` Eric Dumazet
@ 2013-07-23  1:20           ` Hannes Frederic Sowa
  2013-07-23  1:33             ` Eric Dumazet
  2013-07-23  2:32           ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Hannes Frederic Sowa @ 2013-07-23  1:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rick Jones, David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael Kerrisk

On Mon, Jul 22, 2013 at 05:40:13PM -0700, Eric Dumazet wrote:
> On Mon, 2013-07-22 at 17:13 -0700, Eric Dumazet wrote:
> 
> > 
> >  Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':
> > 
> >            714,395 context-switches                                            
> 
> Hmm, actually I need to send a v3, because sk_stream_write_space() is
> waking sockets too often.

Do you implement SO_SNDLOWAT? :)

Greetings,

  Hannes


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-23  1:20           ` Hannes Frederic Sowa
@ 2013-07-23  1:33             ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-23  1:33 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Rick Jones, David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael Kerrisk

On Tue, 2013-07-23 at 03:20 +0200, Hannes Frederic Sowa wrote:
> On Mon, Jul 22, 2013 at 05:40:13PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-07-22 at 17:13 -0700, Eric Dumazet wrote:
> > 
> > > 
> > >  Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':
> > > 
> > >            714,395 context-switches                                            
> > 
> > Hmm, actually I need to send a v3, because sk_stream_write_space() is
> > waking sockets too often.
> 
> Do you implement SO_SNDLOWAT? :)

Not exactly.

This is going to be a FAQ I guess ;)
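
The distinction is worth spelling out: on Linux, SO_SNDLOWAT has long been
hard-wired to 1 byte and cannot be changed (setsockopt() fails with
ENOPROTOOPT), which is why the not-sent limit arrives as a new option rather
than as an implementation of the old one. A quick check of that long-standing
behaviour:

```python
import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Linux reports a fixed 1-byte send low-water mark for SO_SNDLOWAT ...
lowat = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDLOWAT)
print(lowat)  # 1

# ... and does not allow changing it: setsockopt() fails on Linux.
try:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDLOWAT, 4096)
    err = None
except OSError as e:
    err = e.errno
print(errno.errorcode.get(err, "no error"))  # ENOPROTOOPT on Linux
s.close()
```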


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-23  0:40         ` Eric Dumazet
  2013-07-23  1:20           ` Hannes Frederic Sowa
@ 2013-07-23  2:32           ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-23  2:32 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Mon, 2013-07-22 at 17:40 -0700, Eric Dumazet wrote:
> On Mon, 2013-07-22 at 17:13 -0700, Eric Dumazet wrote:
> 
> > 
> >  Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':
> > 
> >            714,395 context-switches                                            
> 
> Hmm, actually I need to send a v3, because sk_stream_write_space() is
> waking sockets too often.
> 

Yep, the new results are more in line with what's expected:

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.17.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    20.00      9408.18   1.60     -1.00    0.333   -1.000 

 Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':

            49,543 context-switches                                            

      20.005432791 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.17.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    20.00      9407.71   1.55     -1.00    0.323   -1.000 

 Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':

           345,670 context-switches                                            

      20.004435166 seconds time elapsed


And with the receiver disabling LRO/GRO, no real difference this time:

lpq83:~# perf stat -e context-switches ./netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c
TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.17.84 () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    20.00      9382.90   2.14     -1.00    0.448   -1.000 

 Performance counter stats for './netperf -H 10.246.17.84 -l 20 -t TCP_SENDFILE -c':

           336,501 context-switches                                            

      20.004579650 seconds time elapsed
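
Putting the three TCP_SENDFILE runs side by side (v2 before the
sk_stream_write_space() fix, then v3 with the limit disabled and enabled;
counts copied from the perf stat output in this sub-thread):

```python
# Context-switch rates for the 20 s TCP_SENDFILE runs quoted above.
runs = [
    ("v2, lowat=131072", 714_395, 20.004409659),
    ("v3, lowat=-1",      49_543, 20.005432791),
    ("v3, lowat=131072", 345_670, 20.004435166),
]

for name, switches, elapsed in runs:
    print(f"{name}: {switches / elapsed:,.0f} context switches/sec")
```

The v3 fix roughly halves the wakeup rate versus v2 at the same 128 KB
setting; the residual gap against the disabled case is the cost of the
tighter bound.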


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-23  0:13       ` Eric Dumazet
  2013-07-23  0:40         ` Eric Dumazet
@ 2013-07-23 15:25         ` Rick Jones
  2013-07-23 15:28           ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Rick Jones @ 2013-07-23 15:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On 07/22/2013 05:13 PM, Eric Dumazet wrote:
> On Mon, 2013-07-22 at 16:08 -0700, Rick Jones wrote:
>> By the way, does this affect sendfile() or splice()?
>
> Sure : Patch intercepts sk_stream_memory_free() for all its callers.
>
> 10Gb link 'experiment with sendfile()' :

Why not the same 20 Gb (?) link you used with the other experiments?

rick


* Re: [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option
  2013-07-23 15:25         ` Rick Jones
@ 2013-07-23 15:28           ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2013-07-23 15:28 UTC (permalink / raw)
  To: Rick Jones
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell, Michael Kerrisk

On Tue, 2013-07-23 at 08:25 -0700, Rick Jones wrote:
> On 07/22/2013 05:13 PM, Eric Dumazet wrote:
> > On Mon, 2013-07-22 at 16:08 -0700, Rick Jones wrote:
> >> By the way, does this affect sendfile() or splice()?
> >
> > Sure : Patch intercepts sk_stream_memory_free() for all its callers.
> >
> > 10Gb link 'experiment with sendfile()' :
> 
> Why not the same 20 Gb (?) link you used with the other experiments?

Why not? I have different kinds of links ;)


end of thread, other threads:[~2013-07-23 15:28 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-22 19:13 [PATCH net-next] tcp: TCP_NOSENT_LOWAT socket option Eric Dumazet
2013-07-22 19:28 ` Eric Dumazet
2013-07-22 20:43 ` Rick Jones
2013-07-22 22:44   ` Eric Dumazet
2013-07-22 23:08     ` Rick Jones
2013-07-23  0:13       ` Eric Dumazet
2013-07-23  0:40         ` Eric Dumazet
2013-07-23  1:20           ` Hannes Frederic Sowa
2013-07-23  1:33             ` Eric Dumazet
2013-07-23  2:32           ` Eric Dumazet
2013-07-23 15:25         ` Rick Jones
2013-07-23 15:28           ` Eric Dumazet
