netdev.vger.kernel.org archive mirror
* [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
@ 2011-08-15  5:38 Jun.Kondo
  2011-08-15  5:47 ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Jun.Kondo @ 2011-08-15  5:38 UTC
  To: linux-kernel
  Cc: omega-g1, notsuki, Kozaki, Motokazu, Hajime Taira, netdev,
	TomohikoTAKAHASHI, Kotaro Sakai, ken sugawara

CTC had the following requirements:

1. to ensure high throughput from the beginning of
a TCP connection under normal conditions by using a
large default transmission buffer

2. to limit how long a write can block, in order to
prevent upper-layer applications from timing out
even when the connection has low throughput, such
as low-rate streaming


The root of the issue:

Requirement 2 cannot be met with a configuration that
satisfies requirement 1.

The current behavior is as follows:

A write blocks when the TCP transmission buffer (wmem)
becomes full.
To write again after that, at least one third of the
transmission buffer must be free: sk_stream_min_wspace()
requires free space of sk_wmem_queued/2 (sk_wmem_queued >> 1).

When throughput is low, the application times out
before that much buffer space has been freed, which
disrupts the streaming service.


The effect of the patch:

By setting the new variable sysctl_tcp_lowat
(net.core.tcp_lowat) to an integer such as 4, the
threshold becomes sk_wmem_queued >> 4 (one sixteenth
of the queued data instead of one half), so writing
resumes after far less of the buffer has drained, and
timeouts will not occur in low-throughput network
environments.

Also, we think the fixed one-third threshold (free
space of sk_wmem_queued/2) is too rigid, and it should
be configurable.
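To make the proposed knob concrete, here is a small userspace model in C of the wake-up test sk_stream_wspace(sk) >= sk_stream_min_wspace(sk), with the shift passed in as a parameter the way the patch reads it from sysctl_tcp_lowat. It is a sketch, not kernel code: the function names are hypothetical, and sk_stream_wspace() is modelled as sk_sndbuf - sk_wmem_queued.

```c
#include <stdbool.h>

/* Model of sk_stream_min_wspace(); `shift` stands in for the proposed
 * sysctl_tcp_lowat (1 reproduces today's fixed ">> 1"). */
int stream_min_wspace(int wmem_queued, unsigned shift)
{
    return wmem_queued >> shift;
}

/* Model of sk_stream_wspace(): free space left in the send buffer. */
int stream_wspace(int sndbuf, int wmem_queued)
{
    return sndbuf - wmem_queued;
}

/* Wake-up condition for a writer blocked on a full send buffer. */
bool stream_writable(int sndbuf, int wmem_queued, unsigned shift)
{
    return stream_wspace(sndbuf, wmem_queued) >=
           stream_min_wspace(wmem_queued, shift);
}

/* Largest amount of queued data at which writing resumes; the free space
 * at that point is about sndbuf / (2^shift + 1). */
int max_queued_when_writable(int sndbuf, unsigned shift)
{
    int q = sndbuf;
    while (q > 0 && !stream_writable(sndbuf, q, shift))
        q--;
    return q;
}
```

With shift 1 the writer resumes once a third of the buffer is free; with shift 4, once about a seventeenth is free. The model assumes the shift stays below the width of int: shifting by a negative amount or by 32 or more is undefined in C, and the plain proc_dointvec handler in the patch does not reject such values.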

--------------------------------------------------
--- linux-mainline/include/net/sock.h.orig	2011-07-27 14:26:43.000000000 +0900
+++ linux-mainline/include/net/sock.h	2011-08-15 11:40:20.000000000 +0900
@@ -604,9 +604,11 @@ static inline int sk_acceptq_is_full(str
 /*
  * Compute minimal free write space needed to queue new packets.
  */
+extern __u32 sysctl_tcp_lowat;
+
 static inline int sk_stream_min_wspace(struct sock *sk)
 {
-	return sk->sk_wmem_queued >> 1;
+	return sk->sk_wmem_queued >> sysctl_tcp_lowat;
 }
 
 static inline int sk_stream_wspace(struct sock *sk)
--- linux-mainline/net/core/sock.c.orig	2011-07-24 05:04:06.000000000 +0900
+++ linux-mainline/net/core/sock.c	2011-08-15 11:34:27.000000000 +0900
@@ -217,6 +217,9 @@ __u32 sysctl_rmem_max __read_mostly = SK
 __u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
 __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 
+__u32 sysctl_tcp_lowat = 1;
+EXPORT_SYMBOL(sysctl_tcp_lowat);
+
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
@@ -1330,6 +1333,8 @@ void __init sk_init(void)
 		sysctl_wmem_max = 131071;
 		sysctl_rmem_max = 131071;
 	}
+
+	sysctl_tcp_lowat = 1;
 }
 
 /*
--- linux-mainline/net/core/sysctl_net_core.c.orig	2011-05-29 06:01:16.000000000 +0900
+++ linux-mainline/net/core/sysctl_net_core.c	2011-08-15 11:05:38.000000000 +0900
@@ -168,6 +168,13 @@ static struct ctl_table net_core_table[]
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+	{
+		.procname	= "tcp_lowat",
+		.data		= &sysctl_tcp_lowat,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec
+	},
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",

--------------------------------------------------

------------------------------------------
Jun.Kondo
ITOCHU TECHNO-SOLUTIONS Corporation(CTC)
tel:+81-3-6238-6607
fax:+81-3-5226-2369
------------------------------------------

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-15  5:38 [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace() Jun.Kondo
@ 2011-08-15  5:47 ` David Miller
  2011-08-19  9:28   ` [omega-g1:10937] " Jun.Kondo
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2011-08-15  5:47 UTC
  To: jun.kondo
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Mon, 15 Aug 2011 14:38:11 +0900

> 2. to limit the block time of the write in order to
> prevent the timeout of upper layer applications
> even when the connection has low throughput, such
> as low rate streaming

Use non-blocking writes if you want this behavior.
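A minimal sketch of what non-blocking writes can look like in C, assuming a POSIX system with poll(2). The helper name send_with_timeout and its per-wait timeout policy are illustrative assumptions, not code from this thread. The socket itself stays in blocking mode; MSG_DONTWAIT makes each individual send() non-blocking, and poll() bounds how long the application waits for buffer space.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Queue up to len bytes without ever sleeping inside send(). Whenever the
 * send buffer is full, wait in poll() for at most timeout_ms. Returns the
 * number of bytes queued (short if a wait expired after some progress), or
 * -1 with errno set: ETIMEDOUT if nothing could be queued in time, or the
 * send()/poll() error (e.g. EPIPE, ECONNRESET) for a real failure.
 */
ssize_t send_with_timeout(int fd, const char *buf, size_t len, int timeout_ms)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = send(fd, buf + done, len - done,
                         MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n > 0) {
            done += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted by a signal: retry */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* hard error on the connection */

        /* Send buffer is full: let the application bound the wait. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int r = poll(&pfd, 1, timeout_ms);
        if (r < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0) {                   /* still no room after timeout_ms */
            if (done == 0) {
                errno = ETIMEDOUT;
                return -1;
            }
            break;                      /* report the partial progress */
        }
    }
    return (ssize_t)done;
}
```

For TCP, poll() reports POLLOUT using the same sk_stream_wspace() >= sk_stream_min_wspace() test, so this approach does not change the kernel's wake-up threshold; what changes is that the application, rather than a sleeping write(), decides when to give up and knows how much was queued.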


* Re: [omega-g1:10937] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-15  5:47 ` David Miller
@ 2011-08-19  9:28   ` Jun.Kondo
  2011-08-19  9:43     ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Jun.Kondo @ 2011-08-19  9:28 UTC
  To: David Miller
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

You suggested using non-blocking writes, but we think
that would require rewriting the Apache code, that is,
making an architecture-dependent modification to Apache.
With this patch, applications that are difficult to
change on the application side can be handled by a
small configuration change on the kernel side.



(2011/08/15 14:47), David Miller wrote:
> From: "Jun.Kondo"<jun.kondo@ctc-g.co.jp>
> Date: Mon, 15 Aug 2011 14:38:11 +0900
>
>> 2. to limit the block time of the write in order to
>> prevent the timeout of upper layer applications
>> even when the connection has low throughput, such
>> as low rate streaming
> Use non-blocking writes if you want this behavior.
>




* Re: [omega-g1:10937] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-19  9:28   ` [omega-g1:10937] " Jun.Kondo
@ 2011-08-19  9:43     ` David Miller
  2011-08-22  0:33       ` [omega-g1:11072] " Jun.Kondo
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2011-08-19  9:43 UTC
  To: jun.kondo
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Fri, 19 Aug 2011 18:28:45 +0900

> You suggested to use non-blocking writes, but we think
> we have to rewrite the Apache code if doing so.
> That is, we have to make a modification to Apache that
> depends on the architecture.
> By using this patch, it can be handled by changing the
> configuration a little bit on the kernel side for such
> applications that it is difficult to do so on application
> side.

The kernel provides the facilities necessary to achieve your
goals.  It is a userspace problem.


* Re: [omega-g1:11072] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-19  9:43     ` David Miller
@ 2011-08-22  0:33       ` Jun.Kondo
  2011-08-22 14:21         ` Hagen Paul Pfeifer
  2011-08-22 18:35         ` David Miller
  0 siblings, 2 replies; 11+ messages in thread
From: Jun.Kondo @ 2011-08-22  0:33 UTC
  To: David Miller
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

With this patch, we want to prevent timeouts on networks that have low throughput but are still available.

However, in the current implementation, with both blocking and non-blocking sockets,
we think a user process cannot tell in detail why
a write to the socket buffer failed:

Is it (really) a network problem?
Or is there not enough free wmem to write?

For these reasons, we think it is difficult for user processes to handle write timeouts on the socket buffer
when wmem is configured to a large value (to ensure high throughput over high-latency networks such as 3G).


(2011/08/19 18:43), David Miller wrote:
> From: "Jun.Kondo"<jun.kondo@ctc-g.co.jp>
> Date: Fri, 19 Aug 2011 18:28:45 +0900
>
>> You suggested to use non-blocking writes, but we think
>> we have to rewrite the Apache code if doing so.
>> That is, we have to make a modification to Apache that
>> depends on the architecture.
>> By using this patch, it can be handled by changing the
>> configuration a little bit on the kernel side for such
>> applications that it is difficult to do so on application
>> side.
> The kernel provides the facilities necessary to achieve your
> goals.  It is a userspace problem.
>


-- 
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Jun Kondo (近藤 潤)
v ITOCHU TECHNO-SOLUTIONS Corporation (CTC)
v System Technology Dept. 1, Technical Section 4
v Direct: 03-6757-2144
v FAX: 03-5800-2256
v
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<


* Re: [omega-g1:11072] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-22  0:33       ` [omega-g1:11072] " Jun.Kondo
@ 2011-08-22 14:21         ` Hagen Paul Pfeifer
  2011-08-22 18:35         ` David Miller
  1 sibling, 0 replies; 11+ messages in thread
From: Hagen Paul Pfeifer @ 2011-08-22 14:21 UTC
  To: jun.kondo
  Cc: David Miller, linux-kernel, omega-g1, notsuki, motokazu.kozaki,
	htaira, netdev, tomohiko.takahashi, kotaro.sakai, ken.sugawara

On Mon, 22 Aug 2011 09:33:52 +0900, "Jun.Kondo" wrote:

> By using this patch, we want to prevent "timeout occured over the network
> that is low throughput but available".
>
> But in the current implementation, both blocking and non-blocking,
> user processes can't recognize the reason in detail
> when failed to write to socket buffer, we think.

For your application it should not matter WHY the data can't be written to
the peer. It can happen that the peer closes the window, or there is some
scheduling bottleneck or whatever else. A blocking socket means for you
that some data is in the pipe, waiting for transmission. This is the
knowledge that you require, and you should deal with it. A blocking socket
does not mean FAILED; a failure is returned via ECONNRESET or otherwise. So
everything is fine when your socket blocks. Probably you should adjust your
Apache timeouts or other parts of the program logic.

> As stated above, we think it is difficult for user processes to handle
> timeout of writing socket buffer,
> when wmem is configured large value.(to ensure high throughput over the
> high ralency network, like 3G).

No, you should adjust your code and account for the fact that the socket
has data in the pipe. That's all.

Changing tcp_lowat
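Hagen's point that a blocked socket simply has data in the pipe can be observed directly. On Linux, the SIOCOUTQ ioctl documented in tcp(7) reports how much is still held in a socket's send queue (for TCP, the kernel counts data written but not yet acknowledged by the peer). The helper below is a hypothetical sketch; its test uses an AF_UNIX socketpair, where the value is charged buffer memory rather than payload bytes.

```c
#include <linux/sockios.h>   /* SIOCOUTQ */
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Amount still queued on fd's send side, or -1 on error (Linux-specific).
 * A non-zero value while a write is blocked means data is in the pipe,
 * not that the connection has failed. */
int send_queue_bytes(int fd)
{
    int pending = 0;
    if (ioctl(fd, SIOCOUTQ, &pending) < 0)
        return -1;
    return pending;
}
```

An application could log or export this value next to its own write timeout, to see whether a slow write is waiting on a live but slow peer.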


* Re: [omega-g1:11072] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-22  0:33       ` [omega-g1:11072] " Jun.Kondo
  2011-08-22 14:21         ` Hagen Paul Pfeifer
@ 2011-08-22 18:35         ` David Miller
  2011-08-25  4:46           ` [omega-g1:11110] " Jun.Kondo
  1 sibling, 1 reply; 11+ messages in thread
From: David Miller @ 2011-08-22 18:35 UTC
  To: jun.kondo
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Mon, 22 Aug 2011 09:33:52 +0900

> is it (really) network problem ?
> or is wmem not enough free to write?

Oh yes you can indeed make this determination, by using the socket
timeouts via the SO_RCVTIMEO and SO_SNDTIMEO socket options.

Timeouts, when hit, will return -EINTR, whereas lack of buffer space
on a non-blocking socket will return -EAGAIN.

I think you simply are unaware of the facilities available in the BSD
socket API.
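A sketch of the SO_SNDTIMEO approach, assuming Linux/POSIX; the enum and helper names are hypothetical. One detail differs from the message above: socket(7) documents that a blocking send that hits SO_SNDTIMEO returns a partial count, or fails with EAGAIN or EWOULDBLOCK if no data was transferred, while EINTR reports interruption by a signal. The classification below follows socket(7).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

enum send_outcome { SEND_OK, SEND_PARTIAL, SEND_TIMED_OUT,
                    SEND_INTERRUPTED, SEND_FAILED };

/* Bound every blocking send() on fd to ms milliseconds. */
int set_send_timeout(int fd, long ms)
{
    struct timeval tv = { .tv_sec = ms / 1000, .tv_usec = (ms % 1000) * 1000 };
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);
}

/* One blocking send(), classified so the caller can tell a slow peer
 * (buffer stayed full until the timer fired) from a broken connection. */
enum send_outcome send_classified(int fd, const char *buf, size_t len,
                                  size_t *sent)
{
    ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
    if (n >= 0) {
        *sent = (size_t)n;
        /* A short count: the timer expired (or a signal arrived) after
         * some data had already been queued. */
        return (size_t)n == len ? SEND_OK : SEND_PARTIAL;
    }
    *sent = 0;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return SEND_TIMED_OUT;    /* no room freed before SO_SNDTIMEO expired */
    if (errno == EINTR)
        return SEND_INTERRUPTED;  /* signal arrived before any data was sent */
    return SEND_FAILED;           /* e.g. EPIPE or ECONNRESET */
}
```

SEND_PARTIAL and SEND_TIMED_OUT both mean "the peer is slow, the connection is alive", which is exactly the distinction asked about above.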


* Re: [omega-g1:11110] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-22 18:35         ` David Miller
@ 2011-08-25  4:46           ` Jun.Kondo
  2011-08-25  5:00             ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Jun.Kondo @ 2011-08-25  4:46 UTC
  To: David Miller
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

Currently, once the transmission buffer becomes full, it is not
possible to write again until one third of the transmission
buffer is free.

Our modification request is not intended to change the behavior
of the OS itself, but to make the value "one third" configurable
instead of fixed.

Thus it would still be possible to keep the value at 1/3
(the patch's default, sysctl_tcp_lowat = 1).

So, could you please tell us why it is not acceptable to make
it configurable, and why the value must stay fixed at 1/3?


* Re: [omega-g1:11110] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-25  4:46           ` [omega-g1:11110] " Jun.Kondo
@ 2011-08-25  5:00             ` David Miller
  2011-09-09  1:33               ` Jun.Kondo
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2011-08-25  5:00 UTC
  To: jun.kondo
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Thu, 25 Aug 2011 13:46:58 +0900

> Currently, once the transmission buffer becomes full, it is not
> possible to write again unless there is one third of free space
> in the transmission buffer.

Then use a non-blocking socket if you don't want to block.

We're talking in circles, and will walk down the same discussions
again.  You have still not shown what real limitation is created
by the way things work currently.

I've said everything that I can, and I will thus recuse myself from
the rest of this discussion since I really can't add anything more.


* Re: [omega-g1:11110] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-08-25  5:00             ` David Miller
@ 2011-09-09  1:33               ` Jun.Kondo
  2011-09-09  2:17                 ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Jun.Kondo @ 2011-09-09  1:33 UTC
  To: David Miller
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

The clients of this system are cellular phones, and the
state of the communication line to a client varies
widely depending on its location and on congestion.

The line speed can be around 9 Mbps when it is fast,
but as low as 8 kbps when it is slow.

The customer requires stable service in both situations:

- In normal situations, use a large default transmission
   buffer and ensure high throughput from the
   beginning of the TCP connection

- On the other hand, even when the connection has low
   throughput, such as low-rate streaming, transmit data
   without timeouts

However, when throughput is low, it takes a long time
for the transmission buffer to be freed, and a timeout
occurs during that period.

Of course, the connection would not be dropped if the
application timeout were extended, but end users will
not wait patiently for as long as one minute.
Therefore, we do not want to extend the timeout value.

By making the threshold at which writing becomes possible
again after a blocked write configurable, and setting it
to a small value, it will be possible to return data to
the client without a timeout occurring.

So, we think the issue can be solved with this
modification.
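The timing argument above can be checked with rough arithmetic. This sketch assumes a hypothetical 256 KiB send buffer (the thread does not state the buffer size actually used), a constant line rate, and no skb overhead, ACK delay, or retransmissions; the function name is illustrative. With the patch's shift n, a writer blocked on a full buffer needs about sndbuf / (2^n + 1) bytes freed before it is woken.

```c
/*
 * Back-of-the-envelope estimate: seconds until a writer blocked on a full
 * send buffer is woken, for a wake-up shift `shift` (1 = today's ">> 1")
 * and a steady line rate in bits per second.
 */
double seconds_until_writable(double sndbuf_bytes, unsigned shift,
                              double line_bps)
{
    /* Bytes that must be acknowledged and freed before wake-up. */
    double must_free = sndbuf_bytes / (double)((1u << shift) + 1u);
    /* Convert the line rate to bytes per second. */
    return must_free / (line_bps / 8.0);
}
```

Under these assumptions, shift 1 at 8 kbps gives roughly 87 seconds, longer than the one-minute patience mentioned above, while shift 4 gives roughly 15 seconds; at 9 Mbps, shift 1 needs well under 0.1 second.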


* Re: [omega-g1:11110] Re: [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace()
  2011-09-09  1:33               ` Jun.Kondo
@ 2011-09-09  2:17                 ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2011-09-09  2:17 UTC
  To: jun.kondo
  Cc: linux-kernel, omega-g1, notsuki, motokazu.kozaki, htaira, netdev,
	tomohiko.takahashi, kotaro.sakai, ken.sugawara

From: "Jun.Kondo" <jun.kondo@ctc-g.co.jp>
Date: Fri, 09 Sep 2011 10:33:58 +0900

> - In normal situation, acquire large default transmission
>   buffer value, and ensure high throughput from the
>   beginning of tcp connection

You should never do this.  You should use the default buffer sizes and
as a result the kernel's TCP stack automatically adjusts the send and
receive buffers in response to the link characteristics.

When you set explicit buffer sizes, this turns off the TCP stack's
auto-tuning mechanism.

Every argument made in support of your proposed feature is based upon
a false premise of one kind or another, and this is yet another example
of this.


Thread overview: 11+ messages
2011-08-15  5:38 [PATCH] net: configurable sysctl parameter "net.core.tcp_lowat" for sk_stream_min_wspace() Jun.Kondo
2011-08-15  5:47 ` David Miller
2011-08-19  9:28   ` [omega-g1:10937] " Jun.Kondo
2011-08-19  9:43     ` David Miller
2011-08-22  0:33       ` [omega-g1:11072] " Jun.Kondo
2011-08-22 14:21         ` Hagen Paul Pfeifer
2011-08-22 18:35         ` David Miller
2011-08-25  4:46           ` [omega-g1:11110] " Jun.Kondo
2011-08-25  5:00             ` David Miller
2011-09-09  1:33               ` Jun.Kondo
2011-09-09  2:17                 ` David Miller
