netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH next resend] tcp: use zero-window when free_space is low
@ 2014-02-13 11:52 Florian Westphal
  2014-02-13 14:58 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Florian Westphal @ 2014-02-13 11:52 UTC (permalink / raw)
  To: netdev; +Cc: Florian Westphal, Neal Cardwell, Eric Dumazet, Yuchung Cheng

Currently the kernel tries to announce a zero window when free_space
is below the current receiver mss estimate.

When a sender is transmitting small packets and reader consumes data
slowly (or not at all), receiver might be unable to shrink the receive
win because

a) we cannot withdraw already-commited receive window, and,
b) we have to round the current rwin up to a multiple of the wscale
   factor, else we would shrink the current window.

This causes the receive buffer to fill up until the rmem limit is hit.
When this happens, we start dropping packets.

Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
even if socket is not being read from.

As we cannot avoid the "current_win is rounded up to multiple of mss"
issue [we would violate a) above] at least try to prevent the receive buf
growth towards tcp_rmem[2] limit by attempting to move to zero-window
announcement when free_space becomes less than 1/16 of the current
allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
increase our chances to get a zero-window announcement out in time.

Reproducer:
On server:
$ nc -l -p 12345
<suspend it: CTRL-Z>

Client:
#!/usr/bin/env python
import socket
import time

sock = socket.socket()
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("192.168.4.1", 12345));
while True:
   sock.send('A' * 23)
   time.sleep(0.005)


socket buffer on server-side will grow until tcp_rmem[2] is hit,
at which point the client rexmits data until -EDTIMEOUT:

tcp_data_queue invokes tcp_try_rmem_schedule which will call
tcp_prune_queue which calls tcp_clamp_window().  And that function will
grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].

Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 V1 of this patch was deferred, resending to get discussion going again.
 Changes since v1:
  - add reproducer to commit message

 Unfortunately I couldn't come up with something that has no magic
 ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
 tput limitations in my 'full-mss-sized, steady state' netcat tests.

 Maybe someone has better idea?

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2a69f42..fd8d821 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,7 +2145,8 @@ u32 __tcp_select_window(struct sock *sk)
 	 */
 	int mss = icsk->icsk_ack.rcv_mss;
 	int free_space = tcp_space(sk);
-	int full_space = min_t(int, tp->window_clamp, tcp_full_space(sk));
+	int allowed_space = tcp_full_space(sk);
+	int full_space = min_t(int, tp->window_clamp, allowed_space);
 	int window;
 
 	if (mss > full_space)
@@ -2158,7 +2159,19 @@ u32 __tcp_select_window(struct sock *sk)
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
-		if (free_space < mss)
+		/* free_space might become our new window, make sure we don't
+		 * increase it due to wscale.
+		 */
+		free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+
+		/* if free space is less than mss estimate, or is below 1/16th
+		 * of the maximum allowed, try to move to zero-window, else
+		 * tcp_clamp_window() will grow rcv buf up to tcp_rmem[2], and
+		 * new incoming data is dropped due to memory limits.
+		 * With large window, mss test triggers way too late in order
+		 * to announce zero window in time before rmem limit kicks in.
+		 */
+		if (free_space < (allowed_space >> 4) || free_space < mss)
 			return 0;
 	}
 
-- 
1.8.1.5

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-13 11:52 [PATCH next resend] tcp: use zero-window when free_space is low Florian Westphal
@ 2014-02-13 14:58 ` Eric Dumazet
  2014-02-13 15:34   ` Florian Westphal
  2014-02-13 17:18 ` Rick Jones
  2014-02-17 19:34 ` David Miller
  2 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2014-02-13 14:58 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Neal Cardwell, Yuchung Cheng

On Thu, 2014-02-13 at 12:52 +0100, Florian Westphal wrote:
> Currently the kernel tries to announce a zero window when free_space
> is below the current receiver mss estimate.
> 
> When a sender is transmitting small packets and reader consumes data
> slowly (or not at all), receiver might be unable to shrink the receive
> win because
> 
> a) we cannot withdraw already-commited receive window, and,
> b) we have to round the current rwin up to a multiple of the wscale
>    factor, else we would shrink the current window.
> 
> This causes the receive buffer to fill up until the rmem limit is hit.
> When this happens, we start dropping packets.
> 
> Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
> even if socket is not being read from.
> 
> As we cannot avoid the "current_win is rounded up to multiple of mss"
> issue [we would violate a) above] at least try to prevent the receive buf
> growth towards tcp_rmem[2] limit by attempting to move to zero-window
> announcement when free_space becomes less than 1/16 of the current
> allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
> increase our chances to get a zero-window announcement out in time.
> 
> Reproducer:
> On server:
> $ nc -l -p 12345
> <suspend it: CTRL-Z>
> 
> Client:
> #!/usr/bin/env python
> import socket
> import time
> 
> sock = socket.socket()
> sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
> sock.connect(("192.168.4.1", 12345));
> while True:
>    sock.send('A' * 23)
>    time.sleep(0.005)
> 
> 
> socket buffer on server-side will grow until tcp_rmem[2] is hit,
> at which point the client rexmits data until -EDTIMEOUT:
> 
> tcp_data_queue invokes tcp_try_rmem_schedule which will call
> tcp_prune_queue which calls tcp_clamp_window().  And that function will
> grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
> 
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  V1 of this patch was deferred, resending to get discussion going again.
>  Changes since v1:
>   - add reproducer to commit message
> 
>  Unfortunately I couldn't come up with something that has no magic
>  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
>  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> 
>  Maybe someone has better idea?

Thanks a lot Florian looking at this.

Do we have one SNMP counter tracking number of time we took the decision
to send a 0 window ?

Would you mind waiting we run our packetdrill tests before acknowledging
this patch, because I suspect this might have some impact ?

Thanks !

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-13 14:58 ` Eric Dumazet
@ 2014-02-13 15:34   ` Florian Westphal
  2014-02-13 16:19     ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Westphal @ 2014-02-13 15:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netdev, Neal Cardwell, Yuchung Cheng

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Do we have one SNMP counter tracking number of time we took the decision
> to send a 0 window ?

No.

> Would you mind waiting we run our packetdrill tests before acknowledging
> this patch, because I suspect this might have some impact ?

Of course not.  I am very happy that you folks have these kinds of tests
and are willing to double-check.  Take all time you need, there is no
need to haste.

Many Thanks Eric.

Do you think it makes sense to add counters for this?

The caveat is that decision to send 0 window doesn't mean we end
up sending one, since we cannot shrink already offered window.

static u16 tcp_select_window(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    u32 cur_win = tcp_receive_window(tp);
    u32 new_win = __tcp_select_window(sk);

    /* Never shrink the offered window */
    if (new_win < cur_win) {

Would you add SNMP counter for '__tcp_select_window() wants 0 window'
or for 'tcp_select_window() does pick 0 window' ?

[ or even different counters for both ? ]

Cheers,
Florian

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-13 15:34   ` Florian Westphal
@ 2014-02-13 16:19     ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2014-02-13 16:19 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, Neal Cardwell, Yuchung Cheng

On Thu, 2014-02-13 at 16:34 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Would you mind waiting we run our packetdrill tests before acknowledging
> > this patch, because I suspect this might have some impact ?
> 
> Of course not.  I am very happy that you folks have these kinds of tests
> and are willing to double-check.  Take all time you need, there is no
> need to haste.
> 


We have today 348 different packetdrill tests, and this number keeps
increasing ;)

> Many Thanks Eric.
> 
> Do you think it makes sense to add counters for this?
> 
> The caveat is that decision to send 0 window doesn't mean we end
> up sending one, since we cannot shrink already offered window.
> 
> static u16 tcp_select_window(struct sock *sk)
> {
>     struct tcp_sock *tp = tcp_sk(sk);
>     u32 cur_win = tcp_receive_window(tp);
>     u32 new_win = __tcp_select_window(sk);
> 
>     /* Never shrink the offered window */
>     if (new_win < cur_win) {
> 
> Would you add SNMP counter for '__tcp_select_window() wants 0 window'
> or for 'tcp_select_window() does pick 0 window' ?
> 
> [ or even different counters for both ? ]

Ideally, having counters of transitions would be nice.

This would help us to spot regressions in TCP stacks or network drivers,
or wrong application tunings.

One counter tracking number of times a socket went from a non zero
window to a zero window (as you said, I am referring to effective window
being sent)

One counter tracking the reverse.

If it proves being too complex or expensive, a single counter tracking
"win 0" sent to the peers.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-13 11:52 [PATCH next resend] tcp: use zero-window when free_space is low Florian Westphal
  2014-02-13 14:58 ` Eric Dumazet
@ 2014-02-13 17:18 ` Rick Jones
  2014-02-17 19:34 ` David Miller
  2 siblings, 0 replies; 10+ messages in thread
From: Rick Jones @ 2014-02-13 17:18 UTC (permalink / raw)
  To: Florian Westphal, netdev; +Cc: Neal Cardwell, Eric Dumazet, Yuchung Cheng

On 02/13/2014 03:52 AM, Florian Westphal wrote:
>   Maybe someone has better idea?

Does tcp_rmem have to be a "hard" limit, or could it allowed to be fuzzy?

rick jones

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-13 11:52 [PATCH next resend] tcp: use zero-window when free_space is low Florian Westphal
  2014-02-13 14:58 ` Eric Dumazet
  2014-02-13 17:18 ` Rick Jones
@ 2014-02-17 19:34 ` David Miller
  2014-02-17 20:52   ` Florian Westphal
  2 siblings, 1 reply; 10+ messages in thread
From: David Miller @ 2014-02-17 19:34 UTC (permalink / raw)
  To: fw; +Cc: netdev, ncardwell, eric.dumazet, ycheng

From: Florian Westphal <fw@strlen.de>
Date: Thu, 13 Feb 2014 12:52:30 +0100

>  V1 of this patch was deferred, resending to get discussion going again.
>  Changes since v1:
>   - add reproducer to commit message
> 
>  Unfortunately I couldn't come up with something that has no magic
>  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
>  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> 
>  Maybe someone has better idea?

I know this is going to be frustrating, but I've marked this 'deferred'
again.  Please resubmit after the testing and further discussions have
been worked out.

Thank you.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-17 19:34 ` David Miller
@ 2014-02-17 20:52   ` Florian Westphal
  2014-02-18 17:30     ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Westphal @ 2014-02-17 20:52 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, ncardwell, eric.dumazet, ycheng

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Thu, 13 Feb 2014 12:52:30 +0100
> 
> >  V1 of this patch was deferred, resending to get discussion going again.
> >  Changes since v1:
> >   - add reproducer to commit message
> > 
> >  Unfortunately I couldn't come up with something that has no magic
> >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
> >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> > 
> >  Maybe someone has better idea?
> 
> I know this is going to be frustrating, but I've marked this 'deferred'
> again.  Please resubmit after the testing and further discussions have
> been worked out.

Thanks for letting me know.  I understand why you are reluctant to just
apply this.

I will submit another patch shortly that introduces snmp counter for zero-window
(what Eric suggested), perhaps it helps him or others to find a better solution
in the scenario.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-17 20:52   ` Florian Westphal
@ 2014-02-18 17:30     ` Eric Dumazet
  2014-02-18 23:12       ` Eric Dumazet
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2014-02-18 17:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, ncardwell, ycheng

On Mon, 2014-02-17 at 21:52 +0100, Florian Westphal wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Florian Westphal <fw@strlen.de>
> > Date: Thu, 13 Feb 2014 12:52:30 +0100
> > 
> > >  V1 of this patch was deferred, resending to get discussion going again.
> > >  Changes since v1:
> > >   - add reproducer to commit message
> > > 
> > >  Unfortunately I couldn't come up with something that has no magic
> > >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
> > >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> > > 
> > >  Maybe someone has better idea?
> > 
> > I know this is going to be frustrating, but I've marked this 'deferred'
> > again.  Please resubmit after the testing and further discussions have
> > been worked out.
> 
> Thanks for letting me know.  I understand why you are reluctant to just
> apply this.
> 
> I will submit another patch shortly that introduces snmp counter for zero-window
> (what Eric suggested), perhaps it helps him or others to find a better solution
> in the scenario.

Sorry for the delay guys, I was on vacation.

I started the tests this morning, results in about 2/3 hours.

Thanks !

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-18 17:30     ` Eric Dumazet
@ 2014-02-18 23:12       ` Eric Dumazet
  2014-02-19 21:36         ` David Miller
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Dumazet @ 2014-02-18 23:12 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, ncardwell, ycheng

On Tue, 2014-02-18 at 09:30 -0800, Eric Dumazet wrote:
> On Mon, 2014-02-17 at 21:52 +0100, Florian Westphal wrote:
> > David Miller <davem@davemloft.net> wrote:
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Thu, 13 Feb 2014 12:52:30 +0100
> > > 
> > > >  V1 of this patch was deferred, resending to get discussion going again.
> > > >  Changes since v1:
> > > >   - add reproducer to commit message
> > > > 
> > > >  Unfortunately I couldn't come up with something that has no magic
> > > >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
> > > >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> > > > 
> > > >  Maybe someone has better idea?
> > > 
> > > I know this is going to be frustrating, but I've marked this 'deferred'
> > > again.  Please resubmit after the testing and further discussions have
> > > been worked out.
> > 
> > Thanks for letting me know.  I understand why you are reluctant to just
> > apply this.
> > 
> > I will submit another patch shortly that introduces snmp counter for zero-window
> > (what Eric suggested), perhaps it helps him or others to find a better solution
> > in the scenario.
> 
> Sorry for the delay guys, I was on vacation.
> 
> I started the tests this morning, results in about 2/3 hours.

No regression found, feel free to add my :

Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH next resend] tcp: use zero-window when free_space is low
  2014-02-18 23:12       ` Eric Dumazet
@ 2014-02-19 21:36         ` David Miller
  0 siblings, 0 replies; 10+ messages in thread
From: David Miller @ 2014-02-19 21:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: fw, netdev, ncardwell, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 18 Feb 2014 15:12:13 -0800

> On Tue, 2014-02-18 at 09:30 -0800, Eric Dumazet wrote:
>> On Mon, 2014-02-17 at 21:52 +0100, Florian Westphal wrote:
>> > David Miller <davem@davemloft.net> wrote:
>> > > From: Florian Westphal <fw@strlen.de>
>> > > Date: Thu, 13 Feb 2014 12:52:30 +0100
>> > > 
>> > > >  V1 of this patch was deferred, resending to get discussion going again.
>> > > >  Changes since v1:
>> > > >   - add reproducer to commit message
>> > > > 
>> > > >  Unfortunately I couldn't come up with something that has no magic
>> > > >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
>> > > >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
>> > > > 
>> > > >  Maybe someone has better idea?
>> > > 
>> > > I know this is going to be frustrating, but I've marked this 'deferred'
>> > > again.  Please resubmit after the testing and further discussions have
>> > > been worked out.
>> > 
>> > Thanks for letting me know.  I understand why you are reluctant to just
>> > apply this.
>> > 
>> > I will submit another patch shortly that introduces snmp counter for zero-window
>> > (what Eric suggested), perhaps it helps him or others to find a better solution
>> > in the scenario.
>> 
>> Sorry for the delay guys, I was on vacation.
>> 
>> I started the tests this morning, results in about 2/3 hours.
> 
> No regression found, feel free to add my :
> 
> Acked-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Eric Dumazet <edumazet@google.com>

Florian please resend with Eric's tags and I'll apply.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2014-02-19 21:36 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-13 11:52 [PATCH next resend] tcp: use zero-window when free_space is low Florian Westphal
2014-02-13 14:58 ` Eric Dumazet
2014-02-13 15:34   ` Florian Westphal
2014-02-13 16:19     ` Eric Dumazet
2014-02-13 17:18 ` Rick Jones
2014-02-17 19:34 ` David Miller
2014-02-17 20:52   ` Florian Westphal
2014-02-18 17:30     ` Eric Dumazet
2014-02-18 23:12       ` Eric Dumazet
2014-02-19 21:36         ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).