* [PATCH net-2.6 0/3]: Three TCP fixes
@ 2007-12-04 16:48 Ilpo Järvinen
  2007-12-04 16:48 ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust Ilpo Järvinen
  2007-12-04 18:42 ` [PATCH net-2.6 0/3]: Three TCP fixes John Heffner
  0 siblings, 2 replies; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

Hi Dave,

Here are a couple of fixes for net-2.6. The first one is to FRTO
and may fix a corner-case bug if the hand-coded check and
!tcp_may_send_now disagree. I didn't check the differences
that carefully, so they might agree after considering what FRTO
overrides from there; in any case, it makes the code much more
obvious.

...I have yet to figure out why tcp_cwnd_down uses snd_ssthresh/2
as the lower bound even though ssthresh was already halved,
so snd_ssthresh should suffice.

...While doing this, I had to read a lot of related history;
thanks to everyone for the commit messages you bothered to write
back then :-).

-- 
 i.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust
  2007-12-04 16:48 [PATCH net-2.6 0/3]: Three TCP fixes Ilpo Järvinen
@ 2007-12-04 16:48 ` Ilpo Järvinen
  2007-12-04 16:48   ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place Ilpo Järvinen
  2007-12-05 10:21   ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust David Miller
  2007-12-04 18:42 ` [PATCH net-2.6 0/3]: Three TCP fixes John Heffner
  1 sibling, 2 replies; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen

From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Though there's little need for everything that tcp_may_send_now
does (actually, even the state had to be adjusted to pass some
checks FRTO does not want to occur), it's more robust to let it
make the decision whether sending is allowed. State adjustments
needed:
- Make sure the snd_cwnd limit is not hit in there
- Disable Nagle (if necessary) through frto_counter == 2

The result of the frto_counter check in the argument to
tcp_enter_frto_loss can just be open-coded, so there is no
need to store the previous frto_counter past
tcp_may_send_now.

In addition, returns can then be combined.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   14 +++++---------
 1 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 227977d..5f93d88 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3118,17 +3118,13 @@ static int tcp_process_frto(struct sock *sk, int flag)
 	}
 
 	if (tp->frto_counter == 1) {
-		/* Sending of the next skb must be allowed or no F-RTO */
-		if (!tcp_send_head(sk) ||
-		    after(TCP_SKB_CB(tcp_send_head(sk))->end_seq,
-				     tp->snd_una + tp->snd_wnd)) {
-			tcp_enter_frto_loss(sk, (tp->frto_counter == 1 ? 2 : 3),
-					    flag);
-			return 1;
-		}
-
+		/* tcp_may_send_now needs to see updated state */
 		tp->snd_cwnd = tcp_packets_in_flight(tp) + 2;
 		tp->frto_counter = 2;
+
+		if (!tcp_may_send_now(sk))
+			tcp_enter_frto_loss(sk, 2, flag);
+
 		return 1;
 	} else {
 		switch (sysctl_tcp_frto_response) {
-- 
1.5.0.6



* [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place
  2007-12-04 16:48 ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust Ilpo Järvinen
@ 2007-12-04 16:48   ` Ilpo Järvinen
  2007-12-04 16:48     ` [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around Ilpo Järvinen
  2007-12-05 10:21     ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place David Miller
  2007-12-05 10:21   ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust David Miller
  1 sibling, 2 replies; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen

From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

The previous location is after sacktag processing, which affects
the counters tcp_packets_in_flight depends on. This may manifest
as wrong behavior if new SACK blocks are present and the way is
clear for the call to tcp_cong_avoid, which in the case of
tcp_reno_cong_avoid bails out early because it thinks that
TCP is not limited by cwnd.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5f93d88..fef1191 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3180,6 +3180,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	}
 
 	prior_fackets = tp->fackets_out;
+	prior_in_flight = tcp_packets_in_flight(tp);
 
 	if (!(flag&FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
 		/* Window is constant, pure forward advance.
@@ -3219,8 +3220,6 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	if (!prior_packets)
 		goto no_queue;
 
-	prior_in_flight = tcp_packets_in_flight(tp);
-
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, &seq_rtt, prior_fackets);
 
-- 
1.5.0.6



* [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around
  2007-12-04 16:48   ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place Ilpo Järvinen
@ 2007-12-04 16:48     ` Ilpo Järvinen
  2007-12-05 10:26       ` David Miller
  2007-12-05 10:21     ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Ilpo Järvinen

From: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

The comment in tcp_nagle_test suggests as much. This bug is
very old; even 2.4.0 seems to have it.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_output.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1d83c65..6110459 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1157,8 +1157,7 @@ int tcp_may_send_now(struct sock *sk)
 	return (skb &&
 		tcp_snd_test(sk, skb, tcp_current_mss(sk, 1),
 			     (tcp_skb_is_last(sk, skb) ?
-			      TCP_NAGLE_PUSH :
-			      tp->nonagle)));
+			      tp->nonagle : TCP_NAGLE_PUSH)));
 }
 
 /* Trim TSO SKB to LEN bytes, put the remaining data into a new packet
-- 
1.5.0.6



* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 16:48 [PATCH net-2.6 0/3]: Three TCP fixes Ilpo Järvinen
  2007-12-04 16:48 ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust Ilpo Järvinen
@ 2007-12-04 18:42 ` John Heffner
  2007-12-04 21:10   ` Ilpo Järvinen
  2007-12-05 10:30   ` David Miller
  1 sibling, 2 replies; 18+ messages in thread
From: John Heffner @ 2007-12-04 18:42 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: David Miller, netdev

Ilpo Järvinen wrote:
> ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
> as lower bound even though the ssthresh was already halved, 
> so snd_ssthresh should suffice.

I remember this coming up at least once before, so it's probably worth a 
comment in the code.  Rate-halving attempts to actually reduce cwnd to 
half the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound 
on how far rate-halving can reduce cwnd.  See the "Bounding Parameters" 
section of <http://www.psc.edu/networking/papers/FACKnotes/current/>.
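
In round numbers, the bounds John describes work out like this (a
hypothetical sketch with invented names, not the kernel's code):

```c
/* Hypothetical sketch (names are mine, not the kernel's) of the bounds
 * described above.  On entering recovery, ssthresh is set to half the
 * prior cwnd; rate-halving may then pull cwnd down as far as
 * ssthresh/2, i.e. a quarter of the original window. */
static unsigned int recovery_ssthresh(unsigned int prior_cwnd)
{
	return prior_cwnd / 2;		/* standard halving on loss */
}

static unsigned int rate_halving_floor(unsigned int ssthresh)
{
	return ssthresh / 2;		/* the snd_ssthresh/2 bound in question */
}

/* e.g. prior cwnd 40 -> ssthresh 20, rate-halving floor 10 (= cwnd/4) */
```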

   -John


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 18:42 ` [PATCH net-2.6 0/3]: Three TCP fixes John Heffner
@ 2007-12-04 21:10   ` Ilpo Järvinen
  2007-12-04 21:17     ` John Heffner
  2007-12-05 10:30   ` David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 21:10 UTC (permalink / raw)
  To: John Heffner; +Cc: David Miller, Netdev


On Tue, 4 Dec 2007, John Heffner wrote:

> Ilpo Järvinen wrote:
> > ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
> > as lower bound even though the ssthresh was already halved, so snd_ssthresh
> > should suffice.
> 
> I remember this coming up at least once before, so it's probably worth a
> comment in the code.  Rate-halving attempts to actually reduce cwnd to half
> the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound on how far
> rate-halving can reduce cwnd.  See the "Bounding Parameters" section of
> <http://www.psc.edu/networking/papers/FACKnotes/current/>.

Thanks for the info! Sadly enough, it makes NewReno recovery quite 
inefficient when there are enough losses and a high-BDP link (in my 
case 384k/200ms, BDP-sized buffer). There might be yet another bug in 
it as well (it is still a bit unclear how the tcp variables behaved 
during my scenario, and I'll investigate further), but the reduction 
in the transfer rate is going to last longer than a short moment 
(which is used as motivation in those FACK notes). In fact, if I just 
use an RFC 2581-like setting w/o rate-halving (and experience the 
initial "pause" in sending), the ACK clock sending out new data works 
very nicely, beating rate-halving fair and square. For SACK/FACK it 
works much better because recovery finishes much earlier and slow 
start recovers cwnd quickly.


...Mind if I ask another similar one: any idea why prior_ssthresh is 
smaller (3/4 of it) than what cwnd used to be (see tcp_current_ssthresh)?


-- 
 i.


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 21:10   ` Ilpo Järvinen
@ 2007-12-04 21:17     ` John Heffner
  2007-12-04 21:26       ` Ilpo Järvinen
  2007-12-05  2:13       ` Matt Mathis
  0 siblings, 2 replies; 18+ messages in thread
From: John Heffner @ 2007-12-04 21:17 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: David Miller, Netdev, Matt Mathis

Ilpo Järvinen wrote:
> On Tue, 4 Dec 2007, John Heffner wrote:
> 
>> Ilpo Järvinen wrote:
>>> ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
>>> as lower bound even though the ssthresh was already halved, so snd_ssthresh
>>> should suffice.
>> I remember this coming up at least once before, so it's probably worth a
>> comment in the code.  Rate-halving attempts to actually reduce cwnd to half
>> the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound on how far
>> rate-halving can reduce cwnd.  See the "Bounding Parameters" section of
>> <http://www.psc.edu/networking/papers/FACKnotes/current/>.
> 
> Thanks for the info! Sadly enough it makes NewReno recovery quite 
> inefficient when there are enough losses and high BDP link (in my case 
> 384k/200ms, BDP sized buffer). There might be yet another bug in it as 
> well (it is still a bit unclear how tcp variables behaved during my 
> scenario and I'll investigate further) but reduction in the transfer 
> rate is going to last longer than a short moment (which is used as 
> motivation in those FACK notes). In fact, if I just use RFC2581 like 
> setting w/o rate-halving (and experience the initial "pause" in sending), 
> the ACK clock to send out new data works very nicely beating rate halving 
> fair and square. For SACK/FACK it works much nicer because recovery is 
> finished much earlier and slow start recovers cwnd quickly.

I believe this is exactly the reason why Matt (CC'd) and Jamshid 
abandoned this line of work in the late 90's.  In my opinion, it's 
probably not such a bad idea to use cwnd/2 as the bound.  In some 
situations, the current rate-halving code will work better, but as you 
point out, in others the cwnd is lowered too much.


> ...Mind if I ask another similar one, any idea why prior_ssthresh is 
> smaller (3/4 of it) than cwnd used to be (see tcp_current_ssthresh)?

Not sure on that one.  I'm not aware of any publications this is based 
on.  Maybe Alexey knows?

   -John


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 21:17     ` John Heffner
@ 2007-12-04 21:26       ` Ilpo Järvinen
  2007-12-05 11:17         ` Alexey Kuznetsov
  2007-12-05  2:13       ` Matt Mathis
  1 sibling, 1 reply; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-04 21:26 UTC (permalink / raw)
  To: John Heffner, Alexey Kuznetsov; +Cc: David Miller, Netdev, Matt Mathis


On Tue, 4 Dec 2007, John Heffner wrote:

> Ilpo Järvinen wrote:
> 
> > ...Mind if I ask another similar one, any idea why prior_ssthresh is smaller
> > (3/4 of it) than cwnd used to be (see tcp_current_ssthresh)?
> 
> Not sure on that one.  I'm not aware of any publications this is based on.

My theory is that it could relate to tcp_cwnd_restart and 
tcp_cwnd_application_limited using it, and the others were then just 
accidentally changed as well. Perhaps I'll have to dig once more 
through the changelog history to see if there's some clue (unless 
Alexey sheds some light on this)...

> Maybe Alexey knows?

Added Alexey.


-- 
 i.


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 21:17     ` John Heffner
  2007-12-04 21:26       ` Ilpo Järvinen
@ 2007-12-05  2:13       ` Matt Mathis
  1 sibling, 0 replies; 18+ messages in thread
From: Matt Mathis @ 2007-12-05  2:13 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: David Miller, Netdev


I can shed light on one detail:  rate-halving w/bounding parameters uses 
snd_cwnd/4 to be appropriately conservative during slow start.  Ideally cwnd 
would be saved for every transmitted segment, and during recovery, ssthresh 
and min_cwnd would be set to saved_cwnd/2.  However, since cwnd is not saved 
and during slow start it might be doubled each RTT (depending on what other 
algorithms are in effect), at the time the loss is detected cwnd/2 might 
already be at the maximum window size for the path (pipe+queue size).  This 
was observed to do some bad things on real networks, so we included another 
factor-of-two reduction, which is really only correct for slow start. 
However, this does not usually hurt the normal loss-in-congestion-avoidance 
case, unless something else unexpected goes wrong.

We considered some logic to attempt to estimate cwnd at the time a lost 
segment was sent, using the current cwnd and ssthresh (i.e. when the loss is 
detected), but it was cumbersome, not well behaved, impossible to model, and 
didn't make enough difference.

Actually, the reason we abandoned RH is that it sets cwnd from flight size 
in a really bad way.  When you have large windows and burst losses, including 
possibly a few segments with lost retransmissions, you are likely to depress 
the flight size, because either you run out of receiver window or the sending 
app can't refill the socket quickly enough when snd_una advances.  The 
consequence is that cwnd is pulled down not by losses but by other non-path 
"congestion", which takes potentially thousands of RTTs for AIMD to recover.

Unfortunately, the idea of setting cwnd from flight size has been 
standardized in RFC 2581....

This is one of the problems that I would really like to revisit, if I had the
time.
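
For reference, the standardized rule referred to above (RFC 2581,
section 3.1) sets the slow start threshold from the amount of
outstanding data; a sketch in whole segments rather than bytes:

```c
/* RFC 2581, section 3.1:  ssthresh = max(FlightSize / 2, 2 * SMSS).
 * Sketched here in whole-segment counts instead of bytes for brevity. */
static unsigned int rfc2581_ssthresh(unsigned int flight_size_segs)
{
	unsigned int half = flight_size_segs / 2;

	return half > 2 ? half : 2;	/* never below two segments */
}
```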

Thanks,
--MM--
-------------------------------------------
Matt Mathis      http://www.psc.edu/~mathis
Work:412.268.3319    Home/Cell:412.654.7529
-------------------------------------------

On Tue, 4 Dec 2007, John Heffner wrote:

> Ilpo Järvinen wrote:
>> On Tue, 4 Dec 2007, John Heffner wrote:
>> 
>>> Ilpo Järvinen wrote:
>>>> ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
>>>> as lower bound even though the ssthresh was already halved, so 
>>>> snd_ssthresh
>>>> should suffice.
>>> I remember this coming up at least once before, so it's probably worth a
>>> comment in the code.  Rate-halving attempts to actually reduce cwnd to 
>>> half
>>> the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound on how 
>>> far
>>> rate-halving can reduce cwnd.  See the "Bounding Parameters" section of
>>> <http://www.psc.edu/networking/papers/FACKnotes/current/>.
>> 
>> Thanks for the info! Sadly enough it makes NewReno recovery quite 
>> inefficient when there are enough losses and high BDP link (in my case 
>> 384k/200ms, BDP sized buffer). There might be yet another bug in it as well 
>> (it is still a bit unclear how tcp variables behaved during my scenario and 
>> I'll investigate further) but reduction in the transfer rate is going to 
>> last longer than a short moment (which is used as motivation in those FACK 
>> notes). In fact, if I just use RFC2581 like setting w/o rate-halving (and 
>> experience the initial "pause" in sending), the ACK clock to send out new 
>> data works very nicely beating rate halving fair and square. For SACK/FACK 
>> it works much nicer because recovery is finished much earlier and slow 
>> start recovers cwnd quickly.
>
> I believe this is exactly the reason why Matt (CC'd) and Jamshid abandoned 
> this line of work in the late 90's.  In my opinion, it's probably not such a 
> bad idea to use cwnd/2 as the bound.  In some situations, the current 
> rate-halving code will work better, but as you point out, in others the cwnd 
> is lowered too much.
>
>
>> ...Mind if I ask another similar one, any idea why prior_ssthresh is 
>> smaller (3/4 of it) than cwnd used to be (see tcp_current_ssthresh)?
>
> Not sure on that one.  I'm not aware of any publications this is based on. 
> Maybe Alexey knows?
>
>  -John
>


* Re: [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust
  2007-12-04 16:48 ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust Ilpo Järvinen
  2007-12-04 16:48   ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place Ilpo Järvinen
@ 2007-12-05 10:21   ` David Miller
  1 sibling, 0 replies; 18+ messages in thread
From: David Miller @ 2007-12-05 10:21 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue,  4 Dec 2007 18:48:48 +0200

> Though there's little need for everything that tcp_may_send_now
> does (actually, even the state had to be adjusted to pass some
> checks FRTO does not want to occur), it's more robust to let it
> make the decision whether sending is allowed. State adjustments
> needed:
> - Make sure the snd_cwnd limit is not hit in there
> - Disable Nagle (if necessary) through frto_counter == 2
> 
> The result of the frto_counter check in the argument to
> tcp_enter_frto_loss can just be open-coded, so there is no
> need to store the previous frto_counter past
> tcp_may_send_now.
> 
> In addition, returns can then be combined.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied.


* Re: [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place
  2007-12-04 16:48   ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place Ilpo Järvinen
  2007-12-04 16:48     ` [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around Ilpo Järvinen
@ 2007-12-05 10:21     ` David Miller
  1 sibling, 0 replies; 18+ messages in thread
From: David Miller @ 2007-12-05 10:21 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue,  4 Dec 2007 18:48:49 +0200

> The previous location is after sacktag processing, which affects
> the counters tcp_packets_in_flight depends on. This may manifest
> as wrong behavior if new SACK blocks are present and the way is
> clear for the call to tcp_cong_avoid, which in the case of
> tcp_reno_cong_avoid bails out early because it thinks that
> TCP is not limited by cwnd.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied.


* Re: [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around
  2007-12-04 16:48     ` [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around Ilpo Järvinen
@ 2007-12-05 10:26       ` David Miller
  2007-12-05 11:18         ` Ilpo Järvinen
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2007-12-05 10:26 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue,  4 Dec 2007 18:48:50 +0200

> The comment in tcp_nagle_test suggests as much. This bug is
> very old; even 2.4.0 seems to have it.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Wow, I can't believe I missed this during the TSO output
path rewrite.

I will have to double-check the history, because I stared
at these conditionals many times during that work, and I
wonder if I inverted a "!" somewhere along the line.

Thanks for the fix, applied!


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 18:42 ` [PATCH net-2.6 0/3]: Three TCP fixes John Heffner
  2007-12-04 21:10   ` Ilpo Järvinen
@ 2007-12-05 10:30   ` David Miller
  2007-12-05 11:30     ` Ilpo Järvinen
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2007-12-05 10:30 UTC (permalink / raw)
  To: jheffner; +Cc: ilpo.jarvinen, netdev

From: John Heffner <jheffner@psc.edu>
Date: Tue, 04 Dec 2007 13:42:41 -0500

> Ilpo Järvinen wrote:
> > ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
> > as lower bound even though the ssthresh was already halved, 
> > so snd_ssthresh should suffice.
> 
> I remember this coming up at least once before, so it's probably worth a 
> comment in the code.  Rate-halving attempts to actually reduce cwnd to 
> half the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound 
> on how far rate-halving can reduce cwnd.  See the "Bounding Parameters" 
> section of <http://www.psc.edu/networking/papers/FACKnotes/current/>.

I assume we're talking about the tcp_cwnd_min() usage in
tcp_cwnd_down(); I don't see where it's dividing by two
there.

Anyways, someone please enlighten me and please also cook
up a patch to add the descriptive comment :-)

Thanks!


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-04 21:26       ` Ilpo Järvinen
@ 2007-12-05 11:17         ` Alexey Kuznetsov
  0 siblings, 0 replies; 18+ messages in thread
From: Alexey Kuznetsov @ 2007-12-05 11:17 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: John Heffner, David Miller, Netdev, Matt Mathis

Hello!

> My theory is that it could relate to tcp_cwnd_restart and 
> tcp_cwnd_application_limited using it and the others are just then 
> accidently changed as well. Perhaps I'll have to dig once again to 
> changelog history to see if there's some clue (unless Alexey shed 
> some light to this)...

Yes, it is RFC 2861. There is some rationale in the preamble to section 3.
But even for this case it could be set to the full cwnd instead.

Also, I used it to calculate prior_ssthresh to restore ssthresh
in the case where we detect a false retransmission. It is not an accident.
I really mean that tcp_current_ssthresh() is a correct estimate
of the "current ssthresh", and this should be the same as in the restart
situation. Of course, the rationale of RFC 2861 does not apply
to this case _directly_. But it was at the end of a chain of syllogisms.

Look:

If we have a non-infinite ssthresh and cwnd > ssthresh, we are already
in congestion avoidance mode, and we can set prior_ssthresh to any value
in the range ssthresh...cwnd without significant changes in behaviour
wrt congestion avoidance.

ssthresh would be an obvious underestimate.

cwnd can be an overestimate, because the path is surely congested
(cwnd > ssthresh) and cwnd is not a measure of current capacity.

Behavioral changes happen when the final cwnd is too low because the pipe
was drained during recovery. We allow slow start up to ssthresh,
but the value of ssthresh should be the same as we use for the restart case.
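
For readers following along, the estimate under discussion looks roughly
like this (a simplified paraphrase of the 2.6-era tcp_current_ssthresh(),
with the CWR/Recovery state test reduced to a plain flag; treat the
details as approximate):

```c
/* Simplified paraphrase of tcp_current_ssthresh() from 2.6-era
 * include/net/tcp.h (state handling reduced to a flag here).  Outside
 * CWR/Recovery, the saved ssthresh is clamped to at least 3/4 of the
 * current cwnd, computed as cwnd/2 + cwnd/4 with shifts -- hence the
 * "3/4 of cwnd" value seen as prior_ssthresh. */
static unsigned int current_ssthresh_sketch(unsigned int snd_ssthresh,
					    unsigned int snd_cwnd,
					    int in_cwr_or_recovery)
{
	unsigned int three_quarters = (snd_cwnd >> 1) + (snd_cwnd >> 2);

	if (in_cwr_or_recovery)
		return snd_ssthresh;
	return snd_ssthresh > three_quarters ? snd_ssthresh : three_quarters;
}
```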

Alexey


* Re: [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around
  2007-12-05 10:26       ` David Miller
@ 2007-12-05 11:18         ` Ilpo Järvinen
  2007-12-05 11:33           ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-05 11:18 UTC (permalink / raw)
  To: David Miller; +Cc: Netdev


On Wed, 5 Dec 2007, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue,  4 Dec 2007 18:48:50 +0200
> 
> > The comment in tcp_nagle_test suggests as much. This bug is
> > very old; even 2.4.0 seems to have it.
> > 
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> 
> Wow, I can't believe I missed this during the TSO output
> path rewrite.
> 
> I will have to double-check the history, because I stared
> at these conditionals many times during that work, and I
> wonder if I inverted a "!" somewhere along the line.

...No need, I mostly dug this up already because I was a bit unsure 
how it should be; your conversions didn't invert anything significant,
at least before 2.4.0 :-).  ...Unless you first fixed it and then
broke it again (in your private working tree)... :-/

...If I understood the very old history correctly, this bug was introduced 
in 2.4.0-test12, which incorrectly inverted !tail to nonagle==1.

You can check include/net/tcp.h diffs from this commit:
http://www.linux-mips.org/git?p=linux.git;a=commitdiff;h=c9c06167e7933d93a6e396174c68abf242294abb

...Though it's a very large one, so I included only the relevant portion 
below.


-- 
 i.

c9c06167e7933d93a6e396174c68abf242294abb
diff --git a/include/net/tcp.h b/include/net/tcp.h
index dd8e74c..ccdff5a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -328,9 +328,13 @@ static __inline__ int tcp_sk_listen_hashfn(struct sock *sk)
 				  */
 
 #define TCP_DELACK_MAX	(HZ/5)	/* maximal time to delay before sending an ACK */
-#define TCP_DELACK_MIN	(2)	/* minimal time to delay before sending an ACK,
-				 * 2 scheduler ticks, not depending on HZ. */
-#define TCP_ATO_MIN	2
+#if HZ >= 100
+#define TCP_DELACK_MIN	(HZ/25)	/* minimal time to delay before sending an ACK */
+#define TCP_ATO_MIN	(HZ/25)
+#else
+#define TCP_DELACK_MIN	4
+#define TCP_ATO_MIN	4
+#endif
 #define TCP_RTO_MAX	(120*HZ)
 #define TCP_RTO_MIN	(HZ/5)
 #define TCP_TIMEOUT_INIT (3*HZ)	/* RFC 1122 initial RTO value	*/
@@ -688,6 +692,10 @@ static __inline__ void tcp_delack_init(struct tcp_opt *tp)
 	memset(&tp->ack, 0, sizeof(tp->ack));
 }
 
+static inline void tcp_clear_options(struct tcp_opt *tp)
+{
+ 	tp->tstamp_ok = tp->sack_ok = tp->wscale_ok = tp->snd_wscale = 0;
+}
 
 enum tcp_tw_status
 {
@@ -734,7 +742,8 @@ extern int			tcp_recvmsg(struct sock *sk,
 extern int			tcp_listen_start(struct sock *sk);
 
 extern void			tcp_parse_options(struct sk_buff *skb,
-						  struct tcp_opt *tp);
+						  struct tcp_opt *tp,
+						  int estab);
 
 /*
  *	TCP v4 functions exported for the inet6 API
@@ -997,6 +1006,9 @@ struct tcp_skb_cb {
 #define TCPCB_EVER_RETRANS	0x80	/* Ever retransmitted frame	*/
 #define TCPCB_RETRANS		(TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS)
 
+#define TCPCB_URG		0x20	/* Urgent pointer advenced here	*/
+
+#define TCPCB_AT_TAIL		(TCPCB_URG)
 
 	__u16		urg_ptr;	/* Valid w/URG flags is set.	*/
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
@@ -1134,18 +1146,19 @@ static __inline__ void tcp_minshall_update(struct tcp_opt *tp, int mss, struct s
 
 /* Return 0, if packet can be sent now without violation Nagle's rules:
    1. It is full sized.
-   2. Or it contains FIN or URG.
+   2. Or it contains FIN.
    3. Or TCP_NODELAY was set.
    4. Or TCP_CORK is not set, and all sent packets are ACKed.
       With Minshall's modification: all sent small packets are ACKed.
  */
 
-static __inline__ int tcp_nagle_check(struct tcp_opt *tp, struct sk_buff *skb, unsigned mss_now)
+static __inline__ int
+tcp_nagle_check(struct tcp_opt *tp, struct sk_buff *skb, unsigned mss_now, int nonagle)
 {
 	return (skb->len < mss_now &&
-		!(TCP_SKB_CB(skb)->flags & (TCPCB_FLAG_URG|TCPCB_FLAG_FIN)) &&
-		(tp->nonagle == 2 ||
-		 (!tp->nonagle &&
+		!(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
+		(nonagle == 2 ||
+		 (!nonagle &&
 		  tp->packets_out &&
 		  tcp_minshall_check(tp))));
 }
@@ -1154,7 +1167,7 @@ static __inline__ int tcp_nagle_check(struct tcp_opt *tp, struct sk_buff *skb, u
  * should be put on the wire right now.
  */
 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
-				   unsigned cur_mss, int tail)
+				   unsigned cur_mss, int nonagle)
 {
 	/*	RFC 1122 - section 4.2.3.4
 	 *
@@ -1180,8 +1193,8 @@ static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
 	/* Don't be strict about the congestion window for the
 	 * final FIN frame.  -DaveM
 	 */
-	return ((!tail || !tcp_nagle_check(tp, skb, cur_mss) ||
-		 skb_tailroom(skb) < 32) &&
+	return ((nonagle==1 || tp->urg_mode
+		 || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
 		((tcp_packets_in_flight(tp) < tp->snd_cwnd) ||
 		 (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
 		!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
@@ -1204,12 +1217,15 @@ static __inline__ int tcp_skb_is_last(struct sock *sk, struct sk_buff *skb)
  */
 static __inline__ void __tcp_push_pending_frames(struct sock *sk,
 						 struct tcp_opt *tp,
-						 unsigned cur_mss)
+						 unsigned cur_mss,
+						 int nonagle)
 {
 	struct sk_buff *skb = tp->send_head;
 
 	if (skb) {
-		if (!tcp_snd_test(tp, skb, cur_mss, tcp_skb_is_last(sk, skb)) ||
+		if (!tcp_skb_is_last(sk, skb))
+			nonagle = 1;
+		if (!tcp_snd_test(tp, skb, cur_mss, nonagle) ||
 		    tcp_write_xmit(sk))
 			tcp_check_probe_timer(sk, tp);
 	}
@@ -1219,7 +1235,7 @@ static __inline__ void __tcp_push_pending_frames(struct sock *sk,
 static __inline__ void tcp_push_pending_frames(struct sock *sk,
 					       struct tcp_opt *tp)
 {
-	__tcp_push_pending_frames(sk, tp, tcp_current_mss(sk));
+	__tcp_push_pending_frames(sk, tp, tcp_current_mss(sk), tp->nonagle);
 }
 
 static __inline__ int tcp_may_send_now(struct sock *sk, struct tcp_opt *tp)
@@ -1227,7 +1243,8 @@ static __inline__ int tcp_may_send_now(struct sock *sk, struct tcp_opt *tp)
 	struct sk_buff *skb = tp->send_head;
 
 	return (skb &&
-		tcp_snd_test(tp, skb, tcp_current_mss(sk), tcp_skb_is_last(sk, skb)));
+		tcp_snd_test(tp, skb, tcp_current_mss(sk),
+			     tcp_skb_is_last(sk, skb) ? 1 : tp->nonagle));
 }
 
 static __inline__ void tcp_init_wl(struct tcp_opt *tp, u32 ack, u32 seq)


* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-05 10:30   ` David Miller
@ 2007-12-05 11:30     ` Ilpo Järvinen
  2007-12-06  4:56       ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Ilpo Järvinen @ 2007-12-05 11:30 UTC (permalink / raw)
  To: David Miller; +Cc: John Heffner, Netdev


On Wed, 5 Dec 2007, David Miller wrote:

> From: John Heffner <jheffner@psc.edu>
> Date: Tue, 04 Dec 2007 13:42:41 -0500
> 
> > Ilpo Järvinen wrote:
> > > ...I'm still to figure out why tcp_cwnd_down uses snd_ssthresh/2
> > > as lower bound even though the ssthresh was already halved, 
> > > so snd_ssthresh should suffice.
> > 
> > I remember this coming up at least once before, so it's probably worth a 
> > comment in the code.  Rate-halving attempts to actually reduce cwnd to 
> > half the delivered window.  Here, cwnd/4 (ssthresh/2) is a lower bound 
> > on how far rate-halving can reduce cwnd.  See the "Bounding Parameters" 
> > section of <http://www.psc.edu/networking/papers/FACKnotes/current/>.
> 
> I assume we're talking about the tcp_cwnd_min() usage in
> tcp_cwnd_down(), I don't see where it's dividing by two
> there.

...Ah, it's because I'm not using cubic, which does not set the cwnd_min
callback. In tcp_cong.c it is:

/* Lower bound on congestion window with halving. */
u32 tcp_reno_min_cwnd(const struct sock *sk)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        return tp->snd_ssthresh/2;
}

That snd_ssthresh was already halved when CA_Recovery/CA_CWR was
entered. Thus the lower bound is not 1/2 * orig_cwnd but 1/4 of it.

> Anyways, someone please enlighten me and please also cook
> up a patch to add the descriptive comment :-)

Not sure if a too-simple patch is the correct thing to do here... Matt has 
a point regarding slow-start cwnd behavior. For FACK/SACK, which can 
estimate losses very accurately, it works very well, whereas NewReno 
discovers those losses only one by one as cumulative ACKs arrive; until 
then, the other losses are counted as outstanding segments. As a result, 
every undiscovered loss is "missing from the ACK clock", which here 
results in bad performance because the ACK clock slows down and (almost) 
dies out (i.e., cumulative ACKs keep it going, though at a far from 
impressive rate). There might be some other bug making it worse at the 
end of the recovery, but I've yet to uncover it. With SACK/FACK, the ACK 
clock keeps running at the expected rate, and if cwnd becomes < ssthresh, 
the difference is quickly regained in slow start.

...I've not yet decided on the best way to deal with it, but I'll try to 
figure something out.


-- 
 i.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around
  2007-12-05 11:18         ` Ilpo Järvinen
@ 2007-12-05 11:33           ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2007-12-05 11:33 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 5 Dec 2007 13:18:14 +0200 (EET)

> ...If I understood the very old history correctly, this bug was introduced 
> in 2.4.0-test12, which incorrectly inverted !tail into nonagle==1.
> 
> You can check include/net/tcp.h diffs from this commit:
> http://www.linux-mips.org/git?p=linux.git;a=commitdiff;h=c9c06167e7933d93a6e396174c68abf242294abb
> 
> ...Though it's very large one. So I included only the relevant portion 
> here below.

Thanks for the info; indeed, it seems it has been broken this way all
this time.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-2.6 0/3]: Three TCP fixes
  2007-12-05 11:30     ` Ilpo Järvinen
@ 2007-12-06  4:56       ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2007-12-06  4:56 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: jheffner, netdev

From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 5 Dec 2007 13:30:32 +0200 (EET)

> On Wed, 5 Dec 2007, David Miller wrote:
> 
> > Anyways, someone please enlighten me and please also cook
> > up a patch to add the descriptive comment :-)
> 
> Not sure if a too-simple patch is the correct thing to do here... Matt 
> has a point regarding slow-start cwnd behavior.
 ...
> ...I've not yet decided on the best way to deal with it, but I'll try 
> to figure something out.

Ok.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-12-06  4:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-04 16:48 [PATCH net-2.6 0/3]: Three TCP fixes Ilpo Järvinen
2007-12-04 16:48 ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust Ilpo Järvinen
2007-12-04 16:48   ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place Ilpo Järvinen
2007-12-04 16:48     ` [PATCH 3/3] [TCP]: NAGLE_PUSH seems to be a wrong way around Ilpo Järvinen
2007-12-05 10:26       ` David Miller
2007-12-05 11:18         ` Ilpo Järvinen
2007-12-05 11:33           ` David Miller
2007-12-05 10:21     ` [PATCH 2/3] [TCP]: Move prior_in_flight collect to more robust place David Miller
2007-12-05 10:21   ` [PATCH 1/3] [TCP] FRTO: Use of existing funcs make code more obvious & robust David Miller
2007-12-04 18:42 ` [PATCH net-2.6 0/3]: Three TCP fixes John Heffner
2007-12-04 21:10   ` Ilpo Järvinen
2007-12-04 21:17     ` John Heffner
2007-12-04 21:26       ` Ilpo Järvinen
2007-12-05 11:17         ` Alexey Kuznetsov
2007-12-05  2:13       ` Matt Mathis
2007-12-05 10:30   ` David Miller
2007-12-05 11:30     ` Ilpo Järvinen
2007-12-06  4:56       ` David Miller
