* TCP delayed ACK heuristic
       [not found] <270756364.27707018.1355842632348.JavaMail.root@redhat.com>
@ 2012-12-18 15:11 ` Cong Wang
  2012-12-18 16:30   ` Eric Dumazet
  2012-12-18 16:39   ` David Laight
  0 siblings, 2 replies; 13+ messages in thread
From: Cong Wang @ 2012-12-18 15:11 UTC (permalink / raw)
  To: netdev
  Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger,
	Rick Jones, Thomas Graf

Hello, TCP experts!

Some time ago, Ben sent a patch [1] to add some knobs for
tuning TCP delayed ACK, but it was rejected by David.

David's point is that we can use heuristics for TCP
delayed ACK instead, so the question is: what kind of
heuristics can we use?

RFC1122 explicitly mentions:

            A TCP SHOULD implement a delayed ACK, but an ACK should not
            be excessively delayed; in particular, the delay MUST be
            less than 0.5 seconds, and in a stream of full-sized
            segments there SHOULD be an ACK for at least every second
            segment.

so this prevents us from using any heuristic for the number
of coalesced delayed ACKs.

For the timeout of a delayed ACK, my idea is to estimate how many
packets we would expect to receive if the TCP stream were fully
utilized, something like below:

+static inline u32 tcp_expect_packets(struct sock *sk)
+{
+       struct tcp_sock *tp = tcp_sk(sk);
+       int rtt = tp->srtt >> 3;
+       u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime;
+
+       return idle * 2 / rtt;
+}
...
+       ato -= tcp_expect_packets(sk) * delta;


The more we expect, the less we should delay. However, this is
not accurate because of congestion control.

Meanwhile, we can also check how many packets are pending in the TCP
send queue: the more data is pending, the more we can piggyback on
a single ACK, but not beyond what we are able to send at
that time.
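
To make the arithmetic above concrete, here is a standalone user-space
sketch (illustration only; the function names, the millisecond units
and the clamping to zero are my own, and do not correspond to actual
kernel code):

#include <stdio.h>

/*
 * Illustration only (not kernel code): estimate how many segments we
 * would expect to have received during the idle time since the last
 * received packet if the peer were sending about one window per RTT,
 * and use that to shrink the delayed-ACK timeout.  Names and the use
 * of milliseconds are made up to keep the example self-contained; the
 * kernel works in jiffies with a scaled srtt.
 */
static unsigned int expect_packets(unsigned int srtt_ms, unsigned int idle_ms)
{
        if (srtt_ms == 0)               /* RTT not measured yet */
                return 0;
        return idle_ms * 2 / srtt_ms;
}

static unsigned int adjust_ato(unsigned int ato_ms, unsigned int srtt_ms,
                               unsigned int idle_ms, unsigned int delta_ms)
{
        unsigned int cut = expect_packets(srtt_ms, idle_ms) * delta_ms;

        /* the more segments we expect, the less we delay the ACK */
        return cut >= ato_ms ? 0 : ato_ms - cut;
}

int main(void)
{
        /* e.g. 40ms delayed-ACK timer, 10ms RTT, 30ms since last packet */
        printf("ato becomes %u ms\n", adjust_ato(40, 10, 30, 4));
        return 0;
}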

Comments? Ideas?

Thanks.

1. http://thread.gmane.org/gmane.linux.network/233859


* Re: TCP delayed ACK heuristic
  2012-12-18 15:11 ` TCP delayed ACK heuristic Cong Wang
@ 2012-12-18 16:30   ` Eric Dumazet
  2012-12-19  6:54     ` Cong Wang
  2012-12-18 16:39   ` David Laight
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-12-18 16:30 UTC (permalink / raw)
  To: Cong Wang
  Cc: netdev, Ben Greear, David Miller, Stephen Hemminger, Rick Jones,
	Thomas Graf

On Tue, 2012-12-18 at 10:11 -0500, Cong Wang wrote:
> Hello, TCP experts!
> 
> Some time ago, Ben sent a patch [1] to add some knobs for
> tuning TCP delayed ACK, but it was rejected by David.
> 
> David's point is that we can use heuristics for TCP
> delayed ACK instead, so the question is: what kind of
> heuristics can we use?
> 
> RFC1122 explicitly mentions:
> 
>             A TCP SHOULD implement a delayed ACK, but an ACK should not
>             be excessively delayed; in particular, the delay MUST be
>             less than 0.5 seconds, and in a stream of full-sized
>             segments there SHOULD be an ACK for at least every second
>             segment.
> 
> so this prevents us from using any heuristic for the number
> of coalesced delayed ACKs.
> 
> For the timeout of a delayed ACK, my idea is to estimate how many
> packets we would expect to receive if the TCP stream were fully
> utilized, something like below:
> 
> +static inline u32 tcp_expect_packets(struct sock *sk)
> +{
> +       struct tcp_sock *tp = tcp_sk(sk);
> +       int rtt = tp->srtt >> 3;
> +       u32 idle = tcp_time_stamp - inet_csk(sk)->icsk_ack.lrcvtime;
> +
> +       return idle * 2 / rtt;
> +}
> ...
> +       ato -= tcp_expect_packets(sk) * delta;
> 
> 
> The more we expect, the less we should delay. However, this is
> not accurate because of congestion control.
> 
> Meanwhile, we can also check how many packets are pending in the TCP
> send queue: the more data is pending, the more we can piggyback on
> a single ACK, but not beyond what we are able to send at
> that time.
> 
> Comments? Ideas?
> 

ACKs might also be delayed because of bidirectional traffic, and that
is controlled more by the application response time. The TCP stack
cannot easily estimate it.

If you focus on bulk receive, LRO/GRO should already lower the number
of ACKs to an acceptable level without major disruption.

Stretch ACKs are not only a receiver concern; there are issues for the
sender that you cannot always control/change.

I recommend reading RFC 2525, section 2.13.


* RE: TCP delayed ACK heuristic
  2012-12-18 15:11 ` TCP delayed ACK heuristic Cong Wang
  2012-12-18 16:30   ` Eric Dumazet
@ 2012-12-18 16:39   ` David Laight
  2012-12-18 17:54     ` Rick Jones
  2012-12-19  7:00     ` Cong Wang
  1 sibling, 2 replies; 13+ messages in thread
From: David Laight @ 2012-12-18 16:39 UTC (permalink / raw)
  To: Cong Wang, netdev
  Cc: Ben Greear, David Miller, Eric Dumazet, Stephen Hemminger,
	Rick Jones, Thomas Graf

> David's point is that we can use heuristics for TCP
> delayed ACK instead, so the question is: what kind of
> heuristics can we use?
> 
> RFC1122 explicitly mentions:
> 
>             A TCP SHOULD implement a delayed ACK, but an ACK should not
>             be excessively delayed; in particular, the delay MUST be
>             less than 0.5 seconds, and in a stream of full-sized
>             segments there SHOULD be an ACK for at least every second
>             segment.
> 
> so this prevents us from using any heuristic for the number
> of coalesced delayed ACKs.

There are problems with only implementing the acks
specified by RFC1122.

I've seen problems when the sending side is doing (I think)
'slow start' with Nagle disabled.
The sender would only send 4 segments before waiting for an
ACK - even when it had more than a full sized segment waiting.
Sender was Linux 2.6.something (probably low 20s).
I changed the application flow to send data in the reverse
direction to avoid the problem.
That was on a ~0 delay local connection - which means that
there is almost never outstanding data, and the 'slow start'
happened almost all the time.
Nagle is completely the wrong algorithm for the data flow.

	David


* Re: TCP delayed ACK heuristic
  2012-12-18 16:39   ` David Laight
@ 2012-12-18 17:54     ` Rick Jones
  2012-12-19  9:52       ` David Laight
  2012-12-19  7:00     ` Cong Wang
  1 sibling, 1 reply; 13+ messages in thread
From: Rick Jones @ 2012-12-18 17:54 UTC (permalink / raw)
  To: David Laight
  Cc: Cong Wang, netdev, Ben Greear, David Miller, Eric Dumazet,
	Stephen Hemminger, Thomas Graf

On 12/18/2012 08:39 AM, David Laight wrote:
> There are problems with only implementing the acks
> specified by RFC1122.
>
> I've seen problems when the sending side is doing (I think)
> 'slow start' with Nagle disabled.
> The sender would only send 4 segments before waiting for an
> ACK - even when it had more than a full sized segment waiting.
> Sender was Linux 2.6.something (probably low 20s).
> I changed the application flow to send data in the reverse
> direction to avoid the problem.
> That was on a ~0 delay local connection - which means that
> there is almost never outstanding data, and the 'slow start'
> happened almost all the time.
> Nagle is completely the wrong algorithm for the data flow.

If Nagle was already disabled, why the last sentence?  And from your 
description, even if Nagle were enabled, I would think that it was 
remote ACK+cwnd behaviour getting in your way, not Nagle, given that 
Nagle is decided on a user-send by user-send basis and releases 
queued data (to the mercies of other heuristics) when it gets to be an 
MSS-worth.

The joys of intertwined heuristics I suppose.

Personally, I would love for there to be a way to have a cwnd's 
byte-limit's-worth of small segments outstanding at one time - it would 
make my netperf-life much easier as I could get rid of the netperf-level 
congestion window intended to keep successive requests (with Nagle 
already disabled) from getting coalesced by cwnd in a "burst-mode" test. 
* And perhaps make things nicer for the test when there is the 
occasional retransmission.  I used to think that netperf was just 
"unique" in that regard, but it sounds like you have an actual 
application looking to do that??

rick jones

* because I am trying to (ab)use the burst-mode TCP_RR test for a 
maximum packets-per-second through the stack+NIC measurement that isn't 
also a context-switching benchmark. But I cannot really come up with a 
real-world rationale to support further cwnd behaviour changes. 
Allowing a byte-limit-cwnd's worth of single-byte-payload TCP segments 
could easily be seen as being rather anti-social :)  And 
forcing/maintaining the original segment boundaries in retransmissions 
for small packets isn't such a hot idea either.


* Re: TCP delayed ACK heuristic
  2012-12-18 16:30   ` Eric Dumazet
@ 2012-12-19  6:54     ` Cong Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Cong Wang @ 2012-12-19  6:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Ben Greear, David Miller, Stephen Hemminger, Rick Jones,
	Thomas Graf

On Tue, 2012-12-18 at 08:30 -0800, Eric Dumazet wrote:
>  
> 
> ACKs might also be delayed because of bidirectional traffic, and that
> is controlled more by the application response time. The TCP stack
> cannot easily estimate it.

So we still need a knob?

> 
> If you focus on bulk receive, LRO/GRO should already lower the number
> of ACKs to an acceptable level without major disruption.

Indeed.

> 
> Stretch ACKs are not only a receiver concern; there are issues for the
> sender that you cannot always control/change.
> 
> I recommend reading RFC 2525, section 2.13.
> 

Very helpful information!

On the sender's side, it would need to "notify" the receiver not to send
stretch ACKs while it is in slow start. But I think the receiver could
also detect slow start on its own (based on the window size?).
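
Just to make that concrete, a purely hypothetical receiver-side guess
could look like the following (invented names and an arbitrary growth
threshold, not existing kernel code):

#include <stdbool.h>

/*
 * Hypothetical sketch: guess that the peer is still in slow start if
 * the number of segments seen in the last RTT grew by at least ~50%
 * compared with the previous RTT, and in that case ACK every segment
 * instead of delaying.  The threshold is arbitrary.
 */
static bool peer_looks_like_slow_start(unsigned int segs_prev_rtt,
                                       unsigned int segs_this_rtt)
{
        if (segs_prev_rtt == 0)
                return true;            /* connection just started */
        return segs_this_rtt >= segs_prev_rtt + segs_prev_rtt / 2;
}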

Thanks!


* RE: TCP delayed ACK heuristic
  2012-12-18 16:39   ` David Laight
  2012-12-18 17:54     ` Rick Jones
@ 2012-12-19  7:00     ` Cong Wang
  2012-12-19 18:39       ` Rick Jones
  1 sibling, 1 reply; 13+ messages in thread
From: Cong Wang @ 2012-12-19  7:00 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, Ben Greear, David Miller, Eric Dumazet,
	Stephen Hemminger, Rick Jones, Thomas Graf

On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote:
> There are problems with only implementing the acks
> specified by RFC1122. 

Yeah, the question is whether we can violate this RFC to get better
performance, or whether that is just a no-no.

Although RFC 2525 mentions this as a "Stretch ACK Violation", I am still
not sure whether that means we can legitimately violate RFC 1122.

Thanks.


* RE: TCP delayed ACK heuristic
  2012-12-18 17:54     ` Rick Jones
@ 2012-12-19  9:52       ` David Laight
  0 siblings, 0 replies; 13+ messages in thread
From: David Laight @ 2012-12-19  9:52 UTC (permalink / raw)
  To: Rick Jones
  Cc: Cong Wang, netdev, Ben Greear, David Miller, Eric Dumazet,
	Stephen Hemminger, Thomas Graf

> > I've seen problems when the sending side is doing (I think)
> > 'slow start' with Nagle disabled.
> > The sender would only send 4 segments before waiting for an
> > ACK - even when it had more than a full sized segment waiting.
> > Sender was Linux 2.6.something (probably low 20s).
> > I changed the application flow to send data in the reverse
> > direction to avoid the problem.
> > That was on a ~0 delay local connection - which means that
> > there is almost never outstanding data, and the 'slow start'
> > happened almost all the time.
> > Nagle is completely the wrong algorithm for the data flow.
> 
> If Nagle was already disabled, why the last sentence?  And from your
> description, even if Nagle were enabled, I would think that it was
> remote ACK+cwnd behaviour getting in your way, not Nagle, given that
> Nagle is decided on a user-send by user-send basis and releases
> queued data (to the mercies of other heuristics) when it gets to be an
> MSS-worth.

With Nagle enabled the first segment is sent, and the following ones
get buffered until full segments can be sent. Although (probably)
only 4 segments will be sent (1 small and 3 full), the 3rd of these
does generate an ack.
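
For reference, the textbook Nagle rule boils down to roughly this
(a simplified sketch, not what the Linux stack literally implements):

#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified textbook Nagle decision: a full-sized segment may always
 * be sent; a small segment may be sent only if nothing is currently
 * unacknowledged, otherwise it is buffered until it can be coalesced
 * into a full MSS or the outstanding data is ACKed.
 */
static bool nagle_may_send_now(size_t queued_bytes, size_t mss,
                               bool unacked_data_in_flight,
                               bool nodelay)
{
        if (nodelay)                     /* TCP_NODELAY disables Nagle */
                return true;
        if (queued_bytes >= mss)         /* full segment */
                return true;
        return !unacked_data_in_flight;  /* small segment only when idle */
}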

> ... but it sounds like you have an actual
> application looking to do that??

We are relaying data packets received over multiple SS7 signalling
links (64k HDLC) over a TCP connection. The connection will be local;
in some cases the host ethernet MAC, switch, and target cpu are all
on the same PCI(e) card (MII crossover links).
While a delay of a millisecond or two wouldn't matter (1ms is 8 byte
times), the Nagle delay is far too long - and since the data isn't
command/response, the Nagle delay would happen a lot.

One of the conformance tests managed to make the system 'busy'.
Since all it does is make one 64k channel busy it shouldn't have
been able to generate a backlog of receive data - but it managed to
get over 100 data packets unacked (app level ack).

> Allowing a byte-limit-cwnd's worth of single-byte-payload TCP segments
> could easily be seen as being rather anti-social :)

If the actual RTT is almost zero (as in our case) and the network
really shouldn't be dropping packets, then it doesn't matter.

I suspect that if the tx rate is faster than the RTT then the
'slow start' turns off and you can get a lot of small segments
in flight. But when the RTT is zero 'slow start' almost always
applies and you only send 4.

> And forcing/maintaining the original segment boundaries in
> retransmissions for small packets isn't such a hot idea either.

True, not splitting them might be useful, but then you'd need to
avoid merges.

	David


* Re: TCP delayed ACK heuristic
  2012-12-19  7:00     ` Cong Wang
@ 2012-12-19 18:39       ` Rick Jones
  2012-12-19 20:59         ` David Miller
  2012-12-19 23:08         ` Eric Dumazet
  0 siblings, 2 replies; 13+ messages in thread
From: Rick Jones @ 2012-12-19 18:39 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Laight, netdev, Ben Greear, David Miller, Eric Dumazet,
	Stephen Hemminger, Thomas Graf

On 12/18/2012 11:00 PM, Cong Wang wrote:
> On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote:
>> There are problems with only implementing the acks
>> specified by RFC1122.
>
> Yeah, the question is whether we can violate this RFC to get better
> performance, or whether that is just a no-no.
>
> Although RFC 2525 mentions this as a "Stretch ACK Violation", I am still
> not sure whether that means we can legitimately violate RFC 1122.

The term used in RFC1122 is "SHOULD" not "MUST."  Same for RFC2525 when 
it talks about "Stretch ACK Violation."   A TCP stack may have behaviour 
which differs from a SHOULD so long as there is a reasonable reason for it.

rick jones


* Re: TCP delayed ACK heuristic
  2012-12-19 18:39       ` Rick Jones
@ 2012-12-19 20:59         ` David Miller
  2012-12-20  3:23           ` Cong Wang
  2012-12-19 23:08         ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2012-12-19 20:59 UTC (permalink / raw)
  To: rick.jones2
  Cc: amwang, David.Laight, netdev, greearb, eric.dumazet, shemminger, tgraf

From: Rick Jones <rick.jones2@hp.com>
Date: Wed, 19 Dec 2012 10:39:37 -0800

> On 12/18/2012 11:00 PM, Cong Wang wrote:
>> On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote:
>>> There are problems with only implementing the acks
>>> specified by RFC1122.
>>
>> Yeah, the question is whether we can violate this RFC to get better
>> performance, or whether that is just a no-no.
>>
>> Although RFC 2525 mentions this as a "Stretch ACK Violation", I am still
>> not sure whether that means we can legitimately violate RFC 1122.
> 
> The term used in RFC1122 is "SHOULD" not "MUST."  Same for RFC2525
> when it talks about "Stretch ACK Violation."  A TCP stack may have
> behaviour which differs from a SHOULD so long as there is a reasonable
> reason for it.

Yes, but RFC2525 makes it very clear why we should not even
consider doing crap like this.

ACKs are the only information we have to detect loss.

And, for the same reasons that TCP VEGAS is fundamentally broken, we
cannot measure the pipe or some other receiver-side-visible piece of
information to determine when it's "safe" to stretch ACK.

And even if it's "safe", we should not do it, so that losses are
accurately detected and we don't spuriously retransmit.

The only way to know when the bandwidth increases is to "test" it, by
sending more and more packets until drops happen.  That's why all
successful congestion control algorithms must operate on explicitly
tested pieces of information.

Similarly, it's not really possible to universally know if it's safe
to stretch ACK or not.

Can we please drop this idea?  It has zero value and all downside as
far as I'm concerned.

Thanks.


* Re: TCP delayed ACK heuristic
  2012-12-19 18:39       ` Rick Jones
  2012-12-19 20:59         ` David Miller
@ 2012-12-19 23:08         ` Eric Dumazet
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-12-19 23:08 UTC (permalink / raw)
  To: Rick Jones
  Cc: Cong Wang, David Laight, netdev, Ben Greear, David Miller,
	Stephen Hemminger, Thomas Graf

On Wed, 2012-12-19 at 10:39 -0800, Rick Jones wrote:
> On 12/18/2012 11:00 PM, Cong Wang wrote:
> > On Tue, 2012-12-18 at 16:39 +0000, David Laight wrote:
> >> There are problems with only implementing the acks
> >> specified by RFC1122.
> >
> > Yeah, the question is whether we can violate this RFC to get better
> > performance, or whether that is just a no-no.
> >
> > Although RFC 2525 mentions this as a "Stretch ACK Violation", I am still
> > not sure whether that means we can legitimately violate RFC 1122.
> 
> The term used in RFC1122 is "SHOULD" not "MUST."  Same for RFC2525 when 
> it talks about "Stretch ACK Violation."   A TCP stack may have behaviour 
> which differs from a SHOULD so long as there is a reasonable reason for it.

Generally speaking, there are no reasonable reasons, unless you control
both sender and receiver, and the path between.

ACKs can be incredibly useful to recover from losses in a short time.

The vast majority of TCP sessions are short-lived, and we send one ACK
per received segment anyway at the beginning [1] or after retransmits,
to let the sender smoothly increase its cwnd, so an auto-tuning facility
won't help them that much.

For long and fast sessions, we have the LRO/GRO heuristic.

This leaves a fraction of flows where the ACK rate should not really
matter.


[1] This refers to the quickack mode


* Re: TCP delayed ACK heuristic
  2012-12-19 20:59         ` David Miller
@ 2012-12-20  3:23           ` Cong Wang
  2012-12-20  9:57             ` David Laight
  0 siblings, 1 reply; 13+ messages in thread
From: Cong Wang @ 2012-12-20  3:23 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, David.Laight, netdev, greearb, eric.dumazet,
	shemminger, tgraf

On Wed, 2012-12-19 at 12:59 -0800, David Miller wrote:
> 
> Yes, but RFC2525 makes it very clear why we should not even
> consider doing crap like this.
> 
> ACKs are the only information we have to detect loss.
> 
> And, for the same reasons that TCP VEGAS is fundamentally broken, we
> cannot measure the pipe or some other receiver-side-visible piece of
> information to determine when it's "safe" to stretch ACK.
> 
> And even if it's "safe", we should not do it, so that losses are
> accurately detected and we don't spuriously retransmit.
> 
> The only way to know when the bandwidth increases is to "test" it, by
> sending more and more packets until drops happen.  That's why all
> successful congestion control algorithms must operate on explicitly
> tested pieces of information.
> 
> Similarly, it's not really possible to universally know if it's safe
> to stretch ACK or not.

Sounds reasonable. Thanks for your explanation.

> 
> Can we please drop this idea?  It has zero value and all downside as
> far as I'm concerned.
> 

Yeah, I am just trying to see if there is any way to get a reasonable
heuristic.

So, can we at least have a sysctl to control the timeout of the delayed
ACK? I mean the 40ms minimum. TCP_QUICKACK can help too, but it requires
modifying the receiver application, and it has to be set again every
time recv() is called.
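
For example, today the receiving application ends up doing something
like this (a sketch, error handling omitted):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Sketch: ask for immediate ACKs on the receive path.  TCP_QUICKACK is
 * not sticky, so it has to be re-armed after every recv().
 * Error handling omitted for brevity.
 */
static ssize_t recv_quickack(int fd, void *buf, size_t len)
{
        int one = 1;
        ssize_t n = recv(fd, buf, len, 0);

        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
        return n;
}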

Thanks!


* RE: TCP delayed ACK heuristic
  2012-12-20  3:23           ` Cong Wang
@ 2012-12-20  9:57             ` David Laight
  2012-12-20 12:41               ` Cong Wang
  0 siblings, 1 reply; 13+ messages in thread
From: David Laight @ 2012-12-20  9:57 UTC (permalink / raw)
  To: Cong Wang, David Miller
  Cc: rick.jones2, netdev, greearb, eric.dumazet, shemminger, tgraf

> So, can we at least have a sysctl to control the timeout of the delayed
> ACK? I mean the 40ms minimum. TCP_QUICKACK can help too, but it requires
> modifying the receiver application, and it has to be set again every
> time recv() is called.

A sysctl is inappropriate - it affects the entire TCP protocol stack.

You want different behaviour for different remote hosts (probably
different subnets).
In particular your local subnet is unlikely to have packet loss
and very likely to have a very low RTT.

AFAICT a lot of the recent 'tuning' has been done for web/ftp
servers that are very remote from the client. These connections
are also request-response ones - quite often with large responses.

IMHO This has been to the detriment of local connections.

	David



* RE: TCP delayed ACK heuristic
  2012-12-20  9:57             ` David Laight
@ 2012-12-20 12:41               ` Cong Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Cong Wang @ 2012-12-20 12:41 UTC (permalink / raw)
  To: David Laight
  Cc: David Miller, rick.jones2, netdev, greearb, eric.dumazet,
	shemminger, tgraf

On Thu, 2012-12-20 at 09:57 +0000, David Laight wrote:
> > So, can we at least have a sysctl to control the timeout of the delayed
> > ACK? I mean the 40ms minimum. TCP_QUICKACK can help too, but it requires
> > modifying the receiver application, and it has to be set again every
> > time recv() is called.
> 
> A sysctl is inappropriate - it affects the entire TCP protocol stack.
> 
> You want different behaviour for different remote hosts (probably
> different subnets).
> In particular your local subnet is unlikely to have packet loss
> and very likely to have a very low RTT.
> 
> AFAICT a lot of the recent 'tuning' has been done for web/ftp
> servers that are very remote from the client. These connections
> are also request-response ones - quite often with large responses.
> 
> IMHO This has been to the detriment of local connections.
> 

A customer prefers faster responses in their low-loss environment; 40ms
is not good. Of course, they are expected to know their environment when
they tune this.

Or maybe a sysctl equivalent to TCP_QUICKACK?

