* TCP and reordering
@ 2012-11-27  9:32 Saku Ytti
  2012-11-27 17:05 ` Rick Jones
  0 siblings, 1 reply; 33+ messages in thread
From: Saku Ytti @ 2012-11-27  9:32 UTC (permalink / raw)
  To: netdev

Today, due to fast retransmit, performance on links which cause
reordering is appalling.

Is this too esoteric a situation to handle gracefully? Couldn't we
maintain a 'reorder' counter in the socket, incremented when we get two
copies of the same packet after a duplicate ACK? If this counter is
sufficiently high in relation to packet loss, we could start delaying
duplicate ACKs, as we'd expect to receive the missing sequence very soon.


--
  ++ytti


* Re: TCP and reordering
  2012-11-27  9:32 TCP and reordering Saku Ytti
@ 2012-11-27 17:05 ` Rick Jones
  2012-11-27 17:15   ` Saku Ytti
  0 siblings, 1 reply; 33+ messages in thread
From: Rick Jones @ 2012-11-27 17:05 UTC (permalink / raw)
  To: Saku Ytti; +Cc: netdev

On 11/27/2012 01:32 AM, Saku Ytti wrote:
> Today, due to fast retransmit, performance on links which cause
> reordering is appalling.
>
> Is this too esoteric a situation to handle gracefully? Couldn't we
> maintain a 'reorder' counter in the socket, incremented when we get two
> copies of the same packet after a duplicate ACK? If this counter is
> sufficiently high in relation to packet loss, we could start delaying
> duplicate ACKs, as we'd expect to receive the missing sequence very soon.

Packet reordering is supposed to be the exception, not the rule.  Links 
which habitually/constantly introduce reordering are, in my opinion, 
broken.  Optimizing for them would be optimizing an error case.

That said, there is net.ipv4.tcp_reordering, which you can increase from
the default of three to desensitize TCP to such broken links.  That will
be on the sending rather than the receiving side, though.

rick jones


* Re: TCP and reordering
  2012-11-27 17:05 ` Rick Jones
@ 2012-11-27 17:15   ` Saku Ytti
  2012-11-27 18:00     ` Rick Jones
  2012-11-28  2:06     ` David Miller
  0 siblings, 2 replies; 33+ messages in thread
From: Saku Ytti @ 2012-11-27 17:15 UTC (permalink / raw)
  To: Rick Jones; +Cc: netdev

On 27 November 2012 19:05, Rick Jones <rick.jones2@hp.com> wrote:

> Packet reordering is supposed to be the exception, not the rule.  Links
> which habitually/constantly introduce reordering are, in my opinion, broken.
> Optimizing for them would be optimizing an error case.

TCP used to be friendly to reordering before fast retransmit
optimization was implemented.

It seems like minimal complexity to add to the TCP algorithm, and it
would dynamically work correctly depending on the situation. It is
rather slim comfort that the network should work, when it does not and
you cannot affect it.

But if the complexity is higher than I expect, then I fully agree it
makes no sense to add it. The reason reordering can happen in a modern
MPLS network is that you essentially have to duck-type your traffic;
sometimes you duck-type it wrong, and you are then calculating ECMP
hashes on incorrect values, causing packets inside a flow to take
different ports.

--
  ++ytti


* Re: TCP and reordering
  2012-11-27 17:15   ` Saku Ytti
@ 2012-11-27 18:00     ` Rick Jones
  2012-11-28  2:06     ` David Miller
  1 sibling, 0 replies; 33+ messages in thread
From: Rick Jones @ 2012-11-27 18:00 UTC (permalink / raw)
  To: Saku Ytti; +Cc: netdev

On 11/27/2012 09:15 AM, Saku Ytti wrote:
> On 27 November 2012 19:05, Rick Jones <rick.jones2@hp.com> wrote:
>
>> Packet reordering is supposed to be the exception, not the rule.
>> Links which habitually/constantly introduce reordering are, in my
>> opinion, broken. Optimizing for them would be optimizing an error
>> case.
>
> TCP used to be friendly to reordering before fast retransmit
> optimization was implemented.

It remained "friendly" to reordering even after fast retransmit was 
implemented - just not to particularly bad reordering.

And friendly is somewhat relative.  Even before fast retransmit came to 
be, TCP would immediately ACK each out-of-order segment.

> It seems like minimal complexity to add to the TCP algorithm, and it
> would dynamically work correctly depending on the situation. It is
> rather slim comfort that the network should work, when it does not and
> you cannot affect it.

It is probably considered an "ancient" text these days, but one of the 
chapter intros for The Mythical Man Month includes a quote from Ovid:

adde parvum parvo magnus acervus erit

which if recollection serves the book translated as:

add little to little and soon there will be a big pile.

> But if the complexity is higher than I expect, then I fully agree it
> makes no sense to add it. The reason reordering can happen in a modern
> MPLS network is that you essentially have to duck-type your traffic;
> sometimes you duck-type it wrong, and you are then calculating ECMP
> hashes on incorrect values, causing packets inside a flow to take
> different ports.

I appreciate that one may not always have "access" and there can be 
layer 8 and layer 9 issues involved, but if incorrect typing is the root 
cause of the reordering, treating the root cause rather than the symptom is
what should happen.  How many kludges, no matter how angelic, can fit in 
a TCP implementation?

For other reasons (CPU utilization) various stacks (HP-UX, Solaris, some
versions of MacOS) have had explicit ACK-avoidance heuristics.  They
would back off from ack-every-other to ack-every-N, N >> 2.  The
heuristics worked quite well in LAN environments and on bulk flows (e.g.
FTP, ttcp, netperf TCP_STREAM), not necessarily as well in other
environments.  One (very necessary) part of the heuristic in those
stacks is to go back to ack-every-other when necessary.  That keyed off
the standalone ACK timer - if it ever fired, the current avoidance count
would go back to 2, and the maximum allowed would be half what it was
before.  However, that took a non-trivial performance hit when there was
"tail-drop" and something that wasn't a continuous stream of traffic -
the tail got dropped, and there was nothing to cause out-of-order
arrivals to force immediate ACKs. (*)  The standalone ACK timer is then
the only thing getting us back out, which means idleness.  I worked a
number of "WAN performance problems" involving one of those stacks where
part of the solution was turning down the limits on the ACK-avoidance
heuristic by a considerable amount.  (And I say this as someone with a
fondness for the schemes.)
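
To give a feel for the shape of those heuristics, here is a purely
illustrative C sketch - this is not code from any of those stacks, the
names are invented, and the ramp-up policy is made up for the example:

struct ack_avoid_state {
	unsigned int every;	/* ACK every 'every' data segments */
	unsigned int max_every;	/* current ceiling for 'every' */
	unsigned int unacked;	/* data segments received since last ACK */
};

/* Called for each in-order data segment received; returns 1 if an ACK
 * should be sent now. */
static int rx_segment(struct ack_avoid_state *st)
{
	if (++st->unacked >= st->every) {
		st->unacked = 0;
		/* things look fine: stretch the avoidance a little,
		 * up to the current ceiling (growth policy invented
		 * for this sketch) */
		if (st->every < st->max_every)
			st->every++;
		return 1;
	}
	return 0;
}

/* Called when the standalone (delayed) ACK timer fires: fall back to
 * ack-every-other and halve the allowed ceiling, as described above. */
static void standalone_ack_timer(struct ack_avoid_state *st)
{
	st->every = 2;
	st->max_every = (st->max_every / 2 > 2) ? st->max_every / 2 : 2;
}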

I cannot say with certainty that your idea would have the same problems,
but as you look to work out a solution to propose as a patch, you will
have to keep that in mind.

rick

* Yes, the same holds true for a non-ack-avoiding setup; the heuristic
simply made it worse - especially if the sender wanted to send but had
been limited by cwnd - the ACK(s) of the head of that chunk of data
were "avoided" and so wouldn't open the cwnd, which might then have
allowed further segments to enable detection of the dropped segments.
Even without losses it also tended to interact poorly with sending TCPs
which wanted to increase the congestion window by one MSS for each ACK
rather than based on the quantity of bytes covered by the ACK.


* Re: TCP and reordering
  2012-11-27 17:15   ` Saku Ytti
  2012-11-27 18:00     ` Rick Jones
@ 2012-11-28  2:06     ` David Miller
  2012-11-28  7:26       ` Saku Ytti
  2012-11-28  7:59       ` David Woodhouse
  1 sibling, 2 replies; 33+ messages in thread
From: David Miller @ 2012-11-28  2:06 UTC (permalink / raw)
  To: saku; +Cc: rick.jones2, netdev

From: Saku Ytti <saku@ytti.fi>
Date: Tue, 27 Nov 2012 19:15:20 +0200

> TCP used to be friendly to reordering before fast retransmit
> optimization was implemented.

You're talking about 20 years ago, because that's when fast
retransmit was created.

It's not like this got added recently.

And the gains of fast retransmit far outweigh whatever strange
justification one would give for reordering packets on purpose.


* Re: TCP and reordering
  2012-11-28  2:06     ` David Miller
@ 2012-11-28  7:26       ` Saku Ytti
  2012-11-28  8:35         ` Vijay Subramanian
  2012-11-28  7:59       ` David Woodhouse
  1 sibling, 1 reply; 33+ messages in thread
From: Saku Ytti @ 2012-11-28  7:26 UTC (permalink / raw)
  To: David Miller; +Cc: rick.jones2, netdev

On (2012-11-27 21:06 -0500), David Miller wrote:

> And the gains of fast retransmit far outweigh whatever strange
> justification one would give for reordering packets on purpose.

I don't disagree. I'm not proposing to turn off fast retransmits.

My proposal (or, more accurately, question) was to add a 'reorder' counter
to sockets, which would increment when a duplicate ACK is followed by the
same sequence twice.
Then you could automatically/dynamically delay duplicate ACKs, as you'd
start to expect to receive the frames out of order, giving non-lossy
reordering links pretty much the same performance as non-lossy in-order
links.
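
To illustrate what I mean (none of this exists in the kernel; all names
and the threshold are invented for the sketch):

/* Hypothetical receiver-side state for the proposed heuristic. */
struct rx_reorder_state {
	unsigned int reorder_events;	/* after a dup ACK, the same segment
					 * arrived twice (late original plus
					 * retransmit): reordering, not loss */
	unsigned int loss_events;	/* holes that turned out to be real loss */
};

/* Decide whether to briefly hold back a duplicate ACK, expecting the
 * missing segment to arrive out of order shortly. */
static int should_delay_dupack(const struct rx_reorder_state *st)
{
	/* only when reordering clearly dominates loss on this connection */
	return st->reorder_events > 4 * (st->loss_events + 1);
}

The "4 *" ratio is an arbitrary placeholder; the point is only that the
delay would kick in dynamically, driven by the counter described above.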

There is a good amount of optimization in TCP for corner cases, and, well,
that is what a TCP stack does: it tries to work within the limitations
imposed by the network.

My main question is: am I underestimating the complexity needed to add such
a counter? Or does such a counter actually already exist? (I've not looked
at whether the netstat -s reordering counters are attributable to a
particular socket.)

-- 
  ++ytti


* Re: TCP and reordering
  2012-11-28  2:06     ` David Miller
  2012-11-28  7:26       ` Saku Ytti
@ 2012-11-28  7:59       ` David Woodhouse
  2012-11-28  8:21         ` Christoph Paasch
  2012-11-28  8:22         ` Vijay Subramanian
  1 sibling, 2 replies; 33+ messages in thread
From: David Woodhouse @ 2012-11-28  7:59 UTC (permalink / raw)
  To: David Miller; +Cc: saku, rick.jones2, netdev


On Tue, 2012-11-27 at 21:06 -0500, David Miller wrote:
> And the gains of fast retransmit far outweigh whatever strange
> justification one would give for reordering packets on purpose.

My 'strange justification' for reordering, albeit not entirely on
purpose, is that a single ADSL line at 8Mb/s down, 448Kb/s up is less
bandwidth than I had to my dorm room 16 years ago. So I bond two of
them, and naturally expect a certain amount of reordering.

I've never really done much analysis of this though, and it's never
seemed to be a problem. Then again, I don't think I get *much*
reordering. Big downloads tend to look fairly much like this:

07:36:02.272979 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67016473:67017881, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912240], length 1408
07:36:02.273478 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67017881:67019289, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912240], length 1408
07:36:02.273507 IP6 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530 > 2001:770:15f::2.http: Flags [.], ack 67019289, win 11198, options [nop,nop,TS val 1096912356 ecr 2564943119], length 0
07:36:02.274727 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67019289:67020697, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912241], length 1408
07:36:02.275151 IP6 2001:770:15f::2.http > 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530: Flags [.], seq 67020697:67022105, ack 124, win 110, options [nop,nop,TS val 2564943119 ecr 1096912241], length 1408
07:36:02.275184 IP6 2001:8b0:10b:1:e6ce:8fff:fe1f:f2c0.52530 > 2001:770:15f::2.http: Flags [.], ack 67022105, win 11198, options [nop,nop,TS val 1096912358 ecr 2564943119], length 0

I suppose it might be worse if the lines weren't running at the same
speed, and if the packets weren't running over the same path through the
telco between the ISP's LNS (which alternates one packet per line) and
my local DSLAM.

Short of going through whole dumps and looking, is there a good way to
get statistics?

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28  7:59       ` David Woodhouse
@ 2012-11-28  8:21         ` Christoph Paasch
  2012-11-28  8:22         ` Vijay Subramanian
  1 sibling, 0 replies; 33+ messages in thread
From: Christoph Paasch @ 2012-11-28  8:21 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Miller, saku, rick.jones2, netdev

On Wednesday 28 November 2012 07:59:26 David Woodhouse wrote:
> My 'strange justification' for reordering, albeit not entirely on
> purpose, is that a single ADSL line at 8Mb/s down, 448Kb/s up is less
> bandwidth than I had to my dorm room 16 years ago. So I bond two of
> them, and naturally expect a certain amount of reordering.

You might want to have a look at MultiPath TCP [1], which allows the use of
multiple interfaces for a single TCP connection. It is somewhat similar to
SCTP-CMT, with the difference that MPTCP is able to pass through today's
firewalls and NATs and does not require any modifications to the applications.


E.g., you could install MPTCP on your end host and set up an HTTP proxy on a
public web host to terminate your MPTCP session -- as servers don't (yet)
support MPTCP, you will have to terminate the MPTCP session somewhere.


Cheers,
Christoph

[1] http://multipath-tcp.org

-- 
IP Networking Lab --- http://inl.info.ucl.ac.be
MultiPath TCP in the Linux Kernel --- http://mptcp.info.ucl.ac.be
Université Catholique de Louvain

* Re: TCP and reordering
  2012-11-28  7:59       ` David Woodhouse
  2012-11-28  8:21         ` Christoph Paasch
@ 2012-11-28  8:22         ` Vijay Subramanian
  2012-11-28  9:08           ` David Woodhouse
  1 sibling, 1 reply; 33+ messages in thread
From: Vijay Subramanian @ 2012-11-28  8:22 UTC (permalink / raw)
  To: David Woodhouse; +Cc: David Miller, saku, rick.jones2, netdev

>
> Short of going through whole dumps and looking, is there a good way to
> get statistics?
>

Hi David,

I don't believe reordering is tracked on the receiver side, but on the
sender there are SNMP MIB items.
They can be tracked and viewed using nstat/netstat:

# nstat -az | grep -i reorder
TcpExtTCPFACKReorder            0                  0.0
TcpExtTCPSACKReorder            0                  0.0
TcpExtTCPRenoReorder            0                  0.0
TcpExtTCPTSReorder              0                  0.0

Regards,
Vijay


* Re: TCP and reordering
  2012-11-28  7:26       ` Saku Ytti
@ 2012-11-28  8:35         ` Vijay Subramanian
  2012-11-28  8:54           ` Saku Ytti
  0 siblings, 1 reply; 33+ messages in thread
From: Vijay Subramanian @ 2012-11-28  8:35 UTC (permalink / raw)
  To: Saku Ytti; +Cc: David Miller, rick.jones2, netdev

>
> My proposal (or, more accurately, question) was to add a 'reorder' counter
> to sockets, which would increment when a duplicate ACK is followed by the
> same sequence twice.
> Then you could automatically/dynamically delay duplicate ACKs, as you'd
> start to expect to receive the frames out of order, giving non-lossy
> reordering links pretty much the same performance as non-lossy in-order
> links.

RFC 5681 says that out-of-order packets should be acked immediately.
Please see section 4.2 for detailed reasoning.
It also explains why acks should not be delayed too much.

Also note that reordering is tracked on the sender side using the
per-flow variable tp->reordering. This measures the amount of reordering
on the connection so that fast retransmit and other loss recovery
mechanisms are not entered prematurely. Doesn't this behavior at the
sender already provide the behavior you seek?

Regards,
Vijay


* Re: TCP and reordering
  2012-11-28  8:35         ` Vijay Subramanian
@ 2012-11-28  8:54           ` Saku Ytti
  2012-11-28 18:24             ` Rick Jones
  0 siblings, 1 reply; 33+ messages in thread
From: Saku Ytti @ 2012-11-28  8:54 UTC (permalink / raw)
  To: netdev

On (2012-11-28 00:35 -0800), Vijay Subramanian wrote:

> Also note that reordering is tracked on the sender side using the per
> flow variable tp->reordering . This measures the amount of reordering
> on the connection so that
> fast retransmit and other loss recovery mechanisms are not entered
> prematurely. Doesn't this behavior at the  sender already provide the
> behavior you seek?

Sorry, I don't seem to understand what you mean. Do you mind explaining how
the sender can help to restore performance on a reordering network?

-- 
  ++ytti


* Re: TCP and reordering
  2012-11-28  8:22         ` Vijay Subramanian
@ 2012-11-28  9:08           ` David Woodhouse
  2012-11-28 11:02             ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2012-11-28  9:08 UTC (permalink / raw)
  To: Vijay Subramanian; +Cc: David Miller, saku, rick.jones2, netdev


On Wed, 2012-11-28 at 00:22 -0800, Vijay Subramanian wrote:
> 
> I don't believe reordering is tracked on the receiver side, but on the
> sender there are SNMP MIB items.
> They can be tracked and viewed using nstat/netstat:
> 
> # nstat -az | grep -i reorder
> TcpExtTCPFACKReorder            0                  0.0
> TcpExtTCPSACKReorder            0                  0.0
> TcpExtTCPRenoReorder            0                  0.0
> TcpExtTCPTSReorder              0                  0.0

Thanks. For me after a 64MiB download, I have an increase of one FACK,
one SACK and one TS reorder. So my connection probably does even less
reordering than I thought, and thus isn't particularly relevant to this
conversation. I'll shut up now and go back to playing with ATM.

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28  9:08           ` David Woodhouse
@ 2012-11-28 11:02             ` Eric Dumazet
  2012-11-28 11:49               ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2012-11-28 11:02 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev

On Wed, 2012-11-28 at 09:08 +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 00:22 -0800, Vijay Subramanian wrote:
> > 
> > I don't believe reordering is tracked on the receiver side, but on the
> > sender there are SNMP MIB items.
> > They can be tracked and viewed using nstat/netstat:
> > 
> > # nstat -az | grep -i reorder
> > TcpExtTCPFACKReorder            0                  0.0
> > TcpExtTCPSACKReorder            0                  0.0
> > TcpExtTCPRenoReorder            0                  0.0
> > TcpExtTCPTSReorder              0                  0.0
> 
> Thanks. For me after a 64MiB download, I have an increase of one FACK,
> one SACK and one TS reorder. So my connection probably does even less
> reordering than I thought, and thus isn't particularly relevant to this
> conversation. I'll shut up now and go back to playing with ATM.

But you are the receiver. A receiver should not increase these counters.


* Re: TCP and reordering
  2012-11-28 11:02             ` Eric Dumazet
@ 2012-11-28 11:49               ` David Woodhouse
  2012-11-28 12:26                 ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2012-11-28 11:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev


On Wed, 2012-11-28 at 03:02 -0800, Eric Dumazet wrote:
> > Thanks. For me after a 64MiB download, I have an increase of one FACK,
> > one SACK and one TS reorder. So my connection probably does even less
> > reordering than I thought, and thus isn't particularly relevant to this
> > conversation. I'll shut up now and go back to playing with ATM.
> 
> But you are the receiver. A receiver should not increase these counters.

I checked it on the sending side.

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28 11:49               ` David Woodhouse
@ 2012-11-28 12:26                 ` Eric Dumazet
  2012-11-28 12:39                   ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2012-11-28 12:26 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev

On Wed, 2012-11-28 at 11:49 +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 03:02 -0800, Eric Dumazet wrote:
> > > Thanks. For me after a 64MiB download, I have an increase of one FACK,
> > > one SACK and one TS reorder. So my connection probably does even less
> > > reordering than I thought, and thus isn't particularly relevant to this
> > > conversation. I'll shut up now and go back to playing with ATM.
> > 
> > But you are the receiver. A receiver should not increase these counters.
> 
> I checked it on the sending side.
> 

If you want to play with reordering effects on your bi-ADSL line,
you could install a "netem delay 3ms" qdisc on the ingress side of one
of the links.


* Re: TCP and reordering
  2012-11-28 12:26                 ` Eric Dumazet
@ 2012-11-28 12:39                   ` David Woodhouse
  2012-11-28 12:52                     ` Eric Dumazet
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2012-11-28 12:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev


On Wed, 2012-11-28 at 04:26 -0800, Eric Dumazet wrote:
> On Wed, 2012-11-28 at 11:49 +0000, David Woodhouse wrote:
> > On Wed, 2012-11-28 at 03:02 -0800, Eric Dumazet wrote:
> > > > Thanks. For me after a 64MiB download, I have an increase of one FACK,
> > > > one SACK and one TS reorder. So my connection probably does even less
> > > > reordering than I thought, and thus isn't particularly relevant to this
> > > > conversation. I'll shut up now and go back to playing with ATM.
> > > 
> > > But you are the receiver. A receiver should not increase these counters.
> > 
> > I checked it on the sending side.
> > 
> 
> If you want to play with reordering effects on your bi-ADSL line,
> you could install a "netem delay 3ms" qdisc on the ingress side of one
> of the links.

For now I'm content to observe that I don't really get much reordering
at all, which is fine.

I'll go back to looking at TSQ, and BQL for PPP. If I have to use
skb_orphan() and install a destructor of my own in order to do BQL for
PPP, that'll upset TSQ a little. Is there a way we could *chain* the
destructors... skb_clone() to put the skbs on the PPP channels' queues,
perhaps, then free the original from the PPP destructor? Or is that too
much overhead?

I've killed most of the channel queue for PPPoATM and PPPoE now, but
L2TP still has a whole load of buffering all the way through the stack
again before it really leaves the host.

(And PPPoE will still have the txqueuelen on the Ethernet device too).

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28 12:39                   ` David Woodhouse
@ 2012-11-28 12:52                     ` Eric Dumazet
  2012-11-28 15:47                       ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2012-11-28 12:52 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev

On Wed, 2012-11-28 at 12:39 +0000, David Woodhouse wrote:

> 
> I'll go back to looking at TSQ, and BQL for PPP. If I have to use
> skb_orphan() and install a destructor of my own in order to do BQL for
> PPP, that'll upset TSQ a little. Is there a way we could *chain* the
> destructors... skb_clone() to put the skbs on the PPP channels' queues,
> perhaps, then free the original from the PPP destructor? Or is that too
> much overhead?
> 
> I've killed most of the channel queue for PPPoATM and PPPoE now, but
> L2TP still has a whole load of buffering all the way through the stack
> again before it really leaves the host.
> 
> (And PPPoE will still have the txqueuelen on the Ethernet device too).
> 

BQL is nice for high speed adapters.

For slow ones, you can always stop the queue for each packet given to
start_xmit()

And restart the queue at TX completion.

Some device drivers do that (because the hardware has a single slot, no
ring buffer, not because they wanted to fight bufferbloat ;) )
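
A minimal sketch of that single-slot pattern (the 'toy' driver structure
and function names are invented for the example; netif_stop_queue(),
netif_wake_queue() and the other helpers are the real APIs):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct toy_priv {
	struct sk_buff *inflight;	/* the one packet the hardware owns */
};

static netdev_tx_t toy_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct toy_priv *priv = netdev_priv(dev);

	priv->inflight = skb;
	netif_stop_queue(dev);		/* single slot: nothing more until done */
	/* ... hand the skb to the hardware here ... */
	return NETDEV_TX_OK;
}

/* Called from the TX-completion interrupt/path. */
static void toy_tx_done(struct net_device *dev)
{
	struct toy_priv *priv = netdev_priv(dev);

	dev_kfree_skb_any(priv->inflight);
	priv->inflight = NULL;
	netif_wake_queue(dev);		/* ready for the next packet */
}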


* Re: TCP and reordering
  2012-11-28 12:52                     ` Eric Dumazet
@ 2012-11-28 15:47                       ` David Woodhouse
  2012-11-28 16:09                         ` Eric Dumazet
  2012-11-28 16:19                         ` Benjamin LaHaise
  0 siblings, 2 replies; 33+ messages in thread
From: David Woodhouse @ 2012-11-28 15:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev


On Wed, 2012-11-28 at 04:52 -0800, Eric Dumazet wrote:
> BQL is nice for high speed adapters.

For adapters with hugely deep queues, surely? There's a massive
correlation between the two, of course — but PPP over L2TP or PPPoE
ought to be included in the classification, right?

> For slow ones, you can always stop the queue for each packet given to
> start_xmit()
> 
> And restart the queue at TX completion.

Well yes, but only if we get notified of TX completion.

It's simple enough for the tty-based channels, and we can do it with a
vcc->pop() function for PPPoATM. But for PPPoE and L2TP, how do we do
it? We can install a skb destructor... but then we're stomping on TSQ's
use of the destructor by orphaning it too soon.

I'm pondering something along the lines of

	if (skb->destructor) {
		newskb = skb_clone(skb, GFP_KERNEL);
		if (newskb) {
			/* stash the original so its destructor can run later */
			skb_shinfo(newskb)->destructor_arg = skb;
			skb = newskb;
		}
	}
	skb_orphan(skb);
	skb->destructor = ppp_chan_tx_completed;


... and then ppp_chan_tx_completed can also destroy the original skb
(and hence invoke TSQ's destructor too) when the time comes. And in the
(common?) case where we don't have an existing destructor, we don't
bother with the skb_clone.

But I wish there was a nicer way to chain destructors. And no, I don't
count what GSO does. We can't use the cb here anyway since we're passing
it down the stack.

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28 15:47                       ` David Woodhouse
@ 2012-11-28 16:09                         ` Eric Dumazet
  2012-11-28 16:21                           ` David Woodhouse
  2012-11-28 16:19                         ` Benjamin LaHaise
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2012-11-28 16:09 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev

On Wed, 2012-11-28 at 15:47 +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 04:52 -0800, Eric Dumazet wrote:
> > BQL is nice for high speed adapters.
> 
> For adapters with hugely deep queues, surely? There's a massive
> correlation between the two, of course — but PPP over L2TP or PPPoE
> ought to be included in the classification, right?
> 
> > For slow ones, you can always stop the queue for each packet given to
> > start_xmit()
> > 
> > And restart the queue at TX completion.
> 
> Well yes, but only if we get notified of TX completion.
> 
> It's simple enough for the tty-based channels, and we can do it with a
> vcc->pop() function for PPPoATM. But for PPPoE and L2TP, how do we do
> it? We can install a skb destructor... but then we're stomping on TSQ's
> use of the destructor by orphaning it too soon.
> 
> I'm pondering something along the lines of
> 
> 	if (skb->destructor) {
> 		newskb = skb_clone(skb, GFP_KERNEL);
> 		if (newskb) {
> 			/* stash the original so its destructor can run later */
> 			skb_shinfo(newskb)->destructor_arg = skb;
> 			skb = newskb;
> 		}
> 	}
> 	skb_orphan(skb);
> 	skb->destructor = ppp_chan_tx_completed;
> 
> 
> ... and then ppp_chan_tx_completed can also destroy the original skb
> (and hence invoke TSQ's destructor too) when the time comes. And in the
> (common?) case where we don't have an existing destructor, we don't
> bother with the skb_clone.
> 
> But I wish there was a nicer way to chain destructors. And no, I don't
> count what GSO does. We can't use the cb here anyway since we're passing
> it down the stack.
> 

My point was that if you limit the number of in-flight packets to 1,
it's relatively easy to add glue in the priv dev data, so that you chain
the destructor without adding yet more fields to all skbs.

At start_xmit() do the following instead of skb_orphan(skb)

if (skb->destructor) {
     BUG_ON(priv->save_destructor);
     priv->save_destructor = skb->destructor;
     priv->save_sk = skb->sk;
     skb->sk = &priv->fake_sk;
}

skb->destructor = my_destructor;
/* stop the queue */
...
}


void my_destructor(struct sk_buff *skb)
{
    struct .... *priv = container_of(skb->sk, struct ..., fake_sk);

    skb->sk = priv->save_sk;
    priv->save_destructor(skb);
    priv->save_destructor = NULL;
    /* restart the queue */
    ...
}

Something like that.


* Re: TCP and reordering
  2012-11-28 15:47                       ` David Woodhouse
  2012-11-28 16:09                         ` Eric Dumazet
@ 2012-11-28 16:19                         ` Benjamin LaHaise
  2012-11-28 16:41                           ` David Woodhouse
  1 sibling, 1 reply; 33+ messages in thread
From: Benjamin LaHaise @ 2012-11-28 16:19 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Eric Dumazet, Vijay Subramanian, David Miller, saku, rick.jones2, netdev

On Wed, Nov 28, 2012 at 03:47:15PM +0000, David Woodhouse wrote:
> On Wed, 2012-11-28 at 04:52 -0800, Eric Dumazet wrote:
> > BQL is nice for high speed adapters.
> 
> For adapters with hugely deep queues, surely? There's a massive
> correlation between the two, of course — but PPP over L2TP or PPPoE
> ought to be included in the classification, right?

Possibly, but there are many setups where PPPoE/L2TP do not connect to 
the congested link directly.

> > For slow ones, you can always stop the queue for each packet given to
> > start_xmit()
> > 
> > And restart the queue at TX completion.
> 
> Well yes, but only if we get notified of TX completion.
> 
> It's simple enough for the tty-based channels, and we can do it with a
> vcc->pop() function for PPPoATM. But for PPPoE and L2TP, how do we do
> it? We can install a skb destructor... but then we're stomping on TSQ's
> use of the destructor by orphaning it too soon.
> 
> I'm pondering something along the lines of
> 
> 	if (skb->destructor) {
> 		newskb = skb_clone(skb, GFP_KERNEL);
> 		if (newskb) {
> 			/* stash the original so its destructor can run later */
> 			skb_shinfo(newskb)->destructor_arg = skb;
> 			skb = newskb;
> 		}
> 	}
> 	skb_orphan(skb);
> 	skb->destructor = ppp_chan_tx_completed;
> 
> 
> ... and then ppp_chan_tx_completed can also destroy the original skb
> (and hence invoke TSQ's destructor too) when the time comes. And in the
> (common?) case where we don't have an existing destructor, we don't
> bother with the skb_clone.

This sort of chaining of destructors is going to be very expensive in 
terms of CPU cycles.  If this does get implemented, please ensure there is 
a way to turn it off.  Specifically, I'm thinking of the access concentrator
roles for BRAS.  In many wholesale ISP setups, there are many incoming
sessions coming in over a high speed link (gigabit or greater) for which 
the access concentrator (LAC/LNS in L2TP speak) has no idea of the 
bandwidth of the link actually facing the customer.  Such systems are 
usually operated in a way to avoid ever congesting the aggregation network.  
In such setups, BQL on the L2TP/PPPoE interface only serves to increase CPU 
overhead.

That said, if there is local congestion, the benefits of BQL would be 
worthwhile to have.

> But I wish there was a nicer way to chain destructors. And no, I don't
> count what GSO does. We can't use the cb here anyway since we're passing
> it down the stack.

I think all the tunneling protocols are going to have the same problem 
here, so it deserves some thought about how to tackle the issue in a 
generic way without incurring a large amount of overhead.  This exact 
problem is one of the reasons multilink PPP often doesn't work well over 
L2TP or PPPoE as compared to its behaviour over ttys.

		-ben
-- 
"Thought is the essence of where you are now."


* Re: TCP and reordering
  2012-11-28 16:09                         ` Eric Dumazet
@ 2012-11-28 16:21                           ` David Woodhouse
  0 siblings, 0 replies; 33+ messages in thread
From: David Woodhouse @ 2012-11-28 16:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Vijay Subramanian, David Miller, saku, rick.jones2, netdev


On Wed, 2012-11-28 at 08:09 -0800, Eric Dumazet wrote:
> My point was that if you limit the number of in-flight packets to 1,
> it's relatively easy to add glue in the priv dev data, so that you chain
> the destructor without adding yet more fields to all skbs.

Hm, true. I do think we'll need to be able to have at least two packets
in-flight though. There's a fair amount of network stack to be navigated
by a skb once we release it, in the PPPoE and especially L2TP cases. The
latency involved in that, if we only allow one packet at a time, will
surely be noticeable. You're basically guaranteeing that a 'TX done' IRQ
won't have another packet in the device's queue ready to send, and you
won't be able to saturate the uplink.

I suppose your principle applies even for more than one skb at a time;
we can have an array of {skb, old_destructor} tuples in our per-channel
private state, rather than having to keep it with the skb itself... but
that's slightly less simple.

I'll go and knock something up...

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28 16:19                         ` Benjamin LaHaise
@ 2012-11-28 16:41                           ` David Woodhouse
  2012-11-28 17:08                             ` Benjamin LaHaise
  2012-11-28 17:16                             ` Eric Dumazet
  0 siblings, 2 replies; 33+ messages in thread
From: David Woodhouse @ 2012-11-28 16:41 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Eric Dumazet, Vijay Subramanian, David Miller, saku, rick.jones2, netdev


On Wed, 2012-11-28 at 11:19 -0500, Benjamin LaHaise wrote:
> On Wed, Nov 28, 2012 at 03:47:15PM +0000, David Woodhouse wrote:
> > On Wed, 2012-11-28 at 04:52 -0800, Eric Dumazet wrote:
> > > BQL is nice for high speed adapters.
> > 
> > For adapters with hugely deep queues, surely? There's a massive
> > correlation between the two, of course — but PPP over L2TP or PPPoE
> > ought to be included in the classification, right?
> 
> Possibly, but there are many setups where PPPoE/L2TP do not connect to 
> the congested link directly.

Absolutely. But in the cases where they *do* connect to the congested
link, and the packets are backing up on the *same* host, there's no
excuse for not actually knowing that and behaving appropriately :)

And even if the congested link is somewhere upstream, you'd hope that
something exists (like ECN) to let you know about it.

In the LNS case that I'm most familiar with, the LNS *does* know about
the bandwidth of each customer's ADSL line, and limits the bandwidth of
each session appropriately. It's much better to decide at the LNS which
packets to drop, than to let the telco decide. Or worse, to have the
ADSL link drop one ATM cell out of *every* packet when it's
overloaded...

> This sort of chaining of destructors is going to be very expensive in 
> terms of CPU cycles.  If this does get implemented, please ensure there is 
> a way to turn it off.

You asked that before, and I think we agreed that it would be acceptable
to use the existing CONFIG_BQL option?

I'm looking at adding ppp-channel equivalents of
netdev_{reset,sent,completed}_queue, and having the PPP channels call
them as appropriate. For some it's trivial, but in the PPPoE/L2TP cases
because we want to install destructors without stomping on TSQ it'll be
substantial enough that it should be compiled out if CONFIG_BQL isn't
enabled.

> That said, if there is local congestion, the benefits of BQL would be 
> worthwhile to have.

If there is local congestion... *or* if you have proper bandwidth
management on the link to the clients; either by knowing the bandwidth
and voluntarily limiting to it, or by something like ECN.

> > But I wish there was a nicer way to chain destructors. And no, I don't
> > count what GSO does. We can't use the cb here anyway since we're passing
> > it down the stack.
> 
> I think all the tunneling protocols are going to have the same problem 
> here, so it deserves some thought about how to tackle the issue in a 
> generic way without incurring a large amount of overhead. 

Right. There are a few cases of skb->destructor being used at different
levels of the stack where I suspect this might already be an issue, in
fact. And things like TSQ will silently be losing track of packets
because of skb_orphan, even before they've left the box.

Hah, and I note that l2tp is *already* stomping on skb->destructor for
its own purposes. So I could potentially just use its existing callback
and pretend I hadn't seen that it screws up TSQ, and leave the issue of
chaining destructors to be Someone Else's Problem™.

Actually, I think it overwrites the destructor without calling
skb_orphan() first — which will *really* upset TSQ, won't it?

>  This exact 
> problem is one of the reasons multilink PPP often doesn't work well over 
> L2TP or PPPoE as compared to its behaviour over ttys.

Another fun issue with tunnelling protocols and BQL... packets tend to
*grow* as they get encapsulated. So you might end up calling
netdev_sent_queue() with a given size, then netdev_completed_queue()
with a bigger packet later...

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28 16:41                           ` David Woodhouse
@ 2012-11-28 17:08                             ` Benjamin LaHaise
  2012-11-28 17:16                             ` Eric Dumazet
  1 sibling, 0 replies; 33+ messages in thread
From: Benjamin LaHaise @ 2012-11-28 17:08 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Eric Dumazet, Vijay Subramanian, David Miller, saku, rick.jones2, netdev

On Wed, Nov 28, 2012 at 04:41:27PM +0000, David Woodhouse wrote:
> Absolutely. But in the cases where they *do* connect to the congested
> link, and the packets are backing up on the *same* host, there's no
> excuse for not actually knowing that and behaving appropriately :)

Agreed.

> And even if the congested link is somewhere upstream, you'd hope that
> something exists (like ECN) to let you know about it.
> 
> In the LNS case that I'm most familiar with, the LNS *does* know about
> the bandwidth of each customer's ADSL line, and limits the bandwidth of
> each session appropriately. It's much better to decide at the LNS which
> packets to drop, than to let the telco decide. Or worse, to have the
> ADSL link drop one ATM cell out of *every* packet when it's
> overloaded...

I'm speaking from experience: the telcos I've dealt with (2 different
companies here in Canada) do *not* know the speed of the customer-premises
ADSL link, fed with PPPoE, behind an incoming session that a LAC receives.
The issue is that the aggregation network does not propagate that
information from the DSLAM to the LAC.  It's a big mess where the
aggregation network has a mix of ATM and L2 Ethernet switches, and much
of the gear has no support for protocols that can carry that information.

> > This sort of chaining of destructors is going to be very expensive in 
> > terms of CPU cycles.  If this does get implemented, please ensure there is 
> > a way to turn it off.
> 
> You asked that before, and I think we agreed that it would be acceptable
> to use the existing CONFIG_BQL option?

No, that would not be sufficient, as otherwise there is no means to control 
the behaviour of distribution vendor kernels -- they would most likely 
default to on.

> I'm looking at adding ppp-channel equivalents of
> netdev_{reset,sent,completed}_queue, and having the PPP channels call
> them as appropriate. For some it's trivial, but in the PPPoE/L2TP cases
> because we want to install destructors without stomping on TSQ it'll be
> substantial enough that it should be compiled out if CONFIG_BQL isn't
> enabled.

This sounds like overhead.  That said, I'd like to measure it to see what 
sort of actual effect this has on performance before passing any judgement.  
I'd be happy to put together a test setup to run anything you've come up 
with through.

> > That said, if there is local congestion, the benefits of BQL would be 
> > worthwhile to have.
> 
> If there is local congestion... *or* if you have proper bandwidth
> management on the link to the clients; either by knowing the bandwidth
> and voluntarily limiting to it, or by something like ECN.

Improved ECN support is a very good idea.

> > > But I wish there was a nicer way to chain destructors. And no, I don't
> > > count what GSO does. We can't use the cb here anyway since we're passing
> > > it down the stack.
> > 
> > I think all the tunneling protocols are going to have the same problem 
> > here, so it deserves some thought about how to tackle the issue in a 
> > generic way without incurring a large amount of overhead. 
> 
> Right. There are a few cases of skb->destructor being used at different
> levels of the stack where I suspect this might already be an issue, in
> fact. And things like TSQ will silently be losing track of packets
> because of skb_orphan, even before they've left the box.
> 
> Hah, and I note that l2tp is *already* stomping on skb->destructor for
> its own purposes. So I could potentially just use its existing callback
> and pretend I hadn't seen that it screws up TSQ, and leave the issue of
> chaining destructors to be Someone Else's Problem™.

*nod*

> Actually, I think it overwrites the destructor without calling
> skb_orphan() first — which will *really* upset TSQ, won't it?

Yes, that would defeat things.

> >  This exact 
> > problem is one of the reasons multilink PPP often doesn't work well over 
> > L2TP or PPPoE as compared to its behaviour over ttys.
> 
> Another fun issue with tunnelling protocols and BQL... packets tend to
> *grow* as they get encapsulated. So you might end up calling
> netdev_sent_queue() with a given size, then netdev_completed_queue()
> with a bigger packet later...

Oh fun!

Ultimately, we want to know about congestion as early as possible in the 
packet processing.  In the case of L2TP, it would be helpful to use the 
knowledge of the path the packet will be sent out on to correctly set the
ECN bits on the packet inside the L2TP encapsulation.  The L2TP code does 
not appear to do this at present, so this needs work.

		-ben

> -- 
> dwmw2
> 



-- 
"Thought is the essence of where you are now."


* Re: TCP and reordering
  2012-11-28 16:41                           ` David Woodhouse
  2012-11-28 17:08                             ` Benjamin LaHaise
@ 2012-11-28 17:16                             ` Eric Dumazet
  2012-11-28 18:01                               ` David Woodhouse
  2012-12-02  1:30                               ` David Woodhouse
  1 sibling, 2 replies; 33+ messages in thread
From: Eric Dumazet @ 2012-11-28 17:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
	rick.jones2, netdev

On Wed, 2012-11-28 at 16:41 +0000, David Woodhouse wrote:

> Another fun issue with tunnelling protocols and BQL... packets tend to
> *grow* as they get encapsulated. So you might end up calling
> netdev_sent_queue() with a given size, then netdev_completed_queue()
> with a bigger packet later...

It's the driver's responsibility to maintain a coherent 'bytes' value for
each transmitted/completed packet.

If a driver calls an external entity, it cannot possibly use BQL, unless
doing an approximation (bytes becomes a fixed value)
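
One way for a driver to keep the two values coherent, sketched here with
invented names (only netdev_sent_queue()/netdev_completed_queue() are the
real BQL helpers), is to record the byte count it reports at xmit time
and report exactly the same value at completion, assuming one packet in
flight as suggested earlier in the thread:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct toy_priv {
	unsigned int inflight_len;	/* bytes reported to netdev_sent_queue() */
};

static netdev_tx_t toy_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct toy_priv *priv = netdev_priv(dev);

	priv->inflight_len = skb->len;	/* length as seen at this layer */
	netdev_sent_queue(dev, priv->inflight_len);
	/* ... hand the skb to the lower layer, which may grow it ... */
	return NETDEV_TX_OK;
}

/* TX completion: report what was recorded at xmit time, not the current
 * skb->len, which encapsulation below us may have increased. */
static void toy_tx_complete(struct net_device *dev)
{
	struct toy_priv *priv = netdev_priv(dev);

	netdev_completed_queue(dev, 1, priv->inflight_len);
}

With more than one packet in flight you would keep a small per-packet
record instead, much like the {skb, orig_len, ...} tuples discussed
later in the thread.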

BQL was really something to control/limit queueing on ethernet links,
not for stacked devices, as stacked devices normally have no queue.


* Re: TCP and reordering
  2012-11-28 17:16                             ` Eric Dumazet
@ 2012-11-28 18:01                               ` David Woodhouse
  2012-11-29 15:10                                 ` Noel Grandin
  2012-12-02  1:30                               ` David Woodhouse
  1 sibling, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2012-11-28 18:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
	rick.jones2, netdev


On Wed, 2012-11-28 at 09:16 -0800, Eric Dumazet wrote:
> It's the driver's responsibility to maintain a coherent 'bytes' value for
> each transmitted/completed packet.
> 
> If a driver calls an external entity, it cannot possibly use BQL, unless
> doing an approximation (bytes becomes a fixed value)

If tracking the original destructor, I can also track the original size
before the skb got passed down the stack.

> BQL was really something to control/limit queueing on ethernet links,
> not for stacked devices, as stacked devices normally have no queue.

Stacked devices have more queue than anything else :)

-- 
dwmw2



* Re: TCP and reordering
  2012-11-28  8:54           ` Saku Ytti
@ 2012-11-28 18:24             ` Rick Jones
  2012-11-28 18:33               ` Eric Dumazet
  2012-11-28 18:44               ` Saku Ytti
  0 siblings, 2 replies; 33+ messages in thread
From: Rick Jones @ 2012-11-28 18:24 UTC (permalink / raw)
  To: Saku Ytti; +Cc: netdev

On 11/28/2012 12:54 AM, Saku Ytti wrote:
> On (2012-11-28 00:35 -0800), Vijay Subramanian wrote:
>
>> Also note that reordering is tracked on the sender side using the
>> per flow variable tp->reordering . This measures the amount of
>> reordering on the connection so that fast retransmit and other loss
>> recovery mechanisms are not entered prematurely. Doesn't this
>> behavior at the  sender already provide the behavior you seek?
>
> Sorry I don't seem to understand what you mean. Do you mind explaining how
> the sender can help to restore performance on reordering network?

tp->reordering is initialized via the sysctl net.ipv4.tcp_reordering 
which controls how anxious TCP will be to fast retransmit.

By increasing net.ipv4.tcp_reordering you make the sending TCP less 
"sensitive" to duplicate ACKs and so less sensitive to reordering 
detected by the receiver.  The receiver is generating as many 
(duplicate) ACKs as before, it is just that the sender is ignoring them 
a bit more.
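
As a purely conceptual sketch (invented names, not the actual kernel code
paths), the effect on the sender is roughly:

/* The per-connection 'reordering' value starts at the sysctl default
 * and acts as the duplicate-ACK threshold for entering fast retransmit. */
struct conn_state {
	unsigned int reordering;	/* dupack threshold, e.g. 3 */
	unsigned int dupacks;		/* consecutive duplicate ACKs seen */
};

static int should_fast_retransmit(const struct conn_state *c)
{
	return c->dupacks >= c->reordering;
}

Raising the threshold simply means more duplicate ACKs (or equivalent
SACK evidence) are needed before the sender concludes a segment was lost.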


* Re: TCP and reordering
  2012-11-28 18:24             ` Rick Jones
@ 2012-11-28 18:33               ` Eric Dumazet
  2012-11-28 18:52                 ` Vijay Subramanian
  2012-11-28 18:44               ` Saku Ytti
  1 sibling, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2012-11-28 18:33 UTC (permalink / raw)
  To: Rick Jones; +Cc: Saku Ytti, netdev

On Wed, 2012-11-28 at 10:24 -0800, Rick Jones wrote:
> On 11/28/2012 12:54 AM, Saku Ytti wrote:
> > On (2012-11-28 00:35 -0800), Vijay Subramanian wrote:
> >
> >> Also note that reordering is tracked on the sender side using the
> >> per flow variable tp->reordering . This measures the amount of
> >> reordering on the connection so that fast retransmit and other loss
> >> recovery mechanisms are not entered prematurely. Doesn't this
> >> behavior at the  sender already provide the behavior you seek?
> >
> > Sorry I don't seem to understand what you mean. Do you mind explaining how
> > the sender can help to restore performance on reordering network?
> 
> tp->reordering is initialized via the sysctl net.ipv4.tcp_reordering 
> which controls how anxious TCP will be to fast retransmit.
> 
> By increasing net.ipv4.tcp_reordering you make the sending TCP less 
> "sensitive" to duplicate ACKs and so less sensitive to reordering 
> detected by the receiver.  The receiver is generating as many 
> (duplicate) ACKs as before, it is just that the sender is ignoring them 
> a bit more.

Note that this sysctl controls the initial value of the per socket
reordering value. It _does_ increase automatically (assuming SACK is
enabled of course)


* Re: TCP and reordering
  2012-11-28 18:24             ` Rick Jones
  2012-11-28 18:33               ` Eric Dumazet
@ 2012-11-28 18:44               ` Saku Ytti
  1 sibling, 0 replies; 33+ messages in thread
From: Saku Ytti @ 2012-11-28 18:44 UTC (permalink / raw)
  To: netdev

On (2012-11-28 10:24 -0800), Rick Jones wrote:

> By increasing net.ipv4.tcp_reordering you make the sending TCP less
> "sensitive" to duplicate ACKs and so less sensitive to reordering
> detected by the receiver.  The receiver is generating as many
> (duplicate) ACKs as before, it is just that the sender is ignoring
> them a bit more.

Thanks, I didn't know this.

-- 
  ++ytti


* Re: TCP and reordering
  2012-11-28 18:33               ` Eric Dumazet
@ 2012-11-28 18:52                 ` Vijay Subramanian
  0 siblings, 0 replies; 33+ messages in thread
From: Vijay Subramanian @ 2012-11-28 18:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Rick Jones, Saku Ytti, netdev

>
> Note that this sysctl controls the initial value of the per socket
> reordering value. It _does_ increase automatically (assuming SACK is
> enabled of course)
>
>
>

I believe enabling SACK is not a requirement. Even with plain Reno,
reordering  is tracked in tp->reordering and the reordering event is
counted in LINUX_MIB_TCPRENOREORDER.
tcp_add_reno_sack()-->tcp_check_reno_reordering() is the code path.

Thanks,
Vijay


* Re: TCP and reordering
  2012-11-28 18:01                               ` David Woodhouse
@ 2012-11-29 15:10                                 ` Noel Grandin
  0 siblings, 0 replies; 33+ messages in thread
From: Noel Grandin @ 2012-11-29 15:10 UTC (permalink / raw)
  To: netdev

David Woodhouse <dwmw2 <at> infradead.org> writes:
> On Wed, 2012-11-28 at 09:16 -0800, Eric Dumazet wrote:
> > BQL was really something to control/limit queueing on ethernet links,
> > not for stacked devices, as stacked devices normally have no queue.
> 
> Stacked devices have more queue than anything else :)
> 

Surely BQL is something that should only be implemented on the device at the 
bottom of the stack, and everything above that shouldn't be bothering?


* Re: TCP and reordering
  2012-11-28 17:16                             ` Eric Dumazet
  2012-11-28 18:01                               ` David Woodhouse
@ 2012-12-02  1:30                               ` David Woodhouse
  2012-12-02  2:40                                 ` Eric Dumazet
  1 sibling, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2012-12-02  1:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
	rick.jones2, netdev


On Wed, 2012-11-28 at 09:16 -0800, Eric Dumazet wrote:
> On Wed, 2012-11-28 at 16:41 +0000, David Woodhouse wrote:
> 
> > Another fun issue with tunnelling protocols and BQL... packets tend to
> > *grow* as they get encapsulated. So you might end up calling
> > netdev_sent_queue() with a given size, then netdev_completed_queue()
> > with a bigger packet later...
> 
> It's the driver's responsibility to maintain a coherent 'bytes' value for
> each transmitted/completed packet.
> 
> If a driver calls an external entity, it cannot possibly use BQL, unless
> doing an approximation (bytes becomes a fixed value)
> 
> BQL was really something to control/limit queueing on ethernet links,
> not for stacked devices, as stacked devices normally have no queue.

Oh, of *course*... this is why my kernel would panic if I attempted to
add BQL to BR2684. The ATM low-level driver was pushing a header onto
the skb, and I ended up calling netdev_completed_queue() with a larger
'bytes' value than the one I'd called netdev_sent_queue() with. Which
leads to a BUG(), which immediately results in a panic. A moderately
suboptimal failure mode, when a nasty warning and disabling BQL on this
interface might have been nicer.

However, *perhaps* it isn't so hard to get a consistent 'bytes' value.
This version appears to work... can we use something along these lines
in the general case? What if skb_is_nonlinear()?

This probably doesn't work for L2TP where it's going to be passed down
the whole stack and get a new network header and everything. But for
BR2684 is this at least a salvageable approach?

--- net/atm/br2684.c~	2012-12-01 16:35:49.000000000 +0000
+++ net/atm/br2684.c	2012-12-02 01:18:35.216607088 +0000
@@ -180,6 +180,11 @@ static struct notifier_block atm_dev_not
 	.notifier_call = atm_dev_event,
 };
 
+static unsigned int skb_acct_len(struct sk_buff *skb)
+{
+	return skb_tail_pointer(skb) - skb_network_header(skb);
+}
+
 /* chained vcc->pop function.  Check if we should wake the netif_queue */
 static void br2684_pop(struct atm_vcc *vcc, struct sk_buff *skb)
 {
@@ -191,6 +196,8 @@ static void br2684_pop(struct atm_vcc *v
 	/* If the queue space just went up from zero, wake */
 	if (atomic_inc_return(&brvcc->qspace) == 1)
 		netif_wake_queue(brvcc->device);
+
+	netdev_completed_queue(brvcc->device, 1, skb_acct_len(skb));
 }
 
 /*
@@ -265,6 +272,7 @@ static int br2684_xmit_vcc(struct sk_buf
 			netif_wake_queue(brvcc->device);
 	}
 
+	netdev_sent_queue(brvcc->device, skb_acct_len(skb));
 	/* If this fails immediately, the skb will be freed and br2684_pop()
 	   will wake the queue if appropriate. Just return an error so that
 	   the stats are updated correctly */
@@ -710,6 +718,7 @@ static int br2684_create(void __user *ar
 		return err;
 	}
 
+	netdev_reset_queue(netdev);
 	write_lock_irq(&devs_lock);
 
 	brdev->payload = payload;




-- 
dwmw2



* Re: TCP and reordering
  2012-12-02  1:30                               ` David Woodhouse
@ 2012-12-02  2:40                                 ` Eric Dumazet
  2012-12-02  9:31                                   ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Dumazet @ 2012-12-02  2:40 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
	rick.jones2, netdev

On Sun, 2012-12-02 at 01:30 +0000, David Woodhouse wrote:

> Oh, of *course*... this is why my kernel would panic if I attempted to
> add BQL to BR2684. The ATM low-level driver was pushing a header onto
> the skb, and I ended up calling netdev_completed_queue() with a larger
> 'bytes' value than the one I'd called netdev_sent_queue() with. Which
> leads to a BUG(), which immediately results in a panic. A moderately
> suboptimal failure mode, when a nasty warning and disabling BQL on this
> interface might have been nicer.
> 
> However, *perhaps* it isn't so hard to get a consistent 'bytes' value.
> This version appears to work... can we use something along these lines
> in the general case? What if skb_is_nonlinear()?
> 
> This probably doesn't work for L2TP where it's going to be passed down
> the whole stack and get a new network header and everything. But for
> BR2684 is this at least a salvageable approach?
> 
> --- net/atm/br2684.c~	2012-12-01 16:35:49.000000000 +0000
> +++ net/atm/br2684.c	2012-12-02 01:18:35.216607088 +0000
> @@ -180,6 +180,11 @@ static struct notifier_block atm_dev_not
>  	.notifier_call = atm_dev_event,
>  };
>  
> +static unsigned int skb_acct_len(struct sk_buff *skb)
> +{
> +	return skb_tail_pointer(skb) - skb_network_header(skb);
> +}
> +
>  /* chained vcc->pop function.  Check if we should wake the netif_queue */
>  static void br2684_pop(struct atm_vcc *vcc, struct sk_buff *skb)
>  {
> @@ -191,6 +196,8 @@ static void br2684_pop(struct atm_vcc *v
>  	/* If the queue space just went up from zero, wake */
>  	if (atomic_inc_return(&brvcc->qspace) == 1)
>  		netif_wake_queue(brvcc->device);
> +
> +	netdev_completed_queue(brvcc->device, 1, skb_acct_len(skb));
>  }
>  
>  /*
> @@ -265,6 +272,7 @@ static int br2684_xmit_vcc(struct sk_buf
>  			netif_wake_queue(brvcc->device);
>  	}
>  
> +	netdev_sent_queue(brvcc->device, skb_acct_len(skb));
>  	/* If this fails immediately, the skb will be freed and br2684_pop()
>  	   will wake the queue if appropriate. Just return an error so that
>  	   the stats are updated correctly */
> @@ -710,6 +718,7 @@ static int br2684_create(void __user *ar
>  		return err;
>  	}


Apparently this driver doesn't add any features at alloc_netdev() time, so
it might work (no frags in any skb).


* Re: TCP and reordering
  2012-12-02  2:40                                 ` Eric Dumazet
@ 2012-12-02  9:31                                   ` David Woodhouse
  0 siblings, 0 replies; 33+ messages in thread
From: David Woodhouse @ 2012-12-02  9:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Benjamin LaHaise, Vijay Subramanian, David Miller, saku,
	rick.jones2, netdev


On Sat, 2012-12-01 at 18:40 -0800, Eric Dumazet wrote:
> > +static unsigned int skb_acct_len(struct sk_buff *skb)
> > +{
> > +	return skb_tail_pointer(skb) - skb_network_header(skb);
> > +}
> 
> Apparently this driver doesnt add any feature at alloc_netdev() time, so
> it might work. (no frags in any skb)

Well... as long as no driver appends stuff to the *tail* of the skb. As
long as they only *prepend* headers for the device to use, we're fine.
But I think that's fairly safe, for now.

For PPP I think I'll end up keeping a tuple of
	{ skb, orig_len, orig_destructor, orig_sk }
for each skb which is in-flight. Since the queues of those should be
*short*, and should generally complete in-order, that shouldn't be too
much overhead. It'll only need a small static array of them per-channel.

(Before complaining about a *static* array holding metadata about items
on a queue, stop thinking of the PPP channel being a queue and start
thinking of it more like a ring buffer.)

-- 
dwmw2



