* TCP packet size and delivery packet decisions
       [not found] <AANLkTi=_heUejVf-wmEcKd910gVtpD7Hr=cZ_cs2Q8n9@mail.gmail.com>
@ 2010-09-07  4:20 ` ツ Leandro Melo de Sales
  2010-09-07  4:36   ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-07  4:20 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1339 bytes --]

Hi all,
   We are seeing unexpected behavior (at least for us) from the
TCP stack under Linux. We need to send 78 bytes per write() call,
but instead of sending 78 bytes the Linux TCP implementation sends
48 bytes and later 30 bytes. We started to debug the implementation,
and although we made some progress, I decided to ask on this list
whether someone can explain why TCP is not sending the 78 bytes at
once, and whether it is possible to force TCP to send 78 bytes per
write() call.
   I have tried many alternatives (setting TCP_CORK, using the socket
API to write a fixed 78 bytes per write() call, and so forth). Also, I
know that TCP is a byte-stream protocol and that this decision depends
on many factors, such as cwnd size, MSS and so forth, but 78 bytes
seems too small to split into two packets. Our application works as
expected under Windows, which sends the 78 bytes at once.

   Any comments from the list regarding this issue would be much
appreciated.

Thank you,
Leandro.

--
Leandro Melo de Sales
Professor in Computer Science at Federal University of Alagoas, Brazil
Pervasive and Embedded Computing Laboratory, UFCG
PhD candidate in Computer Science at UFCG

[-- Attachment #2: result.png --]
[-- Type: image/png, Size: 110510 bytes --]


* Re: TCP packet size and delivery packet decisions
  2010-09-07  4:20 ` TCP packet size and delivery packet decisions ツ Leandro Melo de Sales
@ 2010-09-07  4:36   ` Eric Dumazet
  2010-09-07  5:13     ` ツ Leandro Melo de Sales
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2010-09-07  4:36 UTC (permalink / raw)
  To: ツ Leandro Melo de Sales; +Cc: netdev

On Tuesday 07 September 2010 at 01:20 -0300, ツ Leandro Melo de Sales
wrote:
> Hi all,
>    We are seeing unexpected behavior (at least for us) from the
> TCP stack under Linux. We need to send 78 bytes per write() call,
> but instead of sending 78 bytes the Linux TCP implementation sends
> 48 bytes and later 30 bytes. We started to debug the implementation,
> and although we made some progress, I decided to ask on this list
> whether someone can explain why TCP is not sending the 78 bytes at
> once, and whether it is possible to force TCP to send 78 bytes per
> write() call.
>    I have tried many alternatives (setting TCP_CORK, using the socket
> API to write a fixed 78 bytes per write() call, and so forth). Also, I
> know that TCP is a byte-stream protocol and that this decision depends
> on many factors, such as cwnd size, MSS and so forth, but 78 bytes
> seems too small to split into two packets. Our application works as
> expected under Windows, which sends the 78 bytes at once.
> 
>    Any comments from the list regarding this issue would be much
> appreciated.


Could you send a tcpdump of such a session, with, say, the first 10 packets?





* Re: TCP packet size and delivery packet decisions
  2010-09-07  4:36   ` Eric Dumazet
@ 2010-09-07  5:13     ` ツ Leandro Melo de Sales
  2010-09-07  5:16       ` David Miller
  0 siblings, 1 reply; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-07  5:13 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, Sep 7, 2010 at 1:36 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 07 September 2010 at 01:20 -0300, ツ Leandro Melo de Sales
> wrote:
>> Hi all,
>>    We are seeing unexpected behavior (at least for us) from the
>> TCP stack under Linux. We need to send 78 bytes per write() call,
>> but instead of sending 78 bytes the Linux TCP implementation sends
>> 48 bytes and later 30 bytes. We started to debug the implementation,
>> and although we made some progress, I decided to ask on this list
>> whether someone can explain why TCP is not sending the 78 bytes at
>> once, and whether it is possible to force TCP to send 78 bytes per
>> write() call.
>>    I have tried many alternatives (setting TCP_CORK, using the socket
>> API to write a fixed 78 bytes per write() call, and so forth). Also, I
>> know that TCP is a byte-stream protocol and that this decision depends
>> on many factors, such as cwnd size, MSS and so forth, but 78 bytes
>> seems too small to split into two packets. Our application works as
>> expected under Windows, which sends the 78 bytes at once.
>>
>>    Any comments from the list regarding this issue would be much
>> appreciated.
>
>
> Could you send a tcpdump of such a session, with, say, the first 10 packets?
>
>

For the moment, this is all that I have; I can't use tcpdump right
now, but I think the information below is what you'd like to see:

Under Linux:

Source              Dest.               Flags
192.168.0.34    192.168.0.70    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
TSV=105155 TSER=0 WS=7
192.168.0.70    192.168.0.34    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
192.168.0.34    192.168.0.70    [ACK] Seq=1 Ack=1 Win=5840 Len=0
192.168.0.34    192.168.0.70    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=48
192.168.0.34    192.168.0.70    [PSH, ACK] Seq=49 Ack=1 Win=5840 Len=30
192.168.0.70    192.168.0.34    [ACK] Seq=1 Ack=49 Win=78 Len=0
192.168.0.70    192.168.0.34    [RST, ACK] Seq=1 Ack=79 Win=78 Len=0

Under Windows:

Source              Dest.               Flags
192.168.0.35    192.168.0.70    [SYN] Seq=0 Win=64240 Len=0 MSS=1460
192.168.0.70    192.168.0.35    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
192.168.0.35    192.168.0.70    [ACK] Seq=1 Ack=1 Win=64272 Len=0
192.168.0.35    192.168.0.70    [PSH, ACK] Seq=1 Ack=1 Win=64272 Len=78
192.168.0.70    192.168.0.35    [ACK] Seq=1 Ack=79 Win=78 Len=0
192.168.0.35    192.168.0.70    [PSH, ACK] Seq=79 Ack=1 Win=64272 Len=78
192.168.0.70    192.168.0.35    [ACK] Seq=1 Ack=157 Win=78 Len=0
192.168.0.35    192.168.0.70    [PSH, ACK] Seq=157 Ack=1 Win=64272 Len=78
192.168.0.70    192.168.0.35    [ACK] Seq=1 Ack=235 Win=78 Len=0
192.168.0.35    192.168.0.70    [FIN, ACK] Seq=235 Ack=1 Win=64272 Len=0
192.168.0.70    192.168.0.35    [FIN, ACK] Seq=1 Ack=236 Win=78 Len=0
192.168.0.35    192.168.0.70    [ACK] Seq=236 Ack=2 Win=64272 Len=0

To simplify, below is a very condensed version of the source code used
to connect to the remote server. I have tried the sendall() function
from the Python socket API and it does not help either. I have
implemented a similar program in C (a rough sketch follows the Python
version below), with the same result.

==========

import socket
from binascii import unhexlify  # this import was missing; unhexlify() is used below

host = '192.168.0.70'
port = 70

hex_CONNECT = "<78 hex bytes -- suppressed>"
hex_COMMAND1 = "<78 hex bytes -- suppressed>"
hex_COMMAND6 = "<78 hex bytes -- suppressed>"

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, port))
s.setsockopt(socket.SOL_TCP, socket.TCP_CORK, 1)  # one of the attempted workarounds
s.send(unhexlify(hex_CONNECT))   # each command is exactly 78 bytes
s.send(unhexlify(hex_COMMAND1))
s.send(unhexlify(hex_COMMAND6))
s.close()

==========
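
The C version was essentially equivalent; a minimal sketch of such a
client follows (illustrative only, not the original program -- error
handling omitted and the 78 command bytes suppressed here as well):

==========

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	unsigned char cmd_connect[78] = { /* suppressed */ };
	struct sockaddr_in srv;
	int fd, one = 1;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&srv, 0, sizeof(srv));
	srv.sin_family = AF_INET;
	srv.sin_port = htons(70);
	inet_pton(AF_INET, "192.168.0.70", &srv.sin_addr);
	connect(fd, (struct sockaddr *)&srv, sizeof(srv));
	setsockopt(fd, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));

	/* One 78-byte command per write(); still goes out as 48+30. */
	write(fd, cmd_connect, sizeof(cmd_connect));
	close(fd);
	return 0;
}

==========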

Leandro


* Re: TCP packet size and delivery packet decisions
  2010-09-07  5:13     ` ツ Leandro Melo de Sales
@ 2010-09-07  5:16       ` David Miller
  2010-09-07  5:21         ` ツ Leandro Melo de Sales
  0 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2010-09-07  5:16 UTC (permalink / raw)
  To: leandroal; +Cc: eric.dumazet, netdev

From: ツ Leandro Melo de Sales <leandroal@gmail.com>
Date: Tue, 7 Sep 2010 02:13:46 -0300

> 192.168.0.70    192.168.0.34    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78

Find out why this system is advertising a 78 byte window
with no window scaling.


* Re: TCP packet size and delivery packet decisions
  2010-09-07  5:16       ` David Miller
@ 2010-09-07  5:21         ` ツ Leandro Melo de Sales
  2010-09-07  5:30           ` David Miller
  0 siblings, 1 reply; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-07  5:21 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

On Tue, Sep 7, 2010 at 2:16 AM, David Miller <davem@davemloft.net> wrote:
> From: ツ Leandro Melo de Sales <leandroal@gmail.com>
> Date: Tue, 7 Sep 2010 02:13:46 -0300
>
>> 192.168.0.70    192.168.0.34    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
>
> Find out why this system is advertising a 78 byte window
> with no window scaling.
>

This is an embedded system, implemented in proprietary hardware, to
which I can send TCP commands to turn relays on and off. It will be
very difficult to find the reason for this behaviour. Since it only
needs to receive very small amounts of data, window scaling is
probably unnecessary.

Leandro.


* Re: TCP packet size and delivery packet decisions
  2010-09-07  5:21         ` ツ Leandro Melo de Sales
@ 2010-09-07  5:30           ` David Miller
  2010-09-07  6:02             ` ツ Leandro Melo de Sales
  2010-09-07 11:39             ` Eric Dumazet
  0 siblings, 2 replies; 23+ messages in thread
From: David Miller @ 2010-09-07  5:30 UTC (permalink / raw)
  To: leandroal; +Cc: eric.dumazet, netdev

From: ツ Leandro Melo de Sales <leandroal@gmail.com>
Date: Tue, 7 Sep 2010 02:21:42 -0300

> This is an embedded system, implemented in proprietary hardware, to
> which I can send TCP commands to turn relays on and off. It will be
> very difficult to find the reason for this behaviour. Since it only
> needs to receive very small amounts of data, window scaling is
> probably unnecessary.

The small 78 byte window is why the sending system is splitting up the
writes into smaller pieces.

I presume that the system advertises exactly a 78 byte window because
this is how large the commands are.  But this is an extremely foolish
and baroque thing to do, and it's why you are having problems.


* Re: TCP packet size and delivery packet decisions
  2010-09-07  5:30           ` David Miller
@ 2010-09-07  6:02             ` ツ Leandro Melo de Sales
  2010-09-07  6:09               ` Eric Dumazet
  2010-09-07 11:39             ` Eric Dumazet
  1 sibling, 1 reply; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-07  6:02 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev

2010/9/7 David Miller <davem@davemloft.net>:
> From: ツ Leandro Melo de Sales <leandroal@gmail.com>
> Date: Tue, 7 Sep 2010 02:21:42 -0300
>
>> This is an embedded system, implemented in proprietary hardware, to
>> which I can send TCP commands to turn relays on and off. It will be
>> very difficult to find the reason for this behaviour. Since it only
>> needs to receive very small amounts of data, window scaling is
>> probably unnecessary.
>
> The small 78 byte window is why the sending system is splitting up the
> writes into smaller pieces.
>
> I presume that the system advertises exactly a 78 byte window because
> this is how large the commands are.  But this is an extremely foolish
> and baroque thing to do, and it's why you are having problems.
>

David,
   Yes, 78 bytes is the size of each command. I reached the same
conclusion as you. In that case, to deal with this type of situation,
how about changing the TCP implementation in Linux to send the whole
packet when it is this small, or when TCP notices that the advertised
receive window never changes, or when the advertised window equals the
advertised MSS, since the receiver is effectively saying "I can receive
packets the same size as my window"?

   I'm sorry if I'm suggesting something that does not make sense, but
since the receiver advertises that it can receive packets of a size
equal to its window, splitting the packet in this case only
[unnecessarily] increases the flow completion time, or delays delivery
of the data to the application because it arrives split. As we can see
from the trace, the Linux TCP implementation does not wait for an ACK
of the first 48 bytes; it sends the remaining 30 bytes immediately
afterwards, which is effectively the same as sending 78 bytes at once,
since no sending decision is taken between the two segments.

   Regardless of this discussion, can you at least suggest a workaround
in the application layer to make TCP send the whole packet at once? As
I said, I tried TCP_CORK as suggested by Arnaldo (acme), but it does
not work for me in this case.

Leandro.


* Re: TCP packet size and delivery packet decisions
  2010-09-07  6:02             ` ツ Leandro Melo de Sales
@ 2010-09-07  6:09               ` Eric Dumazet
  2010-09-07  7:16                 ` ツ Leandro Melo de Sales
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2010-09-07  6:09 UTC (permalink / raw)
  To: ツ Leandro Melo de Sales; +Cc: David Miller, netdev

On Tuesday 07 September 2010 at 03:02 -0300, ツ Leandro Melo de Sales
wrote:
> >
> 
> David,
>    Yes, 78 bytes is the size of each command. I reached the same
> conclusion as you. In that case, to deal with this type of situation,
> how about changing the TCP implementation in Linux to send the whole
> packet when it is this small, or when TCP notices that the advertised
> receive window never changes, or when the advertised window equals the
> advertised MSS, since the receiver is effectively saying "I can receive
> packets the same size as my window"?
> 
>    I'm sorry if I'm suggesting something that does not make sense, but
> since the receiver advertises that it can receive packets of a size
> equal to its window, splitting the packet in this case only
> [unnecessarily] increases the flow completion time, or delays delivery
> of the data to the application because it arrives split. As we can see
> from the trace, the Linux TCP implementation does not wait for an ACK
> of the first 48 bytes; it sends the remaining 30 bytes immediately
> afterwards, which is effectively the same as sending 78 bytes at once,
> since no sending decision is taken between the two segments.
> 
>    Regardless of this discussion, can you at least suggest a workaround
> in the application layer to make TCP send the whole packet at once? As
> I said, I tried TCP_CORK as suggested by Arnaldo (acme), but it does
> not work for me in this case.
> 

Is this a critical problem for you?

If you give us at least some minutes, or hours, or even days to think
about this corner case, I am sure we can find a solution :)

There are two factors here: MSS=78 and WIN=78

Thanks




* Re: TCP packet size and delivery packet decisions
  2010-09-07  6:09               ` Eric Dumazet
@ 2010-09-07  7:16                 ` ツ Leandro Melo de Sales
  2010-09-07  7:32                   ` Eric Dumazet
  0 siblings, 1 reply; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-07  7:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Thiago Luiz

On Tue, Sep 7, 2010 at 3:09 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday 07 September 2010 at 03:02 -0300, ツ Leandro Melo de Sales
> wrote:
>> >
>>
>> David,
>>    Yes, 78 bytes is the size of each command. I reached the same
>> conclusion as you. In that case, to deal with this type of situation,
>> how about changing the TCP implementation in Linux to send the whole
>> packet when it is this small, or when TCP notices that the advertised
>> receive window never changes, or when the advertised window equals the
>> advertised MSS, since the receiver is effectively saying "I can receive
>> packets the same size as my window"?
>>
>>    I'm sorry if I'm suggesting something that does not make sense, but
>> since the receiver advertises that it can receive packets of a size
>> equal to its window, splitting the packet in this case only
>> [unnecessarily] increases the flow completion time, or delays delivery
>> of the data to the application because it arrives split. As we can see
>> from the trace, the Linux TCP implementation does not wait for an ACK
>> of the first 48 bytes; it sends the remaining 30 bytes immediately
>> afterwards, which is effectively the same as sending 78 bytes at once,
>> since no sending decision is taken between the two segments.
>>
>>    Regardless of this discussion, can you at least suggest a workaround
>> in the application layer to make TCP send the whole packet at once? As
>> I said, I tried TCP_CORK as suggested by Arnaldo (acme), but it does
>> not work for me in this case.
>>
>
> Is this a critical problem for you?
>
> If you give us at least some minutes, or hours, or even days to think
> about this corner case, I am sure we can find a solution :)
>
> There are two factors here: MSS=78 and WIN=78
>
> Thanks

My short answer is: this is not a critical problem for me at all. I
just thought it could easily be fixed by finding the source of the
problem, which, as David and I concluded, is the small, fixed window
advertised by the receiver.

But... this still makes me wonder why it works under Windows but not
under Linux. Thinking about the relation between Win and MSS, my
reasoning is this: if the receiver is telling me that it can receive a
packet the same size as its window, and the window is sufficiently
small with respect to the congestion control mechanism and the MTU,
why postpone flow completion when I can send the data at once, and so
avoid two consecutive PSH segments with no sending decision between
them? In our case MSS == Win, and both are very small compared to the
MTU (almost 20 times smaller, at least on ethernet). I know that
"very", "small", "big", "tall", "short" and so on are vague words, and
everything depends on the point of view, but maybe we can consider the
window very small (at least when it is equal to the MSS) while TCP is
in the slow-start phase below ssthresh; I don't know...

    From one perspective I agree with David that the receiver device
in my case has a somewhat foolish and/or baroque implementation, but
from another perspective it was rather clever to announce MSS ==
window: it prevents the sender from sending more than the receiver can
handle, it avoids spending resources on growing the window, and it
effectively tells the sender "send me your complete sk_write_queue at
once" (speaking in terms of the Linux TCP implementation). But Linux
did not; instead it sent two consecutive packets without any decision
taken between them. Why? How much does it cost to allocate a new
packet and add it to the doubly-linked queue, and how much computation
do we waste processing one more packet for each send()? Well, if that
is not an issue here, or if wasting those resources is cheaper than
making a few checks and sending the packet at once, let's try another
approach...

   Well, I don't know whether the points above are real arguments for
changing the TCP implementation; I just want to solve my problem, and
at the same time I decided to share it with you, since someone else
using Linux may face it too, or may have faced it in the past.

   Finally, one other consideration (at least for my project) is that
I would not like to deploy my application only under Windows (where it
works) and have to tell my customer: well, we built a multi-platform
solution, but due to **this** issue we cannot deploy the system under
Linux, because it simply does not work (at least given all the
alternatives and workarounds I mentioned in my previous e-mails).

Leandro.


* Re: TCP packet size and delivery packet decisions
  2010-09-07  7:16                 ` ツ Leandro Melo de Sales
@ 2010-09-07  7:32                   ` Eric Dumazet
  2010-09-08 14:01                     ` ツ Leandro Melo de Sales
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2010-09-07  7:32 UTC (permalink / raw)
  To: ツ Leandro Melo de Sales; +Cc: David Miller, netdev, Thiago Luiz

On Tuesday 07 September 2010 at 04:16 -0300, ツ Leandro Melo de Sales
wrote:

> My short answer is: this is not a critical problem for me at all. I
> just thought it could easily be fixed by finding the source of the
> problem, which, as David and I concluded, is the small, fixed window
> advertised by the receiver.
> 
> But... this still makes me wonder why it works under Windows but not
> under Linux. Thinking about the relation between Win and MSS, my
> reasoning is this: if the receiver is telling me that it can receive a
> packet the same size as its window, and the window is sufficiently
> small with respect to the congestion control mechanism and the MTU,
> why postpone flow completion when I can send the data at once, and so
> avoid two consecutive PSH segments with no sending decision between
> them? In our case MSS == Win, and both are very small compared to the
> MTU (almost 20 times smaller, at least on ethernet). I know that
> "very", "small", "big", "tall", "short" and so on are vague words, and
> everything depends on the point of view, but maybe we can consider the
> window very small (at least when it is equal to the MSS) while TCP is
> in the slow-start phase below ssthresh; I don't know...
> 
>     From one perspective I agree with David that the receiver device
> in my case has a somewhat foolish and/or baroque implementation, but
> from another perspective it was rather clever to announce MSS ==
> window: it prevents the sender from sending more than the receiver can
> handle, it avoids spending resources on growing the window, and it
> effectively tells the sender "send me your complete sk_write_queue at
> once" (speaking in terms of the Linux TCP implementation). But Linux
> did not; instead it sent two consecutive packets without any decision
> taken between them. Why? How much does it cost to allocate a new
> packet and add it to the doubly-linked queue, and how much computation
> do we waste processing one more packet for each send()? Well, if that
> is not an issue here, or if wasting those resources is cheaper than
> making a few checks and sending the packet at once, let's try another
> approach...
> 
>    Well, I don't know whether the points above are real arguments for
> changing the TCP implementation; I just want to solve my problem, and
> at the same time I decided to share it with you, since someone else
> using Linux may face it too, or may have faced it in the past.
> 
>    Finally, one other consideration (at least for my project) is that
> I would not like to deploy my application only under Windows (where it
> works) and have to tell my customer: well, we built a multi-platform
> solution, but due to **this** issue we cannot deploy the system under
> Linux, because it simply does not work (at least given all the
> alternatives and workarounds I mentioned in my previous e-mails).


Really this has nothing to do with congestion.

We send _one_ packet, and this packet does not have the optimum size.

This can be fixed, with a 100% probability :)

Quite frankly, if your application depends on _one_ packet being sent
instead of two, you can do even better under linux, avoiding the third
packet (pure ACK) of the tcp session :=)

192.168.0.34    192.168.0.70    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
192.168.0.70    192.168.0.34    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
192.168.0.34    192.168.0.70    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78

Nice, isn't it?

BTW, which version of the Linux kernel are you using?





* Re: TCP packet size and delivery packet decisions
  2010-09-07  5:30           ` David Miller
  2010-09-07  6:02             ` ツ Leandro Melo de Sales
@ 2010-09-07 11:39             ` Eric Dumazet
  2010-09-07 12:15               ` Ilpo Järvinen
                                 ` (3 more replies)
  1 sibling, 4 replies; 23+ messages in thread
From: Eric Dumazet @ 2010-09-07 11:39 UTC (permalink / raw)
  To: David Miller; +Cc: leandroal, netdev, Ilpo Järvinen

On Monday 06 September 2010 at 22:30 -0700, David Miller wrote:


> The small 78 byte window is why the sending system is splitting up the
> writes into smaller pieces.
> 
> I presume that the system advertises exactly a 78 byte window because
> this is how large the commands are.  But this is an extremely foolish
> and baroque thing to do, and it's why you are having problems.

I am not sure why TSO added a "Bound mss with half of window"
requirement for tcp_sync_mss()

I tried with MSS=1000 and WIN=1000, and the segment size chosen is 500

With WIN=78, 78/2->39 is then capped to 48 (68U - tp->tcp_header_len)

Is there a hard requirement about segment size being at most half the
window ?

Following patch solves the problem for me :

[PATCH] tcp: bound mss to window in tcp_sync_mss()

Leandro Melo de Sales noticed that if a peer announces a very small
initial tcp window (78 in his case), the first frames sent have
unnecessarily small lengths (48 in his case)

CLNT->SRV    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
SRV->CLNT    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
CLNT->SRV    [ACK] Seq=1 Ack=1 Win=5840 Len=0
CLNT->SRV    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=48
CLNT->SRV    [PSH, ACK] Seq=49 Ack=1 Win=5840 Len=30
SRV->CLNT    [ACK] Seq=1 Ack=49 Win=78 Len=0
SRV->CLNT    [RST, ACK] Seq=1 Ack=79 Win=78 Len=0

tcp_sync_mss() bounds mss to half the window, while it could use the
full window:

CLNT->SRV    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
SRV->CLNT    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
CLNT->SRV    [ACK] Seq=1 Ack=1 Win=5840 Len=0
CLNT->SRV    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78
SRV->CLNT    [ACK] Seq=1 Ack=79 Win=78 Len=0

Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/net/tcp.h     |    9 +++++++++
 net/ipv4/tcp_output.c |    2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index eaa9582..c262676 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -481,6 +481,15 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
 		return pktsize;
 }
 
+/* Bound MSS / TSO packet size with the window */
+static inline int tcp_bound_to_wnd(struct tcp_sock *tp, int pktsize)
+{
+	if (tp->max_window && pktsize > tp->max_window)
+		return max(tp->max_window, 68U - tp->tcp_header_len);
+	else
+		return pktsize;
+}
+
 /* tcp.c */
 extern void tcp_get_info(struct sock *, struct tcp_info *);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de3bd84..49cdbe4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1224,7 +1224,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
 		icsk->icsk_mtup.search_high = pmtu;
 
 	mss_now = tcp_mtu_to_mss(sk, pmtu);
-	mss_now = tcp_bound_to_half_wnd(tp, mss_now);
+	mss_now = tcp_bound_to_wnd(tp, mss_now);
 
 	/* And store cached results */
 	icsk->icsk_pmtu_cookie = pmtu;




* Re: TCP packet size and delivery packet decisions
  2010-09-07 11:39             ` Eric Dumazet
@ 2010-09-07 12:15               ` Ilpo Järvinen
  2010-09-08  3:18                 ` David Miller
  2010-09-07 16:35               ` David Miller
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Ilpo Järvinen @ 2010-09-07 12:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, leandroal, Netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3660 bytes --]

On Tue, 7 Sep 2010, Eric Dumazet wrote:

> On Monday 06 September 2010 at 22:30 -0700, David Miller wrote:
> 
> 
> > The small 78 byte window is why the sending system is splitting up the
> > writes into smaller pieces.
> > 
> > I presume that the system advertises exactly a 78 byte window because
> > this is how large the commands are.  But this is an extremely foolish
> > and baroque thing to do, and it's why you are having problems.
> 
> I am not sure why TSO added a "Bound mss with half of window"
> requirement for tcp_sync_mss()

I've thought it is more related to window behavior in general and 
much, much older than TSO (it certainly seems to be an old one).

I guess we might run into some SWS issue if MSS < rwin < 2*MSS with your 
patch that is avoided by the current approach?

> I tried with MSS=1000 and WIN=1000, and the segment size chosen is 500

Perhaps worth trying an MSS=999 WIN=1000 pair too, to see what happens?

> With WIN=78, 78/2->39 is then capped to 48 (68U - tp->tcp_header_len)
> 
> Is there a hard requirement about segment size being at most half the
> window ?
>
> Following patch solves the problem for me :
> 
> [PATCH] tcp: bound mss to window in tcp_sync_mss()
> 
> Leandro Melo de Sales noticed that if a peer announces a very small
> initial tcp window (78 in his case), the first frames sent have
> unnecessarily small lengths (48 in his case)
> 
> CLNT->SRV    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
> SRV->CLNT    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
> CLNT->SRV    [ACK] Seq=1 Ack=1 Win=5840 Len=0
> CLNT->SRV    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=48
> CLNT->SRV    [PSH, ACK] Seq=49 Ack=1 Win=5840 Len=30
> SRV->CLNT    [ACK] Seq=1 Ack=49 Win=78 Len=0
> SRV->CLNT    [RST, ACK] Seq=1 Ack=79 Win=78 Len=0
> 
> tcp_sync_mss() bounds mss to half the window, while it could use the
> full window:
> 
> CLNT->SRV    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
> SRV->CLNT    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
> CLNT->SRV    [ACK] Seq=1 Ack=1 Win=5840 Len=0
> CLNT->SRV    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78
> SRV->CLNT    [ACK] Seq=1 Ack=79 Win=78 Len=0
> 
> Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  include/net/tcp.h     |    9 +++++++++
>  net/ipv4/tcp_output.c |    2 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index eaa9582..c262676 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -481,6 +481,15 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
>  		return pktsize;
>  }
>  
> +/* Bound MSS / TSO packet size with the window */
> +static inline int tcp_bound_to_wnd(struct tcp_sock *tp, int pktsize)
> +{
> +	if (tp->max_window && pktsize > tp->max_window)
> +		return max(tp->max_window, 68U - tp->tcp_header_len);
> +	else
> +		return pktsize;
> +}
> +
>
>  /* tcp.c */
>  extern void tcp_get_info(struct sock *, struct tcp_info *);
>  
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index de3bd84..49cdbe4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1224,7 +1224,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
>  		icsk->icsk_mtup.search_high = pmtu;
>  
>  	mss_now = tcp_mtu_to_mss(sk, pmtu);
> -	mss_now = tcp_bound_to_half_wnd(tp, mss_now);
> +	mss_now = tcp_bound_to_wnd(tp, mss_now);
>  
>  	/* And store cached results */
>  	icsk->icsk_pmtu_cookie = pmtu;

I don't quite follow why the tcp.c caller should then be left as-is? Though 
it seems to me that it is more complex than the tcp_sync_mss case.

-- 
 i.


* Re: TCP packet size and delivery packet decisions
  2010-09-07 11:39             ` Eric Dumazet
  2010-09-07 12:15               ` Ilpo Järvinen
@ 2010-09-07 16:35               ` David Miller
  2010-09-07 16:57               ` Rick Jones
  2010-09-08 14:06               ` ツ Leandro Melo de Sales
  3 siblings, 0 replies; 23+ messages in thread
From: David Miller @ 2010-09-07 16:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: leandroal, netdev, ilpo.jarvinen

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Sep 2010 13:39:12 +0200

> I am not sure why TSO added a "Bound mss with half of window"
> requirement for tcp_sync_mss()

Because if we TSO to the full window the link doesn't fill up
properly.

We definitely do this on purpose, that's for sure.  :-)

I'll look into the full version control history to unearth the
exact reason so we can validate your patch, good find Eric.


* Re: TCP packet size and delivery packet decisions
  2010-09-07 11:39             ` Eric Dumazet
  2010-09-07 12:15               ` Ilpo Järvinen
  2010-09-07 16:35               ` David Miller
@ 2010-09-07 16:57               ` Rick Jones
  2010-09-08 14:06               ` ツ Leandro Melo de Sales
  3 siblings, 0 replies; 23+ messages in thread
From: Rick Jones @ 2010-09-07 16:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, leandroal, netdev, Ilpo Järvinen

> Is there a hard requirement about segment size being at most half the
> window ?

In the deep, dark past of other stacks, there were concerns about devolving 
into stop-start behavior.  I suppose, from a theoretical standpoint, there is 
a chance that MSS==window, coupled with immediate ACKnowledgement on the 
receiver, could produce zero windows that might elicit undesirable probes?

rick jones


* Re: TCP packet size and delivery packet decisions
  2010-09-07 12:15               ` Ilpo Järvinen
@ 2010-09-08  3:18                 ` David Miller
  2010-09-08 12:14                   ` Alexey Kuznetsov
  0 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2010-09-08  3:18 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: eric.dumazet, leandroal, netdev, kuznet

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 7 Sep 2010 15:15:25 +0300 (EEST)

[ Alexey, the problem is that when the receiver's maximum window is
  minuscule (f.e. equal to MSS :-), we never send full MSS sized frames
  due to our sender side SWS implementation. ]

> On Tue, 7 Sep 2010, Eric Dumazet wrote:
> 
>> On Monday 06 September 2010 at 22:30 -0700, David Miller wrote:
>> 
>> 
>> > The small 78 byte window is why the sending system is splitting up the
>> > writes into smaller pieces.
>> > 
>> > I presume that the system advertises exactly a 78 byte window because
>> > this is how large the commands are.  But this is an extremely foolish
>> > and baroque thing to do, and it's why you are having problems.
>> 
>> I am not sure why TSO added a "Bound mss with half of window"
>> requirement for tcp_sync_mss()
> 
> I've thought it is more related to window behavior in general and 
> much, much older than TSO (it certainly seems to be an old one).
> 
> I guess we might run into some SWS issue if MSS < rwin < 2*MSS with your 
> patch that is avoided by the current approach?

Right, this clamping is part of RFC1122 silly window syndrome avoidance.

In ancient times we used to do this straight in sendmsg(), which had
the comment:

			/* We also need to worry about the window.  If
			 * window < 1/2 the maximum window we've seen
			 * from this host, don't use it.  This is
			 * sender side silly window prevention, as
			 * specified in RFC1122.  (Note that this is
			 * different than earlier versions of SWS
			 * prevention, e.g. RFC813.).  What we
			 * actually do is use the whole MSS.  Since
			 * the results in the right edge of the packet
			 * being outside the window, it will be queued
			 * for later rather than sent.
			 */

But in January 2000, Alexey Kuznetsov moved this logic into tcp_sync_mss().
netdev-vger-2.6 commit is:

--------------------
commit 214d457eb454a70f0f373371de044403834d8042
Author: davem <davem>
Date:   Tue Jan 18 08:24:09 2000 +0000

    Merge in bug fixes and small enhancements
    from Alexey for the TCP/UDP softnet mega-merge
    from the other day.
--------------------

Well, what is the SWS sender rule?  RFC1122 states that we should send
data if (where U == usable window, D == data queued up but not yet sent):

1) if a maximum-sized segment can be sent, i.e., if:

   min(D,U) >= Eff.snd.MSS;

2)  or if the data is pushed and all queued data can
    be sent now, i.e., if:

        [SND.NXT = SND.UNA and] PUSHED and D <= U

    (the bracketed condition is imposed by the Nagle
    algorithm);

3)  or if at least a fraction Fs of the maximum window
    can be sent, i.e., if:

    [SND.NXT = SND.UNA and]

    min(D,U) >= Fs * Max(SND.WND);

4)  or if data is PUSHed and the override timeout
    occurs.

The recommended value for the "Fs" fraction is 1/2, so that's where
the "one-half" logic comes from.

The current code implements this by pre-chopping the MSS to
half the largest window we've seen advertised by the peer, f.e.
when doing sendmsg() packetization.  Packets are packetized
to this 1/2 MAX_WINDOW MSS, represented by mss_now.

Then the send path SWS logic will send any packet that is at least
"mss_now".

Effectively we try to implement test #1 and #3 above at the same
time by just making test #1 and chopping the Eff.snd.MSS by half
the largest receive window we've seen advertised by the peer.

This strategy seems to break down when the peer's MSS and the maximum
receive window we'll ever see from the peer are the same order of
magnitude.

It seems that the conditions above really need to be checked in the
right order, and because we try to combine a later test (case #3) with
an earlier test (case #1) we don't send a full sized frame in this
special scenario.
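
Purely as an illustration (this is not kernel code, just the RFC1122
sender-side test above transcribed into a standalone C helper):

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Nonzero if RFC1122 sender-side SWS avoidance allows sending now.
 * D = data queued but not yet sent, U = usable window, idle means
 * SND.NXT == SND.UNA (nothing in flight). */
static int sws_ok_to_send(unsigned int D, unsigned int U,
			  unsigned int eff_mss, unsigned int max_snd_wnd,
			  int pushed, int idle)
{
	if (MIN(D, U) >= eff_mss)			/* test #1 */
		return 1;
	if (idle && pushed && D <= U)			/* test #2 (Nagle) */
		return 1;
	if (idle && MIN(D, U) >= max_snd_wnd / 2)	/* test #3, Fs = 1/2 */
		return 1;
	return 0;	/* else wait (test #4: the override timeout fires) */
}

With the true Eff.snd.MSS of 78 and a 78-byte PUSHed write, test #1
alone would already permit a full-sized send; the breakage comes from
folding test #3 into test #1 by pre-chopping the MSS to half the
maximum window, as described above.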


* Re: TCP packet size and delivery packet decisions
  2010-09-08  3:18                 ` David Miller
@ 2010-09-08 12:14                   ` Alexey Kuznetsov
  2010-09-13  1:23                     ` David Miller
  0 siblings, 1 reply; 23+ messages in thread
From: Alexey Kuznetsov @ 2010-09-08 12:14 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, eric.dumazet, leandroal, netdev

Hello!

> [ Alexey, the problem is that when the receiver's maximum window is
>   minuscule (f.e. equal to MSS :-), we never send full MSS sized frames
>   due to our sender side SWS implementation. ]

I see.

The problem was that we do early packetization. If we chop
frames to the real mss (> max_window/2), we cannot do #3
(min(D,U) >= Fs * Max(SND.WND)) without subsequent fragmentation.

The solution of chopping to max_window/2 was mine, and it was intended
to solve the problem for devices with a _large_ mtu, where max_window
is comparable with the mtu not because the window is small, but because
the mtu is large. This case was important from a performance viewpoint;
fragmentation would destroy our smart early packetization technique.

As for the case of a sane mtu and a small max_window: that case was not
simply ignored as "not-so-important"; it also had a rational explanation, see below.

If the issue must be resolved, I would suggest to:

1. Complicate mss = min(mss, max_window/2). Probably, do something like:

   if (max_window >= 65536 /* just a guess */)
	mss = min(mss, max_window/2);
   else
	mss = min(mss, max_window);

2. Add SWS avoidance checks in tcp_write_xmit(). It should detect the
   condition when end_seq > tcp_wnd_end(tp) (like now), but proceed with
   fragmentation when tp->snd_nxt == tp->snd_una && tp->snd_wnd >= tp->max_window/2.
   Luckily, all this logic is already there due to TSO; only the conditions
   for when to fragment need to be adjusted a little.


Frankly, I am not so sure that the issue should be resolved.
There is one more aspect, not related to SWS. When mss > max_window/2 we can
have only one segment in the pipe, which is not good. When mss==max_window
and we never see a full sized frame sent, this looks strange, but I bet it
is still better for performance under almost any circumstances.

I do not know the actual context, of course. I can guess the situation
is like this: "I set window=mss exactly to see only one packet in flight
at all times. Why the hell does linux try to send two mss/2 sized ones?" :-)


> In ancient times we used to do this straight in sendmsg(), which had
> the comment:
> 
> 			/* We also need to worry about the window.  If
> 			 * window < 1/2 the maximum window we've seen
> 			 * from this host, don't use it.  This is
> 			 * sender side silly window prevention, as
> 			 * specified in RFC1122.  (Note that this is
> 			 * different than earlier versions of SWS
> 			 * prevention, e.g. RFC813.).  What we
> 			 * actually do is use the whole MSS.  Since
> 			 * the results in the right edge of the packet
> 			 * being outside the window, it will be queued
> 			 * for later rather than sent.
> 			 */

BTW the comment was good, but this logic was not actually implemented.
The code below this comment was incorrect: it chopped segments at the tail
of the write_queue (with seq > snd_nxt) based on a window calculation at
the head of the queue, so it did not work. Actually, this check can be
done no earlier than in tcp_write_xmit().

Alexey


* Re: TCP packet size and delivery packet decisions
  2010-09-07  7:32                   ` Eric Dumazet
@ 2010-09-08 14:01                     ` ツ Leandro Melo de Sales
  0 siblings, 0 replies; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-08 14:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Thiago Luiz

On Tue, Sep 7, 2010 at 4:32 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Tuesday 07 September 2010 at 04:16 -0300, ツ Leandro Melo de Sales
> wrote:
>
> > My short answer is: this is not a critical problem for me at all. I
> > just thought it could easily be fixed by finding the source of the
> > problem, which, as David and I concluded, is the small, fixed window
> > advertised by the receiver.
> >
> > But... this still makes me wonder why it works under Windows but not
> > under Linux. Thinking about the relation between Win and MSS, my
> > reasoning is this: if the receiver is telling me that it can receive a
> > packet the same size as its window, and the window is sufficiently
> > small with respect to the congestion control mechanism and the MTU,
> > why postpone flow completion when I can send the data at once, and so
> > avoid two consecutive PSH segments with no sending decision between
> > them? In our case MSS == Win, and both are very small compared to the
> > MTU (almost 20 times smaller, at least on ethernet). I know that
> > "very", "small", "big", "tall", "short" and so on are vague words, and
> > everything depends on the point of view, but maybe we can consider the
> > window very small (at least when it is equal to the MSS) while TCP is
> > in the slow-start phase below ssthresh; I don't know...
> >
> >     From one perspective I agree with David that the receiver device
> > in my case has a somewhat foolish and/or baroque implementation, but
> > from another perspective it was rather clever to announce MSS ==
> > window: it prevents the sender from sending more than the receiver can
> > handle, it avoids spending resources on growing the window, and it
> > effectively tells the sender "send me your complete sk_write_queue at
> > once" (speaking in terms of the Linux TCP implementation). But Linux
> > did not; instead it sent two consecutive packets without any decision
> > taken between them. Why? How much does it cost to allocate a new
> > packet and add it to the doubly-linked queue, and how much computation
> > do we waste processing one more packet for each send()? Well, if that
> > is not an issue here, or if wasting those resources is cheaper than
> > making a few checks and sending the packet at once, let's try another
> > approach...
> >
> >    Well, I don't know whether the points above are real arguments for
> > changing the TCP implementation; I just want to solve my problem, and
> > at the same time I decided to share it with you, since someone else
> > using Linux may face it too, or may have faced it in the past.
> >
> >    Finally, one other consideration (at least for my project) is that
> > I would not like to deploy my application only under Windows (where it
> > works) and have to tell my customer: well, we built a multi-platform
> > solution, but due to **this** issue we cannot deploy the system under
> > Linux, because it simply does not work (at least given all the
> > alternatives and workarounds I mentioned in my previous e-mails).
>
>
> Really this has nothing to do with congestion.
>
> We send _one_ packet, and this packet does not have the optimum size.
>
> This can be fixed, with a 100% probability :)
>
> Quite frankly, if your application depends on _one_ packet being sent
> instead of two, you can do even better under linux, avoiding the third
> packet (pure ACK) of the tcp session :=)
>
> 192.168.0.34    192.168.0.70    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
> 192.168.0.70    192.168.0.34    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
> 192.168.0.34    192.168.0.70    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78
>
> Nice, isn't it?
>
> BTW, which version of the Linux kernel are you using?
>
>
>

Very nice... piggybacking... ;) But that is not my case; I need to
send more packets, not just one...

Kernel version: 2.6.32, but I have run this on later versions too...

Leandro.


* Re: TCP packet size and delivery packet decisions
  2010-09-07 11:39             ` Eric Dumazet
                                 ` (2 preceding siblings ...)
  2010-09-07 16:57               ` Rick Jones
@ 2010-09-08 14:06               ` ツ Leandro Melo de Sales
  2010-09-08 15:24                 ` David Miller
  3 siblings, 1 reply; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-08 14:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ilpo Järvinen

On Tue, Sep 7, 2010 at 8:39 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Monday 06 September 2010 at 22:30 -0700, David Miller wrote:
>
>
>> The small 78 byte window is why the sending system is splitting up the
>> writes into smaller pieces.
>>
>> I presume that the system advertises exactly a 78 byte window because
>> this is how large the commands are.  But this is an extremely foolish
>> and baroque thing to do, and it's why you are having problems.
>
> I am not sure why TSO added a "Bound mss with half of window"
> requirement for tcp_sync_mss()
>
> I tried with MSS=1000 and WIN=1000, and the segment size chosen is 500
>
> With WIN=78, 78/2->39 is then capped to 48 (68U - tp->tcp_header_len)
>
> Is there a hard requirement about segment size being at most half the
> window ?
>
> Following patch solves the problem for me :
>
> [PATCH] tcp: bound mss to window in tcp_sync_mss()
>
> Leandro Melo de Sales noticed that if a peer announces a very small
> initial tcp window (78 in his case), the first frames sent have
> unnecessarily small lengths (48 in his case)
>
> CLNT->SRV    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
> SRV->CLNT    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
> CLNT->SRV    [ACK] Seq=1 Ack=1 Win=5840 Len=0
> CLNT->SRV    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=48
> CLNT->SRV    [PSH, ACK] Seq=49 Ack=1 Win=5840 Len=30
> SRV->CLNT    [ACK] Seq=1 Ack=49 Win=78 Len=0
> SRV->CLNT    [RST, ACK] Seq=1 Ack=79 Win=78 Len=0
>
> tcp_sync_mss() bounds mss to half the window, while it could use the
> full window:
>
> CLNT->SRV    [SYN] Seq=0 Win=5840 Len=0 MSS=1460
> SRV->CLNT    [SYN, ACK] Seq=0 Ack=1 Win=78 Len=0 MSS=78
> CLNT->SRV    [ACK] Seq=1 Ack=1 Win=5840 Len=0
> CLNT->SRV    [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=78
> SRV->CLNT    [ACK] Seq=1 Ack=79 Win=78 Len=0
>
> Reported-by: ツ Leandro Melo de Sales <leandroal@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> ---
>  include/net/tcp.h     |    9 +++++++++
>  net/ipv4/tcp_output.c |    2 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index eaa9582..c262676 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -481,6 +481,15 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
>                return pktsize;
>  }
>
> +/* Bound MSS / TSO packet size with the window */
> +static inline int tcp_bound_to_wnd(struct tcp_sock *tp, int pktsize)
> +{
> +       if (tp->max_window && pktsize > tp->max_window)
> +               return max(tp->max_window, 68U - tp->tcp_header_len);
> +       else
> +               return pktsize;
> +}
> +
>  /* tcp.c */
>  extern void tcp_get_info(struct sock *, struct tcp_info *);
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index de3bd84..49cdbe4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1224,7 +1224,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu)
>                icsk->icsk_mtup.search_high = pmtu;
>
>        mss_now = tcp_mtu_to_mss(sk, pmtu);
> -       mss_now = tcp_bound_to_half_wnd(tp, mss_now);
> +       mss_now = tcp_bound_to_wnd(tp, mss_now);
>
>        /* And store cached results */
>        icsk->icsk_pmtu_cookie = pmtu;
>
>
>


Hi Eric and others,
   Good. I have tested it and it is working... This is exactly the
translation into code of what I suggested. Very nice...

   Will this patch be applied?

Thank you,
Leandro.


* Re: TCP packet size and delivery packet decisions
  2010-09-08 14:06               ` ツ Leandro Melo de Sales
@ 2010-09-08 15:24                 ` David Miller
  2010-09-08 16:03                   ` ツ Leandro Melo de Sales
  0 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2010-09-08 15:24 UTC (permalink / raw)
  To: leandroal; +Cc: eric.dumazet, netdev, ilpo.jarvinen

From: ツ Leandro Melo de Sales <leandroal@gmail.com>
Date: Wed, 8 Sep 2010 11:06:43 -0300

>    Good. I have tested it and it is working... This is exactly the
> translation into code of what I suggested. Very nice...
> 
>    Will this patch be applied?

No because the patch is wrong, see the rest of the discussion.

That line Eric is removing in his patch needs to be there
because it is part of our implementation of sender side
Silly Window Syndrome avoidance.

We'll have to fix this another way.


* Re: TCP packet size and delivery packet decisions
  2010-09-08 15:24                 ` David Miller
@ 2010-09-08 16:03                   ` ツ Leandro Melo de Sales
  0 siblings, 0 replies; 23+ messages in thread
From: ツ Leandro Melo de Sales @ 2010-09-08 16:03 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, ilpo.jarvinen

2010/9/8 David Miller <davem@davemloft.net>:
> From: ツ Leandro Melo de Sales <leandroal@gmail.com>
> Date: Wed, 8 Sep 2010 11:06:43 -0300
>
>>    Good. I have tested it and it is working... This is exactly the
>> translation into code of what I suggested. Very nice...
>>
>>    Will this patch be applied?
>
> No because the patch is wrong, see the rest of the discussion.
>
> That line Eric is removing in his patch needs to be there
> because it is part of our implementation of sender side
> Silly Window Syndrome avoidance.
>
> We'll have to fix this another way.
>

Yes, just after I replied asking this question, I read the rest of the
discussion, mainly the comments from Alexey about SWS and TSO.

Is someone working on fixing this another way?

Thank you,
Leandro.


* Re: TCP packet size and delivery packet decisions
  2010-09-08 12:14                   ` Alexey Kuznetsov
@ 2010-09-13  1:23                     ` David Miller
  2010-09-14  9:37                       ` Alexey Kuznetsov
  0 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2010-09-13  1:23 UTC (permalink / raw)
  To: kuznet; +Cc: ilpo.jarvinen, eric.dumazet, leandroal, netdev

From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Wed, 8 Sep 2010 16:14:50 +0400

> I do not know the actual context, of course. I can guess the situation
> is like this: "I set window=mss exactly to see only one packet in flight
> at all times. Why the hell does linux try to send two mss/2 sized ones?" :-)

The context is that a very stupid embedded device takes commands that
are all some specific size (I think it was 75 bytes), and its
TCP stack sets MSS and window to exactly 75 bytes.

The device will only process the commands sent to it over TCP properly
if they are sent full sized, 75 bytes, by the sending TCP stack.
Linux is the only stack with which the device will not work.

To be honest, at such tiny MSS and window sizes, such packetization
for the sake of SWS done by us is stupid.

Let's be a little bit postmodern and just turn off this stuff using
perhaps a modified version of your idea #1: elide the SWS "window / 2"
logic unless "max_window >= 512", or something like that.

It seems to handle every angle:

1) Huge MTU (>= 65535) devices are still covered.

2) For sane ethernet-derived MSS and normal large windows,
   behavior stays the same.

3) When speaking to devices using super tiny windows, it is
   apparent that loss recovery performance is not their primary
   concern. :-)

Therefore, how about the following?

diff --git a/include/net/tcp.h b/include/net/tcp.h
index eaa9582..f8744e0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -475,7 +475,15 @@ extern unsigned int tcp_current_mss(struct sock *sk);
 /* Bound MSS / TSO packet size with the half of the window */
 static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
 {
-	if (tp->max_window && pktsize > (tp->max_window >> 1))
+	/* When peer uses tiny windows, there is no use in packetizing
+	 * to sub-MSS pieces for the sake of SWS or making sure there
+	 * are enough packets in the pipe for fast recovery.
+	 *
+	 * On the other hand, for extremely large MSS devices, handling
+	 * smaller than MSS windows in this way does make sense.
+	 */
+	if (tp->max_window >= 512 &&
+	    pktsize > (tp->max_window >> 1))
 		return max(tp->max_window >> 1, 68U - tp->tcp_header_len);
 	else
 		return pktsize;


* Re: TCP packet size and delivery packet decisions
  2010-09-13  1:23                     ` David Miller
@ 2010-09-14  9:37                       ` Alexey Kuznetsov
  2010-09-15 17:29                         ` David Miller
  0 siblings, 1 reply; 23+ messages in thread
From: Alexey Kuznetsov @ 2010-09-14  9:37 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, eric.dumazet, leandroal, netdev

Hello!

> Therefore, how about the following?

This will work.

There is one trap, though. If pktsize > max_window, packets will never
be transmitted in the normal path, only from the probe timer, which
will be very slow.

If we do not want to mess with tcp_write_xmit(), it is still possible
to relax the bound to tp->max_window. (BTW, this does not even
contradict the RFC; it just uses Fs = 1 for tiny windows.) This should work:

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..8cf8cdb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -501,8 +501,22 @@ extern unsigned int tcp_current_mss(struct sock *sk);
 /* Bound MSS / TSO packet size with the half of the window */
 static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
 {
-	if (tp->max_window && pktsize > (tp->max_window >> 1))
-		return max(tp->max_window >> 1, 68U - tp->tcp_header_len);
+	int cutoff;
+
+	/* When peer uses tiny windows, there is no use in packetizing
+	 * to sub-MSS pieces for the sake of SWS or making sure there
+	 * are enough packets in the pipe for fast recovery.
+	 *
+	 * On the other hand, for extremely large MSS devices, handling
+	 * smaller than MSS windows in this way does make sense.
+	 */
+	if (tp->max_window >= 512)
+		cutoff = (tp->max_window >> 1);
+	else
+		cutoff = tp->max_window;
+
+	if (cutoff && pktsize > cutoff)
+		return max(cutoff, 68U - tp->tcp_header_len);
 	else
 		return pktsize;
 }




* Re: TCP packet size and delivery packet decisions
  2010-09-14  9:37                       ` Alexey Kuznetsov
@ 2010-09-15 17:29                         ` David Miller
  0 siblings, 0 replies; 23+ messages in thread
From: David Miller @ 2010-09-15 17:29 UTC (permalink / raw)
  To: kuznet; +Cc: ilpo.jarvinen, eric.dumazet, leandroal, netdev

From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Tue, 14 Sep 2010 13:37:58 +0400

> If we do not want to mess with tcp_write_xmit(), it is still possible
> to relax the bound to tp->max_window. (BTW, this does not even
> contradict the RFC; it just uses Fs = 1 for tiny windows.) This should work:

Looks good to me; here is the final commit I used:

--------------------
From 02923b5c7664d6ce85192ae986b7cdf62ee7dbf7 Mon Sep 17 00:00:00 2001
From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Wed, 15 Sep 2010 10:27:52 -0700
Subject: [PATCH] tcp: Prevent overzealous packetization by SWS logic.

If peer uses tiny MSS (say, 75 bytes) and similarly tiny advertised
window, the SWS logic will packetize to half the MSS unnecessarily.

This causes problems with some embedded devices.

However for large MSS devices we do want to half-MSS packetize
otherwise we never get enough packets into the pipe for things
like fast retransmit and recovery to work.

Be careful also to handle the case where MSS > window, otherwise
we'll never send until the probe timer.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/net/tcp.h |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index eaa9582..2222fc2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -475,8 +475,22 @@ extern unsigned int tcp_current_mss(struct sock *sk);
 /* Bound MSS / TSO packet size with the half of the window */
 static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
 {
-	if (tp->max_window && pktsize > (tp->max_window >> 1))
-		return max(tp->max_window >> 1, 68U - tp->tcp_header_len);
+	int cutoff;
+
+	/* When peer uses tiny windows, there is no use in packetizing
+	 * to sub-MSS pieces for the sake of SWS or making sure there
+	 * are enough packets in the pipe for fast recovery.
+	 *
+	 * On the other hand, for extremely large MSS devices, handling
+	 * smaller than MSS windows in this way does make sense.
+	 */
+	if (tp->max_window >= 512)
+		cutoff = (tp->max_window >> 1);
+	else
+		cutoff = tp->max_window;
+
+	if (cutoff && pktsize > cutoff)
+		return max(cutoff, 68U - tp->tcp_header_len);
 	else
 		return pktsize;
 }
-- 
1.7.2.2
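
For illustration only (not part of the commit): a standalone
re-implementation of the patched bound, assuming tcp_header_len == 20
(no TCP options), shows the behavior in the interesting regimes:

#include <stdio.h>

static unsigned int bound_to_half_wnd(unsigned int max_window,
				      unsigned int pktsize)
{
	unsigned int cutoff;

	/* Tiny windows: bound to the whole window (Fs = 1). */
	if (max_window >= 512)
		cutoff = max_window >> 1;
	else
		cutoff = max_window;

	if (cutoff && pktsize > cutoff)
		return cutoff > 68 - 20 ? cutoff : 68 - 20;
	return pktsize;
}

int main(void)
{
	/* Leandro's device, max_window 78: full 78-byte segments (was 48) */
	printf("%u\n", bound_to_half_wnd(78, 1460));
	/* Normal ethernet flow with a large window: unchanged, 1460 */
	printf("%u\n", bound_to_half_wnd(65535, 1460));
	/* Large MSS comparable to the window: halving preserved, 500 */
	printf("%u\n", bound_to_half_wnd(1000, 1000));
	return 0;
}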



Thread overview: 23+ messages
     [not found] <AANLkTi=_heUejVf-wmEcKd910gVtpD7Hr=cZ_cs2Q8n9@mail.gmail.com>
2010-09-07  4:20 ` TCP packet size and delivery packet decisions ツ Leandro Melo de Sales
2010-09-07  4:36   ` Eric Dumazet
2010-09-07  5:13     ` ツ Leandro Melo de Sales
2010-09-07  5:16       ` David Miller
2010-09-07  5:21         ` ツ Leandro Melo de Sales
2010-09-07  5:30           ` David Miller
2010-09-07  6:02             ` ツ Leandro Melo de Sales
2010-09-07  6:09               ` Eric Dumazet
2010-09-07  7:16                 ` ツ Leandro Melo de Sales
2010-09-07  7:32                   ` Eric Dumazet
2010-09-08 14:01                     ` ツ Leandro Melo de Sales
2010-09-07 11:39             ` Eric Dumazet
2010-09-07 12:15               ` Ilpo Järvinen
2010-09-08  3:18                 ` David Miller
2010-09-08 12:14                   ` Alexey Kuznetsov
2010-09-13  1:23                     ` David Miller
2010-09-14  9:37                       ` Alexey Kuznetsov
2010-09-15 17:29                         ` David Miller
2010-09-07 16:35               ` David Miller
2010-09-07 16:57               ` Rick Jones
2010-09-08 14:06               ` ツ Leandro Melo de Sales
2010-09-08 15:24                 ` David Miller
2010-09-08 16:03                   ` ツ Leandro Melo de Sales
