Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
       [not found] ` <1263767939.8876.94.camel@localhost>
@ 2010-01-18 14:43   ` Anders Boström
  2010-01-20  6:03     ` Jie Yang
  0 siblings, 1 reply; 32+ messages in thread
From: Anders Boström @ 2010-01-18 14:43 UTC (permalink / raw)
  To: ben, netdev, 565404; +Cc: xiong.huang, jie.yang

[-- Attachment #1: Type: Text/Plain, Size: 1228 bytes --]

>>>>> "BH" == Ben Hutchings <ben@decadent.org.uk> writes:

 BH> On Fri, 2010-01-15 at 14:25 +0100, Anders Boström wrote:
 >> When I run NFS over TCP (default options) and read large files from a
 >> server with Atheros AR8121/AR8113/AR8114 Ethernet chip, I only get

 BH> Do you know which specific chip it is?

No, the only information I have is what I get from lspci:

03:00.0 Ethernet controller: Attansic Technology Corp. Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (rev b0)
        Subsystem: ASUSTeK Computer Inc. Device 831c

It is an ASUS M4A78 PRO motherboard with the Atheros
AR8121/AR8113/AR8114 on-board.

 >> ~25Mbyte/s performance. I get ~5000 retransmitted packets per GByte
 >> data, according to RetransSegs in /proc/net/snmp . wireshark in the
 >> client show that the server send out a sequence of frames. All but the
 >> last one are 1500 bytes IP-packets. The last one is shorter, but the
 >> IP-header still say 1500 byte. The client then requests retransmit,
 >> and the retransmitted frame arrives with correct IP-header.

 BH> Please can you send a longer packet capture in pcap format?

Yes, it is attached. Packet 26 in the capture is the offending one.

/ Anders

[-- Attachment #2: bad_IP_head_packet_sequence.pcap --]
[-- Type: Application/Octet-Stream, Size: 38418 bytes --]

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-18 14:43   ` Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken Anders Boström
@ 2010-01-20  6:03     ` Jie Yang
  2010-01-20  9:27       ` Anders Boström
  0 siblings, 1 reply; 32+ messages in thread
From: Jie Yang @ 2010-01-20  6:03 UTC (permalink / raw)
  To: Anders Boström, ben, netdev; +Cc: Xiong Huang

Anders Boström <anders@netinsight.net> wrote:

> It is an ASUS M4A78 PRO motherboard with the Atheros
> AR8121/AR8113/AR8114 on-board.
>
>  >> ~25Mbyte/s performance. I get ~5000 retransmitted packets per GByte
>  >> data, according to RetransSegs in /proc/net/snmp . wireshark in the
>  >> client show that the server send out a sequence of frames. All but the
>  >> last one are 1500 bytes IP-packets. The last one is shorter, but the
>  >> IP-header still say 1500 byte. The client then requests retransmit,
>  >> and the retransmitted frame arrives with correct IP-header.

I just tested it on Linux localhost.localdomain 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux,
with this hardware: Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (rev b0),
device ID 1969:1026 (rev b0).

I uploaded and downloaded a 382M file; it works well, with very few retransmitted packets:

Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 2 4 2 0 2 532501 220631 6 0 2
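
For anyone comparing numbers, RetransSegs can be read out of the two
Tcp: lines mechanically. The following is only a minimal Python sketch,
not part of the test above: the function name parse_tcp_counters is made
up for illustration, and the sample text is simply the two lines pasted
above. On a live system the same text would come from /proc/net/snmp.

```python
def parse_tcp_counters(snmp_text):
    """Return the 'Tcp:' counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) != 2:
        raise ValueError("expected one Tcp: header line and one Tcp: value line")
    header, values = tcp_lines
    # Drop the leading "Tcp:" token from both lines, then pair names with values.
    return dict(zip(header[1:], (int(v) for v in values[1:])))


# The two lines quoted in this message (hypothetical variable name).
SAMPLE = """\
Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs InErrs OutRsts
Tcp: 1 200 120000 -1 2 4 2 0 2 532501 220631 6 0 2
"""

counters = parse_tcp_counters(SAMPLE)
# Share of sent segments that were retransmissions.
retrans_share = counters["RetransSegs"] / counters["OutSegs"]
print(counters["RetransSegs"], counters["OutSegs"], retrans_share)
```

Reading the counters before and after a transfer and subtracting would
give the per-transfer figure, since these are cumulative since boot.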

I also tried kernel 2.6.33-rc1, synced from git, but that kernel failed to boot.


* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-20  6:03     ` Jie Yang
@ 2010-01-20  9:27       ` Anders Boström
  2010-01-21  5:37         ` Jie Yang
  0 siblings, 1 reply; 32+ messages in thread
From: Anders Boström @ 2010-01-20  9:27 UTC (permalink / raw)
  To: Jie.Yang; +Cc: ben, netdev, 565404, Xiong.Huang

>>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:

 JY> Anders Boström <anders@netinsight.net> wrote:
 >> It is an ASUS M4A78 PRO motherboard with the Atheros
 >> AR8121/AR8113/AR8114 on-board.
 >> 
 >> >> ~25Mbyte/s performance. I get ~5000 retransmitted packets per GByte
 >> >> data, according to RetransSegs in /proc/net/snmp . wireshark in the
 >> >> client show that the server send out a sequence of frames. All but the
 >> >> last one are 1500 bytes IP-packets. The last one is shorter, but the
 >> >> IP-header still say 1500 byte. The client then requests retransmit,
 >> >> and the retransmitted frame arrives with correct IP-header.

 JY> i just test it on Linux localhost.localdomain 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux.
 JY> with hardware, Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (rev b0)
 JY> device id : 1969:1026 (rev b0)

 JY> i upload/download a 382M it work well with retransmit packet:

Have you tested NFS over TCP? The block-size the application uses can
have an effect on this. What application did you use? Block-size?

/ Anders

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-20  9:27       ` Anders Boström
@ 2010-01-21  5:37         ` Jie Yang
  2010-01-21 16:42           ` Anders Boström
  0 siblings, 1 reply; 32+ messages in thread
From: Jie Yang @ 2010-01-21  5:37 UTC (permalink / raw)
  To: Anders Boström; +Cc: ben, netdev, 565404, Xiong Huang

 Anders Boström <anders@netinsight.net>
> Sent: Wednesday, January 20, 2010 5:27 PM
> To: Jie Yang
> Cc: ben@decadent.org.uk; netdev@vger.kernel.org;
> 565404@bugs.debian.org; Xiong Huang
> Subject: Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e:
> TSO is broken
>
> >>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:
>
>  JY> Anders Boström <anders@netinsight.net> wrote:
>  >> It is an ASUS M4A78 PRO motherboard with the Atheros
>  >> AR8121/AR8113/AR8114 on-board.
>  >>
>  >> >> ~25Mbyte/s performance. I get ~5000 retransmitted packets per GByte
>  >> >> data, according to RetransSegs in /proc/net/snmp . wireshark in the
>  >> >> client show that the server send out a sequence of frames. All but the
>  >> >> last one are 1500 bytes IP-packets. The last one is shorter, but the
>  >> >> IP-header still say 1500 byte. The client then requests retransmit,
>  >> >> and the retransmitted frame arrives with correct IP-header.
>
>  JY> i just test it on Linux localhost.localdomain 2.6.31.5-127.fc12.x86_64 #1 SMP Sat Nov 7 21:11:14 EST 2009 x86_64 x86_64 x86_64 GNU/Linux.
>  JY> with hardware, Atheros AR8121/AR8113/AR8114 PCI-E Ethernet Controller (rev b0)
>  JY> device id : 1969:1026 (rev b0)
>
>  JY> i upload/download a 382M it work well with retransmit packet:
>
> Have you tested NFS over TCP? The block-size the application
> uses can have an effect on this. What application did you
> use? Block-size?
>
yes, I tested NFS over TCP.

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-21  5:37         ` Jie Yang
@ 2010-01-21 16:42           ` Anders Boström
  2010-01-23 15:29             ` Ben Hutchings
  2010-01-25  5:41             ` Jie Yang
  0 siblings, 2 replies; 32+ messages in thread
From: Anders Boström @ 2010-01-21 16:42 UTC (permalink / raw)
  To: Jie.Yang; +Cc: ben, netdev, 565404, Xiong.Huang

>>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:

 >> Have you tested NFS over TCP? The block-size the application
 >> uses can have an effect on this. What application did you
 >> use? Block-size?
 >> 
 JY> yes, I tested NFS over TCP.

One strange observation is that I can only reproduce this problem when
transmitting data from a NFS-server using TCP with Atheros
AR8121/AR8113/AR8114.

I've tried to reproduce the problem using test programs such as nttcp
and netpipe, without any success. One observation is that the
test programs generate *only* 1500-byte IP packets. When
the NFS server sends data, it generates a sequence of 1500-byte IP
packets ending with a shorter packet. That last packet in the
sequence has 1500 in the IP-header length field, but the frame is shorter.

/ Anders

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-21 16:42           ` Anders Boström
@ 2010-01-23 15:29             ` Ben Hutchings
  2010-01-24  1:36               ` Herbert Xu
  2010-01-25  5:41             ` Jie Yang
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Hutchings @ 2010-01-23 15:29 UTC (permalink / raw)
  To: Anders Boström; +Cc: Jie.Yang, netdev, 565404, Xiong.Huang

[-- Attachment #1: Type: text/plain, Size: 3021 bytes --]

On Thu, 2010-01-21 at 17:42 +0100, Anders Boström wrote:
> >>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:
> 
>  >> Have you tested NFS over TCP? The block-size the application
>  >> uses can have an effect on this. What application did you
>  >> use? Block-size?
>  >> 
>  JY> yes, I tested NFS over TCP.
> 
> One strange observation is that I can only reproduce this problem when
> transmitting data from a NFS-server using TCP with Atheros
> AR8121/AR8113/AR8114.
> 
> I've tried to reproduce the problem using test-programs, like nttcp
> and netpipe, without any success. One observation is that the
> test-programs *only* generates 1500 bytes IP-packets. When
> the NFS-server sends data, a sequence of 1500 bytes IP-packets are
> generated, ending with a shorter packet. And this last packet in the
> sequence has 1500 in the IP-header length field, but is shorter.

I ran tcpdump over your packet capture and saw:

13:48:39.122723 00:26:18:ae:69:6d > 00:18:f3:52:22:3f, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 64, id 32664, offset 0, flags [DF], proto TCP (6), length 1500)
    10.100.0.88.2049 > 10.100.1.25.888: Flags [.], cksum 0x3ebd (correct), seq 21720:23168, ack 157, win 501, options [nop,nop,TS val 152460082 ecr 1212787170], length 1448
13:48:39.122733 00:18:f3:52:22:3f > 00:26:18:ae:69:6d, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 39773, offset 0, flags [DF], proto TCP (6), length 52)
    10.100.1.25.888 > 10.100.0.88.2049: Flags [.], cksum 0x5cfc (correct), ack 23168, win 58293, options [nop,nop,TS val 1212787170 ecr 152460082], length 0
13:48:39.122742 00:26:18:ae:69:6d > 00:18:f3:52:22:3f, ethertype IPv4 (0x0800), length 1462: truncated-ip - 52 bytes missing! (tos 0x0, ttl 64, id 32664, offset 0, flags [DF], proto TCP (6), length 1500)
    10.100.0.88.2049 > 10.100.1.25.888: Flags [.], seq 23168:24616, ack 157, win 501, options [nop,nop,TS val 152460082 ecr 1212787170], length 1448
13:48:39.122747 00:26:18:ae:69:6d > 00:18:f3:52:22:3f, ethertype IPv4 (0x0800), length 1514: (tos 0x0, ttl 64, id 32666, offset 0, flags [DF], proto TCP (6), length 1500)
    10.100.0.88.2049 > 10.100.1.25.888: Flags [.], cksum 0x33a1 (correct), seq 24564:26012, ack 157, win 501, options [nop,nop,TS val 152460082 ecr 1212787170], length 1448

Based on the TCP sequence numbers, it seems that the length of the
broken packet is correct but its IP header is wrong.
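
The arithmetic behind that, written out as a small Python sketch. The
numbers are taken from the tcpdump output above; the function name
tcp_payload_on_wire is only for illustration, and it assumes the IP and
TCP header sizes implied by the lengths that tcpdump printed.

```python
ETH_HEADER_LEN = 14


def tcp_payload_on_wire(frame_len, ip_total_length, tcp_payload_claimed):
    """TCP payload bytes that actually fit in a frame of frame_len bytes."""
    # IP + TCP header bytes, derived from the claimed lengths.
    header_len = ip_total_length - tcp_payload_claimed
    return frame_len - ETH_HEADER_LEN - header_len


# Third packet in the dump: frame length 1462, IP length 1500, TCP length 1448.
payload = tcp_payload_on_wire(frame_len=1462, ip_total_length=1500,
                              tcp_payload_claimed=1448)

# The broken packet starts at seq 23168; the next data packet starts at 24564.
seq_gap = 24564 - 23168

print(payload, seq_gap)  # prints "1396 1396"
```

Both values are 1396, so the sender advanced its sequence number by the
number of payload bytes that were really in the frame, not by the 1448
that the IP header implies.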

My understanding is that the length of the TCP payload in a GSO skb must
always be a multiple of the gso_size, so that hardware is not required
to adjust length fields.  So I see several possible explanations:

1. Something generated invalid GSO skbs (unlikely; other hardware should
show the same problem)
2. The driver constructed TSO DMA descriptors for a non-GSO skb
3. The hardware is continuing to apply TSO to packets with non-TSO DMA
descriptors

Ben.

-- 
Ben Hutchings
Any smoothly functioning technology is indistinguishable from a rigged demo.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-23 15:29             ` Ben Hutchings
@ 2010-01-24  1:36               ` Herbert Xu
  0 siblings, 0 replies; 32+ messages in thread
From: Herbert Xu @ 2010-01-24  1:36 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: anders, Jie.Yang, netdev, 565404, Xiong.Huang

Ben Hutchings <ben@decadent.org.uk> wrote:
> 
> Based on the TCP sequence numbers, it seems that the length of the
> broken packet is correct but its IP header is wrong.
> 
> My understanding is that the length of the TCP payload in a GSO skb must
> always be a multiple of the gso_size, so that hardware is not required
> to adjust length fields.  So I see several possible explanations:

No, there is no such requirement.  The trailer skb can be of any
size less than or equal to gso_size.

However, if the hardware assumed this then yes it would explain
the problem.
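
To illustrate, here is a minimal Python sketch of the length fields for
each segment. This is not kernel or driver code: the function name
segment_lengths is invented, the 4292-byte payload is an arbitrary
example, and gso_size 1448 with 52 bytes of IP + TCP headers is taken
from the tcpdump output earlier in the thread.

```python
def segment_lengths(payload_len, gso_size, header_len):
    """Return (tcp_payload, ip_total_length) for each segment of a payload
    split into pieces of at most gso_size bytes."""
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    segments = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(gso_size, remaining)
        # Each segment's IP total length must reflect its own payload size.
        segments.append((chunk, header_len + chunk))
        remaining -= chunk
    return segments


segments = segment_lengths(payload_len=4292, gso_size=1448, header_len=52)
for tcp_payload, ip_total_length in segments:
    print(tcp_payload, ip_total_length)

# What the last IP header would say if every segment were assumed full-sized.
assumed_full_length = 52 + 1448
print(segments[-1][1], assumed_full_length)
```

The last segment carries 1396 bytes and needs an IP total length of
1448. A device that wrote 1500 into every header would produce exactly
the kind of packet seen in the capture, though that remains a guess
about the hardware.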

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-21 16:42           ` Anders Boström
  2010-01-23 15:29             ` Ben Hutchings
@ 2010-01-25  5:41             ` Jie Yang
  2010-01-25 15:36               ` Anders Boström
  1 sibling, 1 reply; 32+ messages in thread
From: Jie Yang @ 2010-01-25  5:41 UTC (permalink / raw)
  To: Anders Boström; +Cc: ben, netdev, 565404, Xiong Huang

Anders Boström <anders@netinsight.net> wrote:

> Cc: ben@decadent.org.uk; netdev@vger.kernel.org;
> 565404@bugs.debian.org; Xiong Huang
> Subject: Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e:
> TSO is broken

> One strange observation is that I can only reproduce this
> problem when transmitting data from a NFS-server using TCP
> with Atheros AR8121/AR8113/AR8114.
>
> I've tried to reproduce the problem using test-programs, like
> nttcp and netpipe, without any success. One observation is
> that the test-programs *only* generates 1500 bytes
> IP-packets. When the NFS-server sends data, a sequence of
> 1500 bytes IP-packets are generated, ending with a shorter
> packet. And this last packet in the sequence has 1500 in the
> IP-header length field, but is shorter.
>
The following is my test case:

An NFS server with an AR8131 chip, device ID 1063, exports /tmp/ as the NFS share directory.
The client mounts server_ip:/tmp on the local directory /mnt/nfs and uses a Python script to write and read data in
/mnt/nfs/testnfs.log. It works fine.

Can you give me some advice on how to reproduce this bug?

Best wishes
jie

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-25  5:41             ` Jie Yang
@ 2010-01-25 15:36               ` Anders Boström
  2010-01-26  2:04                 ` Jie Yang
  0 siblings, 1 reply; 32+ messages in thread
From: Anders Boström @ 2010-01-25 15:36 UTC (permalink / raw)
  To: Jie.Yang; +Cc: ben, netdev, 565404, Xiong.Huang

>>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:

 JY> Anders Boström <anders@netinsight.net> wrote:
 >> Cc: ben@decadent.org.uk; netdev@vger.kernel.org;
 >> 565404@bugs.debian.org; Xiong Huang
 >> Subject: Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e:
 >> TSO is broken

 >> One strange observation is that I can only reproduce this
 >> problem when transmitting data from a NFS-server using TCP
 >> with Atheros AR8121/AR8113/AR8114.
 >> 
 >> I've tried to reproduce the problem using test-programs, like
 >> nttcp and netpipe, without any success. One observation is
 >> that the test-programs *only* generates 1500 bytes
 >> IP-packets. When the NFS-server sends data, a sequence of
 >> 1500 bytes IP-packets are generated, ending with a shorter
 >> packet. And this last packet in the sequence has 1500 in the
 >> IP-header length field, but is shorter.
 >> 
 JY> following is my test case,

 JY> a nfs server with ar8131 chip, device id 1063. export /tmp/ dir as the nfs share directory,
 JY> the client, mount the server_ip:/tmp to local dir /mnt/nfs, use a python script to write and read data on the
 JY> /mnt/nfs/testnfs.log. it works fine.

OK, the device-ID in our NFS-server is 1026, rev. b0. So it is
possible that the problem is specific to that chip/version.

 JY> Can you give me some advice on how to reproduce this bug??

The only suggestion I have is to try to find a board with a 1026-chip
on it.

My test case is just a copy of a 1 Gbyte file from the
NFS-server to /dev/null, after making sure that the file isn't cached
on the client by reading huge amounts of other data.

/ Anders

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-25 15:36               ` Anders Boström
@ 2010-01-26  2:04                 ` Jie Yang
  2010-01-26  8:34                   ` Anders Boström
  0 siblings, 1 reply; 32+ messages in thread
From: Jie Yang @ 2010-01-26  2:04 UTC (permalink / raw)
  To: Anders Boström; +Cc: ben, netdev, 565404, Xiong Huang

Anders Boström <anders@netinsight.net> wrote:
>  >> IP-header length field, but is shorter.
>  >>
>  JY> following is my test case,
>
>  JY> a nfs server with ar8131 chip, device id 1063. export /tmp/ dir as the nfs share directory,
>  JY> the client, mount the server_ip:/tmp to local dir /mnt/nfs, use a python script to write and read data on the
>  JY> /mnt/nfs/testnfs.log. it works fine.
>
> OK, the device-ID in our NFS-server is 1026, rev. b0. So it
> is possible that the problem is specific to that chip/version.
Oops, that was my mistake in writing; the device ID in my case is 1026.

>
>  JY> Can you give me some advice on how to reproduce this bug??
>
> The only suggestion I have is to try to find a board with a
> 1026-chip on it.
>
> My test-case is just copy of a 1 Gbyte file from the
> NFS-server to /dev/null , after making sure that the file
> isn't cached on the client by reading huge amounts of other data.
>
Just to check: is the kernel version 2.6.26-2?

Best wishes
jie

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-26  2:04                 ` Jie Yang
@ 2010-01-26  8:34                   ` Anders Boström
  2013-03-31  0:25                     ` Ben Hutchings
  0 siblings, 1 reply; 32+ messages in thread
From: Anders Boström @ 2010-01-26  8:34 UTC (permalink / raw)
  To: Jie.Yang; +Cc: ben, netdev, 565404, Xiong.Huang

>>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:

 JY> Anders Boström <anders@netinsight.net> wrote:

 JY> following is my test case,
 >> 
 JY> a nfs server with ar8131 chip, device id 1063. export /tmp/ dir as the nfs share directory,
 JY> the client, mount the server_ip:/tmp to local dir /mnt/nfs, use a python script to write and read data on the
 JY> /mnt/nfs/testnfs.log. it works fine.
 >> 
 >> OK, the device-ID in our NFS-server is 1026, rev. b0. So it
 >> is possible that the problem is specific to that chip/version.
 JY> oops, its my mistake in writing, my case is 1026 device ID

 >> 
 JY> Can you give me some advice on how to reproduce this bug??
 >> 
 >> The only suggestion I have is to try to find a board with a
 >> 1026-chip on it.
 >> 
 >> My test-case is just copy of a 1 Gbyte file from the
 >> NFS-server to /dev/null , after making sure that the file
 >> isn't cached on the client by reading huge amounts of other data.
 >> 
 JY> just to check, if the kernel version is 2.6.26-2 ??

I've tested with
Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2,
Debian linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and
kernel.org 2.6.30.10 amd64 with ethtool patch for setting of tso. Same
result.

/ Anders

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2010-01-26  8:34                   ` Anders Boström
@ 2013-03-31  0:25                     ` Ben Hutchings
  2013-03-31  0:43                       ` Huang, Xiong
                                         ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Ben Hutchings @ 2013-03-31  0:25 UTC (permalink / raw)
  To: Anders Boström; +Cc: Jie.Yang, netdev, 565404, Xiong.Huang

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

On Tue, 2010-01-26 at 09:34 +0100, Anders Boström wrote:
> >>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:
> 
>  JY> Anders Boström <anders@netinsight.net> wrote:
> 
>  JY> following is my test case,
>  >> 
>  JY> a nfs server with ar8131 chip, device id 1063. export /tmp/ dir as the nfs share directory,
>  JY> the client, mount the server_ip:/tmp to local dir /mnt/nfs, use a python script to write and read data on the
>  JY> /mnt/nfs/testnfs.log. it works fine.
>  >> 
>  >> OK, the device-ID in our NFS-server is 1026, rev. b0. So it
>  >> is possible that the problem is specific to that chip/version.
>  JY> oops, its my mistake in writing, my case is 1026 device ID
> 
>  >> 
>  JY> Can you give me some advice on how to reproduce this bug??
>  >> 
>  >> The only suggestion I have is to try to find a board with a
>  >> 1026-chip on it.
>  >> 
>  >> My test-case is just copy of a 1 Gbyte file from the
>  >> NFS-server to /dev/null , after making sure that the file
>  >> isn't cached on the client by reading huge amounts of other data.
>  >> 
>  JY> just to check, if the kernel version is 2.6.26-2 ??
> 
> I've tested with
> Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2,
> Debian linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and
> kernel.org 2.6.30.10 amd64 with ethtool patch for setting of tso. Same
> result.

Does booting with the kernel parameter 'pci=nomsi' avoid the problem?

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-03-31  0:25                     ` Ben Hutchings
@ 2013-03-31  0:43                       ` Huang, Xiong
  2013-03-31  1:18                       ` Huang, Xiong
                                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Huang, Xiong @ 2013-03-31  0:43 UTC (permalink / raw)
  To: Ben Hutchings, Anders Boström
  Cc: netdev, 565404, Hannes Frederic Sowa (hannes@stressinduktion.org)

> >
> > I've tested with
> > Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2, Debian
> > linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and kernel.org
> > 2.6.30.10 amd64 with ethtool patch for setting of tso. Same result.
> 
> Does booting with the kernel parameter 'pci=nomsi' avoid the problem?
> 

Hannes has found that DMA writes (for RX packets) behave abnormally when MSI is enabled.
But TSO is for TX packets, the opposite direction, so I'm not sure :(
If someone has this issue, he/she could give it a try.

Thanks
Xiong

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-03-31  0:25                     ` Ben Hutchings
  2013-03-31  0:43                       ` Huang, Xiong
@ 2013-03-31  1:18                       ` Huang, Xiong
  2013-03-31  2:10                         ` Ben Hutchings
  2013-03-31 21:11                       ` Hannes Frederic Sowa
  2013-04-02  7:35                       ` Anders Boström
  3 siblings, 1 reply; 32+ messages in thread
From: Huang, Xiong @ 2013-03-31  1:18 UTC (permalink / raw)
  To: Ben Hutchings, Anders Boström
  Cc: netdev, 565404, Hannes Frederic Sowa (hannes@stressinduktion.org)

> > >
> > > I've tested with
> > > Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2, Debian
> > > linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and
> > > kernel.org
> > > 2.6.30.10 amd64 with ethtool patch for setting of tso. Same result.
> >
> > Does booting with the kernel parameter 'pci=nomsi' avoid the problem?
> >
> 
> Hannes has found DMA-write (for rx-packet)  is abnormal due to msi function.
> But TSO is for tx-packet, the opposite direction. I'm not sure :(, If someone
> has this issue,  he/she could have a try.
> 

I checked the Windows driver; it does limit the maximum packet length for TSO:
Windows XP: 32*1024 bytes (including the MAC header and all MAC payload). IP/TCP options are not supported.
Windows 7: 15,000 bytes. IP/TCP options are supported.

Thanks
Xiong


* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-03-31  1:18                       ` Huang, Xiong
@ 2013-03-31  2:10                         ` Ben Hutchings
  2013-04-01  2:51                           ` Huang, Xiong
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Hutchings @ 2013-03-31  2:10 UTC (permalink / raw)
  To: Huang, Xiong
  Cc: Anders Boström, netdev, 565404,
	Hannes Frederic Sowa (hannes@stressinduktion.org)

[-- Attachment #1: Type: text/plain, Size: 1351 bytes --]

On Sun, 2013-03-31 at 02:18 +0000, Huang, Xiong wrote:
> > > >
> > > > I've tested with
> > > > Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2, Debian
> > > > linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and
> > > > kernel.org
> > > > 2.6.30.10 amd64 with ethtool patch for setting of tso. Same result.
> > >
> > > Does booting with the kernel parameter 'pci=nomsi' avoid the problem?
> > >
> > 
> > Hannes has found DMA-write (for rx-packet)  is abnormal due to msi function.
> > But TSO is for tx-packet, the opposite direction. I'm not sure :(, If someone
> > has this issue,  he/she could have a try.
> > 
> 
> I checked windows driver, it does limit  the max packet length for TSO
> windows XP : 32*1024 bytes (include MAC header and all MAC payload). No support IP/TCP option.
> Windows 7:  15, 000 bytes, support IP/TCP option.

If TSO on these devices doesn't work properly with TCP options then you're
just going to have to disable it - Linux requires it to support at least
the timestamp option.  I'm not sure about IP options (this really ought
to be documented).

If there's a length limit lower than 64K, you'll need to set the limit
using netif_set_gso_max_size() before registering the net device.

Ben.

-- 
Ben Hutchings
Teamwork is essential - it allows you to blame someone else.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-03-31  0:25                     ` Ben Hutchings
  2013-03-31  0:43                       ` Huang, Xiong
  2013-03-31  1:18                       ` Huang, Xiong
@ 2013-03-31 21:11                       ` Hannes Frederic Sowa
  2013-04-02  7:35                       ` Anders Boström
  3 siblings, 0 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-31 21:11 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Anders Boström, Jie.Yang, netdev, 565404, Xiong.Huang

On Sun, Mar 31, 2013 at 12:25:58AM +0000, Ben Hutchings wrote:
> On Tue, 2010-01-26 at 09:34 +0100, Anders Boström wrote:
> > >>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:
> > 
> >  JY> Anders Boström <anders@netinsight.net> wrote:
> > 
> >  JY> following is my test case,
> >  >> 
> >  JY> a nfs server with ar8131 chip, device id 1063. export /tmp/ dir as the nfs share directory,
> >  JY> the client, mount the server_ip:/tmp to local dir /mnt/nfs, use a python script to write and read data on the
> >  JY> /mnt/nfs/testnfs.log. it works fine.
> >  >> 
> >  >> OK, the device-ID in our NFS-server is 1026, rev. b0. So it
> >  >> is possible that the problem is specific to that chip/version.
> >  JY> oops, its my mistake in writing, my case is 1026 device ID
> > 
> >  >> 
> >  JY> Can you give me some advice on how to reproduce this bug??
> >  >> 
> >  >> The only suggestion I have is to try to find a board with a
> >  >> 1026-chip on it.
> >  >> 
> >  >> My test-case is just copy of a 1 Gbyte file from the
> >  >> NFS-server to /dev/null , after making sure that the file
> >  >> isn't cached on the client by reading huge amounts of other data.
> >  >> 
> >  JY> just to check, if the kernel version is 2.6.26-2 ??
> > 
> > I've tested with
> > Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2,
> > Debian linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and
> > kernel.org 2.6.30.10 amd64 with ethtool patch for setting of tso. Same
> > result.
> 
> Does booting with the kernel parameter 'pci=nomsi' avoid the problem?

Thanks Ben for bringing this up.

I'll see whether I can reproduce it in the next few days and, if so, I'll
try to find a workaround.

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-03-31  2:10                         ` Ben Hutchings
@ 2013-04-01  2:51                           ` Huang, Xiong
  2013-04-02 21:15                             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Xiong @ 2013-04-01  2:51 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Anders Boström, netdev, 565404,
	Hannes Frederic Sowa (hannes@stressinduktion.org)

> >
> > I checked windows driver, it does limit  the max packet length for TSO
> > windows XP : 32*1024 bytes (include MAC header and all MAC payload). No
> support IP/TCP option.
> > Windows 7:  15, 000 bytes, support IP/TCP option.
> 
> If TSO on these devices don't work properly with TCP options then you're
> just going to have to disable it - Linux requires it to support at least the
> timestamp option.  I'm not sure about IP options (this really ought to be
> documented).
> 
> If there's a length limit lower than 64K, you'll need to set the limit using
> netif_set_gso_max_size() before registering the net device.
> 

Ben, thanks for your advice.
I have discussed this with the Windows driver developer and the hardware designer. The TSO limit in the Windows driver exists only
to simplify that driver, because of the buffer length limitation of the TX descriptor. The hardware itself has no limit on
TSO packet length.

BTW, IP/TCP options are supported as well.

Thanks
Xiong

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-03-31  0:25                     ` Ben Hutchings
                                         ` (2 preceding siblings ...)
  2013-03-31 21:11                       ` Hannes Frederic Sowa
@ 2013-04-02  7:35                       ` Anders Boström
  2013-04-02  9:41                         ` Hannes Frederic Sowa
  3 siblings, 1 reply; 32+ messages in thread
From: Anders Boström @ 2013-04-02  7:35 UTC (permalink / raw)
  To: ben; +Cc: Jie.Yang, netdev, 565404, Xiong.Huang

>>>>> "BH" == Ben Hutchings <ben@decadent.org.uk> writes:

 BH> On Tue, 2010-01-26 at 09:34 +0100, Anders Boström wrote:
 >> >>>>> "JY" == Jie Yang <Jie.Yang@Atheros.com> writes:
 >> 
 JY> Anders Boström <anders@netinsight.net> wrote:
 >> 
 JY> following is my test case,
 >> >> 
 JY> a nfs server with ar8131 chip, device id 1063. export /tmp/ dir as the nfs share directory,
 JY> the client, mount the server_ip:/tmp to local dir /mnt/nfs, use a python script to write and read data on the
 JY> /mnt/nfs/testnfs.log. it works fine.
 >> >> 
 >> >> OK, the device-ID in our NFS-server is 1026, rev. b0. So it
 >> >> is possible that the problem is specific to that chip/version.
 JY> oops, its my mistake in writing, my case is 1026 device ID
 >> 
 >> >> 
 JY> Can you give me some advice on how to reproduce this bug??
 >> >> 
 >> >> The only suggestion I have is to try to find a board with a
 >> >> 1026-chip on it.
 >> >> 
 >> >> My test-case is just copy of a 1 Gbyte file from the
 >> >> NFS-server to /dev/null , after making sure that the file
 >> >> isn't cached on the client by reading huge amounts of other data.
 >> >> 
 JY> just to check, if the kernel version is 2.6.26-2 ??
 >> 
 >> I've tested with
 >> Debian linux-image-2.6.26-2-amd64 version 2.6.26-19lenny2,
 >> Debian linux-image-2.6.30-bpo.2-amd64 version 2.6.30-8~bpo50+2 and
 >> kernel.org 2.6.30.10 amd64 with ethtool patch for setting of tso. Same
 >> result.

 BH> Does booting with the kernel parameter 'pci=nomsi' avoid the problem?

I'm sorry, but I can't test this at the moment. The computer with the
TSO problem is running as a file server, so it can't be used for testing.
Also, we don't use the Atheros Ethernet interface any more due to
other problems: a hard hang (reset needed) of the Ethernet interface
roughly every 6 months.

However, the computer is scheduled to be replaced as file-server quite
soon, so I might be able to test this again after the replacement.

/ Anders

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02  7:35                       ` Anders Boström
@ 2013-04-02  9:41                         ` Hannes Frederic Sowa
  2013-04-02 12:22                           ` Anders Boström
  0 siblings, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-02  9:41 UTC (permalink / raw)
  To: Anders Boström; +Cc: ben, Jie.Yang, netdev, 565404, Xiong.Huang

On Tue, Apr 02, 2013 at 09:35:04AM +0200, Anders Boström wrote:
> I'm sorry, but I can't test this at the moment. The computer with the
> TSO-problem is running as a file-server => can't be used for testing.
> Also, we don't use the Atheros Ethernet interface any more due to
> other problems, hard hang (need reset) of the Eth-interface
> every ~6 month's.

The bug is definitely still around. I was able to reproduce it yesterday and
will look for a solution over the next few days.

Do you have any details on the hangs every 6 months? Could you catch
thread dumps or oopses?

Thanks,

  Hannes

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02  9:41                         ` Hannes Frederic Sowa
@ 2013-04-02 12:22                           ` Anders Boström
  0 siblings, 0 replies; 32+ messages in thread
From: Anders Boström @ 2013-04-02 12:22 UTC (permalink / raw)
  To: hannes; +Cc: ben, Jie.Yang, netdev, 565404, Xiong.Huang

>>>>> "HFS" == Hannes Frederic Sowa <hannes@stressinduktion.org> writes:

 HFS> The bug is definitely still around. Yesterday I could reproduce it and will
 HFS> look for a solution in the next days.

This sounds great!

 HFS> Do you have any details on the hangs every 6 months? Could you catch
 HFS> thread dumps or oopses?

As I wrote, the computer is a live file server, so we restarted
it as soon as possible whenever this occurred, and we currently
use an Intel NIC instead.

The following was logged when the hang occurred:

May 19 12:50:32 flash kernel: [12182478.782248] ATL1E 0000:03:00.0: atl1e_clean is called when AT_DOWN
...
Dec  8 15:00:28 flash kernel: [5282450.781172] ATL1E 0000:03:00.0: atl1e_clean is called when AT_DOWN

/ Anders

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-01  2:51                           ` Huang, Xiong
@ 2013-04-02 21:15                             ` Hannes Frederic Sowa
  2013-04-02 21:51                               ` Huang, Xiong
  2013-04-02 22:00                               ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-02 21:15 UTC (permalink / raw)
  To: Huang, Xiong; +Cc: Ben Hutchings, Anders Boström, netdev, 565404

On Mon, Apr 01, 2013 at 02:51:56AM +0000, Huang, Xiong wrote:
> > >
> > > I checked windows driver, it does limit  the max packet length for TSO
> > > windows XP : 32*1024 bytes (include MAC header and all MAC payload). No
> > support IP/TCP option.
> > > Windows 7:  15, 000 bytes, support IP/TCP option.
> > 
> > If TSO on these devices don't work properly with TCP options then you're
> > just going to have to disable it - Linux requires it to support at least the
> > timestamp option.  I'm not sure about IP options (this really ought to be
> > documented).
> > 
> > If there's a length limit lower than 64K, you'll need to set the limit using
> > netif_set_gso_max_size() before registering the net device.
> > 
> 
> Ben, thanks for your advice. 
> I have discussed with windows driver developer and hardware designer, the TSO limitation for win driver is just
> For simplifying windows driver due to the buffer length limitation of TX descriptor. The hardware itself has no limitation on
> TSO packet length.

The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN
in the driver. MAX_TX_BUF_LEN seems to be arbitrarily set to 0x2000. I
can even raise it to 0x3000 and don't see any TCP retransmits. Do you
have any advice on how to size this value (e.g. should we switch to the
Windows values)?

I also found some irregularities in the MTU update code. It differs from the
calculations in the init function (I will send a patch for that).

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 21:15                             ` Hannes Frederic Sowa
@ 2013-04-02 21:51                               ` Huang, Xiong
  2013-04-02 22:19                                 ` Hannes Frederic Sowa
  2013-04-02 22:00                               ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Huang, Xiong @ 2013-04-02 21:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Ben Hutchings, Anders Boström, netdev, 565404

> The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN in
> the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I can even
> raise it to 0x3000 and don't see any tcp retransmits. Do you have an advice on
> how to size this value (e.g. should we switch to the windows values)?
> 

Would you try 0x4000? The buffer-length field in the TX descriptor is 14 bits, so 0x4000 exceeds its maximum value.
Did you find any bug or issue in the code that calculates the length for each TX descriptor?
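
A minimal sketch of why a per-descriptor buffer length of 0x4000 cannot be represented, assuming only what is stated above (the length field is 14 bits wide). The names TPD_BUFLEN_BITS, TPD_BUFLEN_MASK and tpd_buflen_field() are hypothetical and not taken from the driver, and the position of the field inside the descriptor is not modelled.

```c
#include <stdint.h>

/* Assumption: the TX descriptor stores the buffer length in a 14-bit field. */
#define TPD_BUFLEN_BITS 14u
#define TPD_BUFLEN_MASK ((1u << TPD_BUFLEN_BITS) - 1u) /* 0x3FFF */

/* Hypothetical helper: the value that survives in a 14-bit length field. */
static uint32_t tpd_buflen_field(uint32_t len)
{
	return len & TPD_BUFLEN_MASK;
}
```

Under this assumption 0x2000 and 0x3000 are stored unchanged, while 0x4000 wraps to 0.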

Thanks
Xiong

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 21:15                             ` Hannes Frederic Sowa
  2013-04-02 21:51                               ` Huang, Xiong
@ 2013-04-02 22:00                               ` Eric Dumazet
  2013-04-02 22:15                                 ` Hannes Frederic Sowa
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-04-02 22:00 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Huang, Xiong, Ben Hutchings, Anders Boström, netdev, 565404

On Tue, 2013-04-02 at 23:15 +0200, Hannes Frederic Sowa wrote:

> The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN
> in the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I
> can even raise it to 0x3000 and don't see any tcp retransmits. Do you
> have an advice on how to size this value (e.g. should we switch to the
> windows values)?

This looks like an overflow error...

diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
index 7e0a822..7965f89 100644
--- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
+++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
@@ -1569,18 +1569,17 @@ static u16 atl1e_cal_tdp_req(const struct sk_buff *skb)
 {
 	int i = 0;
 	u16 tpd_req = 1;
-	u16 fg_size = 0;
-	u16 proto_hdr_len = 0;
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-		fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+		u32 fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+
 		tpd_req += ((fg_size + MAX_TX_BUF_LEN - 1) >> MAX_TX_BUF_SHIFT);
 	}
 
 	if (skb_is_gso(skb)) {
 		if (skb->protocol == htons(ETH_P_IP) ||
 		   (skb_shinfo(skb)->gso_type == SKB_GSO_TCPV6)) {
-			proto_hdr_len = skb_transport_offset(skb) +
+			u32 proto_hdr_len = skb_transport_offset(skb) +
 					tcp_hdrlen(skb);
 			if (proto_hdr_len < skb_headlen(skb)) {
 				tpd_req += ((skb_headlen(skb) - proto_hdr_len +
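
For illustration only, a stand-alone sketch of the kind of truncation this patch guards against. It reuses the rounding expression from atl1e_cal_tdp_req() and assumes MAX_TX_BUF_LEN is 0x2000 with MAX_TX_BUF_SHIFT equal to 13 (the shift value is inferred from the length, not quoted from the driver). The helper names are hypothetical, and whether sizes this large actually reach the driver is not established here.

```c
#include <stdint.h>

#define MAX_TX_BUF_LEN   0x2000u
#define MAX_TX_BUF_SHIFT 13      /* assumed: log2(MAX_TX_BUF_LEN) */

/* Descriptors needed for one fragment when the size is kept in 32 bits. */
static uint32_t tpd_for_frag_u32(uint32_t frag_size)
{
	return (frag_size + MAX_TX_BUF_LEN - 1) >> MAX_TX_BUF_SHIFT;
}

/* Same calculation, but the size is first stored in a 16-bit variable,
 * so anything above 0xFFFF is silently truncated. */
static uint32_t tpd_for_frag_u16(uint32_t frag_size)
{
	uint16_t fg_size = (uint16_t)frag_size;

	return ((uint32_t)fg_size + MAX_TX_BUF_LEN - 1) >> MAX_TX_BUF_SHIFT;
}
```

In this model a 0x10000-byte fragment needs 8 descriptors with the 32-bit variable but yields 0 with the 16-bit one; for sizes up to 0xFFFF both give the same result.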

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 22:00                               ` Eric Dumazet
@ 2013-04-02 22:15                                 ` Hannes Frederic Sowa
  2013-04-02 22:34                                   ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-02 22:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Huang, Xiong, Ben Hutchings, Anders Boström, netdev, 565404

On Tue, Apr 02, 2013 at 03:00:38PM -0700, Eric Dumazet wrote:
> On Tue, 2013-04-02 at 23:15 +0200, Hannes Frederic Sowa wrote:
> 
> > The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN
> > in the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I
> > can even raise it to 0x3000 and don't see any tcp retransmits. Do you
> > have an advice on how to size this value (e.g. should we switch to the
> > windows values)?
> 
> This looks like an overflow error...

Thanks for your input, Eric.

I have limited time to work on this today, but I nonetheless just tested
your patch without any of my changes and again see a lot of TcpRetransSegs.
Either there really is some hardware limitation or there is another
overflow.

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 21:51                               ` Huang, Xiong
@ 2013-04-02 22:19                                 ` Hannes Frederic Sowa
  2013-04-02 22:23                                   ` Huang, Xiong
  0 siblings, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-02 22:19 UTC (permalink / raw)
  To: Huang, Xiong; +Cc: Ben Hutchings, Anders Boström, netdev, 565404

On Tue, Apr 02, 2013 at 09:51:12PM +0000, Huang, Xiong wrote:
> > The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN in
> > the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I can even
> > raise it to 0x3000 and don't see any tcp retransmits. Do you have an advice on
> > how to size this value (e.g. should we switch to the windows values)?
> > 
> 
> Would you try 0x4000 ? because the buffer-length in TX descriptor is 14bits, 0x4000 exceeds max value.
> Do you find any bug/issue on the code that calculate the length for each TX descriptor ?

Setting MAX_TX_BUF_LEN to 0x4000 gives:

[ 8949.833750] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
[ 8949.833783] IPv6: ADDRCONF(NETDEV_CHANGE): p33p1: link becomes ready
[ 8960.861557] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error (status = 0x5000400)
[ 8960.866879] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
[ 8961.095266] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error (status = 0x5000400)
[ 8961.100791] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>

I have not looked closely at the buffer calculations.

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 22:19                                 ` Hannes Frederic Sowa
@ 2013-04-02 22:23                                   ` Huang, Xiong
  2013-04-03  0:00                                     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Xiong @ 2013-04-02 22:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Ben Hutchings, Anders Boström, netdev, 565404


> 
> On Tue, Apr 02, 2013 at 09:51:12PM +0000, Huang, Xiong wrote:
> > > The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN in
> > > the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I can even
> > > raise it to 0x3000 and don't see any tcp retransmits. Do you have an advice on
> > > how to size this value (e.g. should we switch to the windows values)?
> > >
> >
> > Would you try 0x4000 ? because the buffer-length in TX descriptor is 14bits, 0x4000 exceeds max value.
> > Do you find any bug/issue on the code that calculate the length for each TX descriptor ?
>
> Setting MAX_TX_BUF_LEN to 0x4000
>
> [ 8949.833750] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
> [ 8949.833783] IPv6: ADDRCONF(NETDEV_CHANGE): p33p1: link becomes ready
> [ 8960.861557] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error (status = 0x5000400)
> [ 8960.866879] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
> [ 8961.095266] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error (status = 0x5000400)
> [ 8961.100791] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
> 
Hannes, thanks for your testing!

Simply changing MAX_TX_BUF_LEN to 0x4000 will cause an incorrect TX configuration.
I meant that you could try a gso size limit of 0x4000 (or 0x5000).

Thanks
Xiong


* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 22:15                                 ` Hannes Frederic Sowa
@ 2013-04-02 22:34                                   ` Eric Dumazet
  2013-04-02 23:24                                     ` Hannes Frederic Sowa
  2013-04-03  0:38                                     ` Hannes Frederic Sowa
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-04-02 22:34 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Huang, Xiong, Ben Hutchings, Anders Boström, netdev, 565404

On Wed, 2013-04-03 at 00:15 +0200, Hannes Frederic Sowa wrote:
> On Tue, Apr 02, 2013 at 03:00:38PM -0700, Eric Dumazet wrote:
> > On Tue, 2013-04-02 at 23:15 +0200, Hannes Frederic Sowa wrote:
> > 
> > > The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN
> > > in the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I
> > > can even raise it to 0x3000 and don't see any tcp retransmits. Do you
> > > have an advice on how to size this value (e.g. should we switch to the
> > > windows values)?
> > 
> > This looks like an overflow error...
> 
> Thanks for your input, Eric.
> 
> I am limited in my time to work on this today but nontheless just tested
> your patch without any of my changes and count a lot of TcpRetransSegs
> again. Either there is really some hardware limitation or another
> overflow.

Another overflow...

Really I don't understand why people use u16 instead of u32.

u16 is slower most of the time, and more prone to overflows.

diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
index 7e0a822..48ac487 100644
--- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
+++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
@@ -1569,18 +1569,17 @@ static u16 atl1e_cal_tdp_req(const struct sk_buff *skb)
 {
 	int i = 0;
 	u16 tpd_req = 1;
-	u16 fg_size = 0;
-	u16 proto_hdr_len = 0;
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
-		fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+		u32 fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
+
 		tpd_req += ((fg_size + MAX_TX_BUF_LEN - 1) >> MAX_TX_BUF_SHIFT);
 	}
 
 	if (skb_is_gso(skb)) {
 		if (skb->protocol == htons(ETH_P_IP) ||
 		   (skb_shinfo(skb)->gso_type == SKB_GSO_TCPV6)) {
-			proto_hdr_len = skb_transport_offset(skb) +
+			u32 proto_hdr_len = skb_transport_offset(skb) +
 					tcp_hdrlen(skb);
 			if (proto_hdr_len < skb_headlen(skb)) {
 				tpd_req += ((skb_headlen(skb) - proto_hdr_len +
@@ -1670,7 +1669,7 @@ static void atl1e_tx_map(struct atl1e_adapter *adapter,
 {
 	struct atl1e_tpd_desc *use_tpd = NULL;
 	struct atl1e_tx_buffer *tx_buffer = NULL;
-	u16 buf_len = skb_headlen(skb);
+	u32 buf_len = skb_headlen(skb);
 	u16 map_len = 0;
 	u16 mapped_len = 0;
 	u16 hdr_len = 0;

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 22:34                                   ` Eric Dumazet
@ 2013-04-02 23:24                                     ` Hannes Frederic Sowa
  2013-04-03  0:38                                     ` Hannes Frederic Sowa
  1 sibling, 0 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-02 23:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Huang, Xiong, Ben Hutchings, Anders Boström, netdev, 565404

On Tue, Apr 02, 2013 at 03:34:53PM -0700, Eric Dumazet wrote:
> On Wed, 2013-04-03 at 00:15 +0200, Hannes Frederic Sowa wrote:
> > On Tue, Apr 02, 2013 at 03:00:38PM -0700, Eric Dumazet wrote:
> > > On Tue, 2013-04-02 at 23:15 +0200, Hannes Frederic Sowa wrote:
> > > 
> > > > The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN
> > > > in the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I
> > > > can even raise it to 0x3000 and don't see any tcp retransmits. Do you
> > > > have an advice on how to size this value (e.g. should we switch to the
> > > > windows values)?
> > > 
> > > This looks like an overflow error...
> > 
> > Thanks for your input, Eric.
> > 
> > I am limited in my time to work on this today but nontheless just tested
> > your patch without any of my changes and count a lot of TcpRetransSegs
> > again. Either there is really some hardware limitation or another
> > overflow.
> 
> Another overflow...
> 
> Really I don't understand why people use u16 instead of u32.
> 
> u16 is slower most of the time, and more prone to overflows.

I just tested your patch and still see a rapidly increasing TCP
retransmitted-segments counter.

The maximum skb length hitting the device is 23234 in my tests (as reported
by ftrace). So I actually think it is a device limitation.
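
As a quick arithmetic check, and only as a sketch: the observed maximum of 23234 bytes fits in 16 bits, which is consistent with the u16 changes not altering the result, while it is larger than the 0x2000 and 0x3000 gso limits that made the retransmits disappear. The cap check is a simplified model (a packet is allowed if its length does not exceed the cap), and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define OBSERVED_MAX_SKB_LEN 23234u /* reported by ftrace above */

/* True if a length can be stored in a u16 without truncation. */
static bool fits_in_u16(uint32_t len)
{
	return len <= UINT16_MAX;
}

/* Simplified model: true if a packet of this length is allowed by the cap. */
static bool within_gso_cap(uint32_t len, uint32_t gso_cap)
{
	return len <= gso_cap;
}
```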

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 22:23                                   ` Huang, Xiong
@ 2013-04-03  0:00                                     ` Hannes Frederic Sowa
  2013-04-03  0:12                                       ` Huang, Xiong
  0 siblings, 1 reply; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-03  0:00 UTC (permalink / raw)
  To: Huang, Xiong; +Cc: Ben Hutchings, Anders Boström, netdev, 565404

On Tue, Apr 02, 2013 at 10:23:54PM +0000, Huang, Xiong wrote:
> 
> > 
> > On Tue, Apr 02, 2013 at 09:51:12PM +0000, Huang, Xiong wrote:
> > > > The error vanishes as soon as I put a gso size limit of MAX_TX_BUF_LEN in
> > > > the driver. MAX_TX_BUF_LEN seems to be arbitrary set to 0x2000. I can even
> > > > raise it to 0x3000 and don't see any tcp retransmits. Do you have an advice on
> > > > how to size this value (e.g. should we switch to the windows values)?
> > > >
> > >
> > > Would you try 0x4000 ? because the buffer-length in TX descriptor is 14bits, 0x4000 exceeds max value.
> > > Do you find any bug/issue on the code that calculate the length for each TX descriptor ?
> >
> > Setting MAX_TX_BUF_LEN to 0x4000
> >
> > [ 8949.833750] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
> > [ 8949.833783] IPv6: ADDRCONF(NETDEV_CHANGE): p33p1: link becomes ready
> > [ 8960.861557] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error (status = 0x5000400)
> > [ 8960.866879] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
> > [ 8961.095266] ATL1E 0000:04:00.0 p33p1: PCIE DMA RW error (status = 0x5000400)
> > [ 8961.100791] ATL1E 0000:04:00.0 p33p1: NIC Link is Up <100 Mbps Full Duplex>
> > 
> Hannes,  Thanks for your testing !
> 
>  simply revising MAX_TX_BUF_LEN to 0x4000 will cause incorrect TX configuration...
> I mean you can try to put a gso size limit of 0x4000 (or 0x5000)....

I tested both values with multi-gigabyte NFSv4 traffic and both work.
If I understand you correctly, 0x4000 is a safe limit?

* RE: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-03  0:00                                     ` Hannes Frederic Sowa
@ 2013-04-03  0:12                                       ` Huang, Xiong
  2013-04-03  0:43                                         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Xiong @ 2013-04-03  0:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Ben Hutchings, Anders Boström, netdev, 565404

> > Hannes,  Thanks for your testing !
> >
> >  simply revising MAX_TX_BUF_LEN to 0x4000 will cause incorrect TX configuration...
> > I mean you can try to put a gso size limit of 0x4000 (or 0x5000)....
> 
> I tested both values with multi-gigabyte nfsv4 traffic and both values are ok.
> If I understand you correctly 0x4000 is a safe limit?

Since the Win7 driver uses 15000 bytes as its max packet length for TSO, I think 0x3C00 is safer than 0x4000. :)
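
For scale, a rough sketch of what the candidate limits mean in wire segments. It assumes a simplified model in which the cap covers headers plus payload, with a 1448-byte MSS and 66 bytes of Ethernet/IP/TCP headers. Both numbers are illustrative assumptions, as is the helper name max_segments().

```c
#include <stdint.h>

/* Hypothetical helper: number of wire segments (the last one possibly
 * partial) produced from one packet of at most gso_cap bytes, headers
 * included. Simplified model, not the kernel's exact accounting. */
static uint32_t max_segments(uint32_t gso_cap, uint32_t hdr_len, uint32_t mss)
{
	if (mss == 0 || gso_cap <= hdr_len)
		return 0;
	return (gso_cap - hdr_len + mss - 1) / mss;
}
```

Under these assumptions 15000 and 0x3C00 (15360) both give 11 segments and 0x4000 gives 12, so the two candidate limits differ by a single segment.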

Thanks
Xiong

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-02 22:34                                   ` Eric Dumazet
  2013-04-02 23:24                                     ` Hannes Frederic Sowa
@ 2013-04-03  0:38                                     ` Hannes Frederic Sowa
  1 sibling, 0 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-03  0:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Huang, Xiong, Ben Hutchings, Anders Boström, netdev, 565404

On Tue, Apr 02, 2013 at 03:34:53PM -0700, Eric Dumazet wrote:
> Really I don't understand why people use u16 instead of u32.
> 
> u16 is slower most of the time, and more prone to overflows.
> 
> diff --git a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
> index 7e0a822..48ac487 100644
> --- a/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
> +++ b/drivers/net/ethernet/atheros/atl1e/atl1e_main.c
> @@ -1569,18 +1569,17 @@ static u16 atl1e_cal_tdp_req(const struct sk_buff *skb)
>  {
>  	int i = 0;
>  	u16 tpd_req = 1;
> -	u16 fg_size = 0;
> -	u16 proto_hdr_len = 0;
>  
>  	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
> -		fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
> +		u32 fg_size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
> +
>  		tpd_req += ((fg_size + MAX_TX_BUF_LEN - 1) >> MAX_TX_BUF_SHIFT);
>  	}
>  
>  	if (skb_is_gso(skb)) {
>  		if (skb->protocol == htons(ETH_P_IP) ||
>  		   (skb_shinfo(skb)->gso_type == SKB_GSO_TCPV6)) {
> -			proto_hdr_len = skb_transport_offset(skb) +
> +			u32 proto_hdr_len = skb_transport_offset(skb) +
>  					tcp_hdrlen(skb);
>  			if (proto_hdr_len < skb_headlen(skb)) {
>  				tpd_req += ((skb_headlen(skb) - proto_hdr_len +
> @@ -1670,7 +1669,7 @@ static void atl1e_tx_map(struct atl1e_adapter *adapter,
>  {
>  	struct atl1e_tpd_desc *use_tpd = NULL;
>  	struct atl1e_tx_buffer *tx_buffer = NULL;
> -	u16 buf_len = skb_headlen(skb);
> +	u32 buf_len = skb_headlen(skb);
>  	u16 map_len = 0;
>  	u16 mapped_len = 0;
>  	u16 hdr_len = 0;
> 

I tested this patch on top of the patch which reduces gso max size to 0x3c00.
If you want to submit the patch you could add my Acked-by.

Thanks,

  Hannes

* Re: Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken
  2013-04-03  0:12                                       ` Huang, Xiong
@ 2013-04-03  0:43                                         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 32+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-03  0:43 UTC (permalink / raw)
  To: Huang, Xiong; +Cc: Ben Hutchings, Anders Boström, netdev, 565404

On Wed, Apr 03, 2013 at 12:12:12AM +0000, Huang, Xiong wrote:
> > > Hannes,  Thanks for your testing !
> > >
> > >  simply revising MAX_TX_BUF_LEN to 0x4000 will cause incorrect TX configuration...
> > > I mean you can try to put a gso size limit of 0x4000 (or 0x5000)....
> > 
> > I tested both values with multi-gigabyte nfsv4 traffic and both values are ok.
> > If I understand you correctly 0x4000 is a safe limit?
> 
> Since Win7 driver uses 15000 bytes as its max packet length for TSO, I think 0x3C00 is more safer than 0x4000. :)

Thanks again for helping to resolve this issue. I just submitted a patch
but accidentally dropped the Cc line.

Greetings,

  Hannes

Thread overview: 32+ messages
     [not found] <20100115.142502.968775035345957525.anders@netinsight.net>
     [not found] ` <1263767939.8876.94.camel@localhost>
2010-01-18 14:43   ` Bug#565404: linux-image-2.6.26-2-amd64: atl1e: TSO is broken Anders Boström
2010-01-20  6:03     ` Jie Yang
2010-01-20  9:27       ` Anders Boström
2010-01-21  5:37         ` Jie Yang
2010-01-21 16:42           ` Anders Boström
2010-01-23 15:29             ` Ben Hutchings
2010-01-24  1:36               ` Herbert Xu
2010-01-25  5:41             ` Jie Yang
2010-01-25 15:36               ` Anders Boström
2010-01-26  2:04                 ` Jie Yang
2010-01-26  8:34                   ` Anders Boström
2013-03-31  0:25                     ` Ben Hutchings
2013-03-31  0:43                       ` Huang, Xiong
2013-03-31  1:18                       ` Huang, Xiong
2013-03-31  2:10                         ` Ben Hutchings
2013-04-01  2:51                           ` Huang, Xiong
2013-04-02 21:15                             ` Hannes Frederic Sowa
2013-04-02 21:51                               ` Huang, Xiong
2013-04-02 22:19                                 ` Hannes Frederic Sowa
2013-04-02 22:23                                   ` Huang, Xiong
2013-04-03  0:00                                     ` Hannes Frederic Sowa
2013-04-03  0:12                                       ` Huang, Xiong
2013-04-03  0:43                                         ` Hannes Frederic Sowa
2013-04-02 22:00                               ` Eric Dumazet
2013-04-02 22:15                                 ` Hannes Frederic Sowa
2013-04-02 22:34                                   ` Eric Dumazet
2013-04-02 23:24                                     ` Hannes Frederic Sowa
2013-04-03  0:38                                     ` Hannes Frederic Sowa
2013-03-31 21:11                       ` Hannes Frederic Sowa
2013-04-02  7:35                       ` Anders Boström
2013-04-02  9:41                         ` Hannes Frederic Sowa
2013-04-02 12:22                           ` Anders Boström
