* Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
@ 2015-03-09 19:50 Панов Андрей
  2015-03-10 20:12   ` Панов Андрей
  2015-03-13 13:48 ` Ben Hutchings
  0 siblings, 2 replies; 23+ messages in thread
From: Панов Андрей @ 2015-03-09 19:50 UTC (permalink / raw)
  To: Nimrod Andy, netdev

Hello!

Commit 2b995f63987013bacde99168218f9c7b252bdcf1 in 4.0.0-rc3 introduces a nasty bug in transmit, corrupting packets.

To reproduce:

$ dd if=/dev/zero of=zeros bs=1M count=20
$ md5sum -b zeros
8f4e33f3dc3e414ff94e5fb6905cba8c *zeros

This checksum is correct.

Copy the file "zeros" to another host over NFS, and it gets corrupted: the checksum changes.
The file must be large; small transfers are not affected.

I use an i.MX6 Quad board.

If this commit is reverted, all works fine.
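
To check the expected checksum without relying on a file on disk, here is a minimal Python sketch. It streams zero bytes through MD5 and hashes a received file the same way. The function names are illustrative assumptions, not part of the original report.

```python
import hashlib


def md5_of_zeros(size_mib: int) -> str:
    """Return the hex MD5 of size_mib MiB of zero bytes, fed 1 MiB at a time."""
    chunk = bytes(1024 * 1024)  # 1 MiB of zeros
    digest = hashlib.md5()
    for _ in range(size_mib):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing md5_of_zeros(20) with md5_of_file() run on the copy that arrived on the NFS server shows whether the transfer altered the data.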

--
 Андрей

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-09 19:50 Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3 Панов Андрей
@ 2015-03-10 20:12   ` Панов Андрей
  2015-03-13 13:48 ` Ben Hutchings
  1 sibling, 0 replies; 23+ messages in thread
From: Панов Андрей @ 2015-03-10 20:12 UTC (permalink / raw)
  To: Nimrod Andy, netdev, linux-arm-kernel

Hello!
Adding lakml to cc:

> Commit 2b995f63987013bacde99168218f9c7b252bdcf1 in 4.0.0-rc3 introduces a nasty bug in transmit, corrupting packets.
>
> To reproduce:
>
> $ dd if=/dev/zero of=zeros bs=1M count=20
> $ md5sum -b zeros
> 8f4e33f3dc3e414ff94e5fb6905cba8c *zeros
>
> This checksum is correct.
>
> Copy file "zeros" to another host with NFS, and it gets corrupted, checksum is changed.
> File should be big, small amounts of transmit isn't affected.
>
> I use an i.MX6 Quad board.
>
> If this commit is reverted, all works fine.

3.19 works fine too.
And the corruption is not random. When copying an all-zero file, this is what is received on the NFS host:
$ hd zeros | head -16
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0052c0f0  00 00 00 00 00 00 00 00  1c f0 9f e5 1c f0 9f e5  |................|
0052c100  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
0052c110  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 74 fb 00 00  |............t...|
0052c120  bc ff 93 00 c0 ff 93 00  c4 ff 93 00 c8 ff 93 00  |................|
0052c130  cc ff 93 00 d0 ff 93 00  d4 ff 93 00 d8 ff 93 00  |................|
0052c140  13 00 00 00 28 63 29 20  43 6f 70 79 72 69 67 68  |....(c) Copyrigh|
0052c150  74 20 32 30 30 37 2d 32  30 31 32 2c 20 46 72 65  |t 2007-2012, Fre|
0052c160  65 73 63 61 6c 65 20 53  65 6d 69 63 6f 6e 64 75  |escale Semicondu|
0052c170  63 74 6f 72 2e 20 41 6c  6c 20 72 69 67 68 74 73  |ctor. All rights|
0052c180  20 72 65 73 65 72 76 65  64 2e 00 00 dd 00 2c 41  | reserved.....,A|
0052c190  11 73 00 00 d3 74 00 00  3d 75 00 00 a9 78 00 00  |.s...t..=u...x..|
0052c1a0  4f 78 00 00 75 77 00 00  07 76 00 00 c3 79 00 00  |Ox..uw...v...y..|
0052c1b0  09 7a 00 00 75 7a 00 00  97 22 00 00 49 1f 00 00  |.z..uz..."..I...|
0052c1c0  b9 21 00 00 ff 70 00 00  90 21 90 00 23 4a 52 f8  |.!...p...!..#JR.|

It looks like the zero page, with the vectors at the beginning.
The board is a Freescale i.MX6 Quad-based Embedsky E9.
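
The repeated bytes `1c f0 9f e5` in the dump can be read as one 32-bit little-endian ARM instruction word. The Python sketch below (helper names are my own) decodes only the immediate-offset LDR form. It is meant to illustrate why the data resembles a vector table, not to be a general disassembler.

```python
import struct

# First non-zero bytes in the received file, copied from the hexdump.
VECTOR_BYTES = bytes.fromhex("1cf09fe5")


def decode_ldr_immediate(word: int):
    """Decode an ARM 'LDR Rd, [Rn, #+/-imm12]' word.

    Returns (cond, rd, rn, offset), or None if the word is not a word-sized
    load with an immediate offset, pre-indexed, without writeback.
    """
    if (word >> 26) & 0x3 != 0b01:  # bits 27..26 select single data transfer
        return None
    i = (word >> 25) & 1  # 0 = immediate offset
    p = (word >> 24) & 1  # 1 = pre-indexed
    u = (word >> 23) & 1  # 1 = add offset, 0 = subtract
    b = (word >> 22) & 1  # 0 = word access
    w = (word >> 21) & 1  # 0 = no writeback
    l = (word >> 20) & 1  # 1 = load
    if (i, p, b, w, l) != (0, 1, 0, 0, 1):
        return None
    rn = (word >> 16) & 0xF
    rd = (word >> 12) & 0xF
    imm12 = word & 0xFFF
    return word >> 28, rd, rn, imm12 if u else -imm12


word = struct.unpack("<I", VECTOR_BYTES)[0]
fields = decode_ldr_immediate(word)
print(hex(word), fields)
```

The word is 0xe59ff01c, which decodes to `ldr pc, [pc, #28]` (condition "always", Rd = Rn = r15). That is the usual shape of an ARM exception vector entry, so it is consistent with the "vectors at beginning" reading.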

--
 Андрей

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-09 19:50 Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3 Панов Андрей
  2015-03-10 20:12   ` Панов Андрей
@ 2015-03-13 13:48 ` Ben Hutchings
  2015-03-13 16:41   ` David Miller
  1 sibling, 1 reply; 23+ messages in thread
From: Ben Hutchings @ 2015-03-13 13:48 UTC (permalink / raw)
  To: Панов Андрей
  Cc: Nimrod Andy, netdev

On Mon, 2015-03-09 at 22:50 +0300, Панов Андрей wrote:
> Hello!
> 
> Commit 2b995f63987013bacde99168218f9c7b252bdcf1 in 4.0.0-rc3 introduces a nasty bug in transmit, corrupting packets.
> 
> To reproduce:
> 
> $ dd if=/dev/zero of=zeros bs=1M count=20
> $ md5sum -b zeros
> 8f4e33f3dc3e414ff94e5fb6905cba8c *zeros
> 
> This checksum is correct.
> 
> Copy file "zeros" to another host with NFS, and it gets corrupted, checksum is changed.
> File should be big, small amounts of transmit isn't affected.
> 
> I use an i.MX6 Quad board.
> 
> If this commit is reverted, all works fine.

And the bug described in the commit message doesn't seem to exist in the
previous version.  The change just doesn't make sense to me.

I seem to remember DMA debug checks sometimes causing false positives
for network I/O, too.

Ben.

-- 
Ben Hutchings
Any smoothly functioning technology is indistinguishable from a rigged demo.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-13 13:48 ` Ben Hutchings
@ 2015-03-13 16:41   ` David Miller
  0 siblings, 0 replies; 23+ messages in thread
From: David Miller @ 2015-03-13 16:41 UTC (permalink / raw)
  To: ben; +Cc: rockford, b38611, netdev

From: Ben Hutchings <ben@decadent.org.uk>
Date: Fri, 13 Mar 2015 13:48:47 +0000

> On Mon, 2015-03-09 at 22:50 +0300, Панов Андрей wrote:
>> Hello!
>> 
>> Commit 2b995f63987013bacde99168218f9c7b252bdcf1 in 4.0.0-rc3 introduces a nasty bug in transmit, corrupting packets.
>> 
>> To reproduce:
>> 
>> $ dd if=/dev/zero of=zeros bs=1M count=20
>> $ md5sum -b zeros
>> 8f4e33f3dc3e414ff94e5fb6905cba8c *zeros
>> 
>> This checksum is correct.
>> 
>> Copy file "zeros" to another host with NFS, and it gets corrupted, checksum is changed.
>> File should be big, small amounts of transmit isn't affected.
>> 
>> I use an i.MX6 Quad board.
>> 
>> If this commit is reverted, all works fine.
> 
> And the bug described in the commit message doesn't seem to exist in the
> previous version.  The change just doesn't make sense to me.
> 
> I seem to remember DMA debug checks sometimes causing false positives
> for network I/O, too.

I'd be happy to apply a revert.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-10 20:12   ` Панов Андрей
@ 2015-03-16  9:21     ` fugang.duan at freescale.com
  -1 siblings, 0 replies; 23+ messages in thread
From: fugang.duan @ 2015-03-16  9:21 UTC (permalink / raw)
  To: Панов Андрей, netdev, linux-arm-kernel

From: Панов Андрей <rockford@yandex.ru> Sent: Wednesday, March 11, 2015 4:12 AM
> To: Duan Fugang-B38611; netdev@vger.kernel.org; linux-arm-kernel
> Subject: Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is
> broken. In 4.0.0-rc3
> 
> Hello!
> Adding lakml to cc:
> 
> > Commit 2b995f63987013bacde99168218f9c7b252bdcf1 in 4.0.0-rc3 introduces a nasty bug in transmit, corrupting packets.
> >
> > To reproduce:
> >
> > $ dd if=/dev/zero of=zeros bs=1M count=20
> > $ md5sum -b zeros
> > 8f4e33f3dc3e414ff94e5fb6905cba8c *zeros
> >
> > This checksum is correct.
> >
> > Copy file "zeros" to another host with NFS, and it gets corrupted, checksum is changed.
> > File should be big, small amounts of transmit isn't affected.
> >
> > I use an i.MX6 Quad board.
> >
> > If this commit is reverted, all works fine.
> 
> 3.19 works fine too.
> And it is not random corruption, when copying all-zero file this is received on NFS host:
> $ hd zeros | head -16
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 0052c0f0  00 00 00 00 00 00 00 00  1c f0 9f e5 1c f0 9f e5  |................|
> 0052c100  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
> 0052c110  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 74 fb 00 00  |............t...|
> 0052c120  bc ff 93 00 c0 ff 93 00  c4 ff 93 00 c8 ff 93 00  |................|
> 0052c130  cc ff 93 00 d0 ff 93 00  d4 ff 93 00 d8 ff 93 00  |................|
> 0052c140  13 00 00 00 28 63 29 20  43 6f 70 79 72 69 67 68  |....(c) Copyrigh|
> 0052c150  74 20 32 30 30 37 2d 32  30 31 32 2c 20 46 72 65  |t 2007-2012, Fre|
> 0052c160  65 73 63 61 6c 65 20 53  65 6d 69 63 6f 6e 64 75  |escale Semicondu|
> 0052c170  63 74 6f 72 2e 20 41 6c  6c 20 72 69 67 68 74 73  |ctor. All rights|
> 0052c180  20 72 65 73 65 72 76 65  64 2e 00 00 dd 00 2c 41  | reserved.....,A|
> 0052c190  11 73 00 00 d3 74 00 00  3d 75 00 00 a9 78 00 00  |.s...t..=u...x..|
> 0052c1a0  4f 78 00 00 75 77 00 00  07 76 00 00 c3 79 00 00  |Ox..uw...v...y..|
> 0052c1b0  09 7a 00 00 75 7a 00 00  97 22 00 00 49 1f 00 00  |.z..uz..."..I...|
> 0052c1c0  b9 21 00 00 ff 70 00 00  90 21 90 00 23 4a 52 f8  |.!...p...!..#JR.|
> 
> Looks like zero page with vectors at beginning.
> Freescale i.MX6 Quad-based Embedsky E9 board.
> 
Hi,

I tried your case on an i.MX6q SabreSD board with the net tree kernel.
root@freescale ~$ uname -r
4.0.0-rc3-11071-gf00bbd2

I ran the test steps below 10 times, with zero-filled files ranging from 20 MB to 300 MB,
and compared the md5sum checksums between the i.MX6q and the i.MX6q/PC host; the checksums are the same.

root@freescale ~$ rm zeros
root@freescale ~$ dd if=/dev/zero of=zeros bs=1M count=300
300+0 records in
300+0 records out
314572800 bytes (300.0MB) copied, 1.811459 seconds, 165.6MB/s
root@freescale ~$ md5sum -b zeros
0d97a9cd8bbd7ce75a2a76bb06258915  zeros
root@freescale ~$ cp zeros /mnt/nfs/ -f
root@freescale ~$ rm zeros
root@freescale ~$ cp /mnt/nfs/zeros ./
root@freescale ~$ md5sum -b zeros
0d97a9cd8bbd7ce75a2a76bb06258915  zeros

Is there anything I am missing to reproduce the issue?


Regards,
Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-16  9:21     ` fugang.duan at freescale.com
@ 2015-03-16 13:37       ` Панов Андрей
  -1 siblings, 0 replies; 23+ messages in thread
From: Панов Андрей @ 2015-03-16 13:37 UTC (permalink / raw)
  To: fugang.duan, netdev, linux-arm-kernel

Hi!

16.03.2015, 12:21, "fugang.duan@freescale.com" <fugang.duan@freescale.com>:
>
> I try your case on i.MX6q sabresd board with net tree kernel.
> root@freescale ~$ uname -r
> 4.0.0-rc3-11071-gf00bbd2
>
> With below test steps for 10 time, the zero file size range from 20M to 300M bytes,
> compare the md5sum checksum between i.MX6q and i.MX6q/PC host,  the checksum is the same.
>
> root@freescale ~$ rm zeros
> root@freescale ~$ dd if=/dev/zero of=zeros bs=1M count=300
> 300+0 records in
> 300+0 records out
> 314572800 bytes (300.0MB) copied, 1.811459 seconds, 165.6MB/s
> root@freescale ~$ md5sum -b zeros
> 0d97a9cd8bbd7ce75a2a76bb06258915  zeros
> root@freescale ~$ cp zeros /mnt/nfs/ -f
> root@freescale ~$ rm zeros
> root@freescale ~$ cp /mnt/nfs/zeros ./
> root@freescale ~$ md5sum -b zeros
> 0d97a9cd8bbd7ce75a2a76bb06258915  zeros
>
> Do you have any lost for reproduce the issue ?

Current net tree does not have this issue, it works fine. Thanks to all.

From git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git

apx@astra:~$ uname -r
4.0.0-rc3-00150-g10640d3-dirty


--
 Андрей

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-16 13:37       ` Панов Андрей
@ 2015-03-16 14:01         ` fugang.duan at freescale.com
  -1 siblings, 0 replies; 23+ messages in thread
From: fugang.duan @ 2015-03-16 14:01 UTC (permalink / raw)
  To: Панов Андрей, netdev, linux-arm-kernel

From: Панов Андрей <rockford@yandex.ru> Sent: Monday, March 16, 2015 9:38 PM
> To: Duan Fugang-B38611; netdev@vger.kernel.org; linux-arm-kernel
> Subject: Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is
> broken. In 4.0.0-rc3
> 
> Hi!
> 
> 16.03.2015, 12:21, "fugang.duan@freescale.com"
> <fugang.duan@freescale.com>:
> >
> > I try your case on i.MX6q sabresd board with net tree kernel.
> > root@freescale ~$ uname -r
> > 4.0.0-rc3-11071-gf00bbd2
> >
> > With below test steps for 10 time, the zero file size range from 20M to 300M bytes,
> > compare the md5sum checksum between i.MX6q and i.MX6q/PC host,  the checksum is the same.
> >
> > root@freescale ~$ rm zeros
> > root@freescale ~$ dd if=/dev/zero of=zeros bs=1M count=300
> > 300+0 records in
> > 300+0 records out
> > 314572800 bytes (300.0MB) copied, 1.811459 seconds, 165.6MB/s
> > root@freescale ~$ md5sum -b zeros
> > 0d97a9cd8bbd7ce75a2a76bb06258915  zeros
> > root@freescale ~$ cp zeros /mnt/nfs/ -f
> > root@freescale ~$ rm zeros
> > root@freescale ~$ cp /mnt/nfs/zeros ./
> > root@freescale ~$ md5sum -b zeros
> > 0d97a9cd8bbd7ce75a2a76bb06258915  zeros
> >
> > Do you have any lost for reproduce the issue ?
> 
> Current net tree does not have this issue, it works fine. Thanks to all.
> 
> From git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
> 
> apx@astra:~$ uname -r
> 4.0.0-rc3-00150-g10640d3-dirty
> 
> 
> --
>  Андрей

I tested commit f00bbd2, which includes patch 2b995f63987,
but cannot reproduce your issue. Please double-check it on your board.

Regards,
Andy


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-16 14:01         ` fugang.duan at freescale.com
  (?)
  (?)
@ 2015-03-16 19:09         ` Панов Андрей
  -1 siblings, 0 replies; 23+ messages in thread
From: Панов Андрей @ 2015-03-16 19:09 UTC (permalink / raw)
  To: fugang.duan, netdev, linux-arm-kernel

Hi!

16.03.2015, 17:01, "fugang.duan@freescale.com" <fugang.duan@freescale.com>:
>>>
>>>  Do you have any lost for reproduce the issue ?
>>  Current net tree does not have this issue, it works fine. Thanks to all.
>>
>>  From git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
>>
>>  apx@astra:~$ uname -r
>>  4.0.0-rc3-00150-g10640d3-dirty
>>
>
> I test the commit f00bbd2 that include the patch 2b995f63987.
> But cannot reproduce your issue. Pls double confirm it in your board.

Confirmed that net-next is buggy.

Client kernel version:
apx@astra:~$ uname -r
4.0.0-rc3-00875-gf00bbd2-dirty

100M file of zeros on client:
apx@astra:~$ ls -l zeros 
-rw-r--r-- 1 apx apx 104857600 мар 10 22:33 zeros

File contents (run on client host):
apx@astra:~$ hd zeros
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
06400000

NFS server kernel version:
apx@ct:~$ uname -r
4.0.0-rc2-00480-g29e70e6

100M file of zeros on server:
apx@ct:~$ ls -l zeros
-rw-r--r-- 1 apx apx 104857600 мар 10 22:33 zeros

File contents (run on server host):
apx@ct:~$ hd zeros
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
001d4a80  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
*
001d4aa0  1c f0 9f e5 74 fb 00 00  bc ff 93 00 c0 ff 93 00  |....t...........|
001d4ab0  c4 ff 93 00 c8 ff 93 00  cc ff 93 00 d0 ff 93 00  |................|
001d4ac0  d4 ff 93 00 d8 ff 93 00  13 00 00 00 28 63 29 20  |............(c) |
001d4ad0  43 6f 70 79 72 69 67 68  74 20 32 30 30 37 2d 32  |Copyright 2007-2|
001d4ae0  30 31 32 2c 20 46 72 65  65 73 63 61 6c 65 20 53  |012, Freescale S|
001d4af0  65 6d 69 63 6f 6e 64 75  63 74 6f 72 2e 20 41 6c  |emiconductor. Al|
001d4b00  6c 20 72 69 67 68 74 73  20 72 65 73 65 72 76 65  |l rights reserve|
001d4b10  64 2e 00 00 dd 00 2c 41  11 73 00 00 d3 74 00 00  |d.....,A.s...t..|
001d4b20  3d 75 00 00 a9 78 00 00  4f 78 00 00 75 77 00 00  |=u...x..Ox..uw..|
001d4b30  07 76 00 00 c3 79 00 00  09 7a 00 00 75 7a 00 00  |.v...y...z..uz..|

Any other client works fine with this server.
Also note that the corruption is not random: it looks like the zero page (with the vectors at the beginning),
so somewhere a pointer to the data is being corrupted (and set to zero?).
This is repeated many times.
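
To locate where the corruption starts without reading a long hexdump, something like the following Python sketch could be run on the NFS server. The function name and chunk size are assumptions chosen for illustration.

```python
def first_nonzero_offset(path, chunk_size=1 << 20):
    """Return the byte offset of the first non-zero byte in a file.

    Returns None if the file contains only zero bytes (or is empty).
    """
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return None
            # Everything lstrip() removes is a leading run of zero bytes.
            stripped = chunk.lstrip(b"\x00")
            if stripped:
                return offset + len(chunk) - len(stripped)
            offset += len(chunk)
```

For the server-side dump above, this would be expected to report 0x001d4a80, the first line of `hd` output that is not all zeros.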

I use an Embedsky E9 board: http://en.embedsky.com/product_info.php?cateid=169&id=169
It is a SabreSD-like board with fewer peripherals. On the network side it has a different PHY:
a Realtek RTL8211E instead of the Atheros one, and I have to initialize it this way (board early fixup patch):

--- a/arch/arm/mach-imx/mach-imx6q.c
+++ b/arch/arm/mach-imx/mach-imx6q.c
@@ -166,6 +166,19 @@ static int ar8035_phy_fixup(struct phy_device *dev)
 
 #define PHY_ID_AR8035 0x004dd072
 
+static int rtl8211e_phy_fixup(struct phy_device *dev)
+{
+       phy_write(dev, 0x00, 0x3140);
+       msleep(10);
+       phy_write(dev, 0x00, 0x3340);
+       msleep(10);
+
+       return 0;
+}
+
+#define PHY_ID_RTL8211E 0x001cc915
+#define REALTEK_PHY_ID_MASK 0x001fffff
+
 static void __init imx6q_enet_phy_init(void)
 {
        if (IS_BUILTIN(CONFIG_PHYLIB)) {
@@ -177,6 +190,8 @@ static void __init imx6q_enet_phy_init(void)
                                ar8031_phy_fixup);
                phy_register_fixup_for_uid(PHY_ID_AR8035, 0xffffffef,
                                ar8035_phy_fixup);
+               phy_register_fixup_for_uid(PHY_ID_RTL8211E, REALTEK_PHY_ID_MASK,
+                               rtl8211e_phy_fixup);
        }
 }

(LAKML folks told me this should go somewhere in the network driver, but for now it is there.)

Without this there is no network at all, and I have run this code for a year without any glitch. The 3.19 kernel works fine.
(This patch is what causes "-dirty" in the kernel version.)
And a different PHY initialization cannot cause non-random stream corruption.
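
For readers unfamiliar with the values in the fixup: register 0x00 is the standard MII Basic Mode Control Register (BMCR). The Python sketch below decodes the two written values using the bit names from Linux's mii.h. It also shows a masked PHY ID comparison of the kind phy_register_fixup_for_uid() relies on. The helper names are hypothetical, and the matching function is a simplified model, not the kernel's code.

```python
# BMCR bit definitions, named as in Linux's include/uapi/linux/mii.h.
BMCR_BITS = {
    0x8000: "RESET",
    0x4000: "LOOPBACK",
    0x2000: "SPEED100",
    0x1000: "ANENABLE",
    0x0800: "PDOWN",
    0x0400: "ISOLATE",
    0x0200: "ANRESTART",
    0x0100: "FULLDPLX",
    0x0080: "CTST",
    0x0040: "SPEED1000",
}


def decode_bmcr(value):
    """List the names of the BMCR bits set in value, highest bit first."""
    return [name for bit, name in BMCR_BITS.items() if value & bit]


def fixup_matches(phy_id, uid, mask):
    """Simplified model: a fixup applies when the IDs agree under the mask."""
    return (phy_id & mask) == (uid & mask)


PHY_ID_RTL8211E = 0x001CC915      # from the patch
REALTEK_PHY_ID_MASK = 0x001FFFFF  # from the patch

for value in (0x3140, 0x3340):
    print(hex(value), decode_bmcr(value))
print(fixup_matches(PHY_ID_RTL8211E, PHY_ID_RTL8211E, REALTEK_PHY_ID_MASK))
```

Under these definitions, 0x3140 sets both speed-select bits, full duplex and autonegotiation enable, and 0x3340 additionally sets the restart-autonegotiation bit.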

I found the commit that introduces the bug by looking at the changes in the Freescale network driver
between 3.19 (definitely working) and 4.0.0-rc3 (definitely not working).

Hope this helps.

--
 Андрей

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-16 19:09         ` Панов Андрей
@ 2015-03-17  1:49           ` fugang.duan
  2015-03-21 20:53             ` Russell King - ARM Linux
  1 sibling, 0 replies; 23+ messages in thread
From: fugang.duan @ 2015-03-17  1:49 UTC (permalink / raw)
  To: Панов
	Андрей,
	netdev

From: Панов Андрей <rockford@yandex.ru> Sent: Tuesday, March 17, 2015 3:09 AM
> To: Duan Fugang-B38611; netdev@vger.kernel.org; linux-arm-kernel
> Subject: Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is
> broken. In 4.0.0-rc3
> 
> Hi!
> 
> 16.03.2015, 17:01, "fugang.duan@freescale.com"
> <fugang.duan@freescale.com>:
> >>>
> >>>  Do you have any lost for reproduce the issue ?
> >>  Current net tree does not have this issue, it works fine. Thanks to
> all.
> >>
> >>  From git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
> >>
> >>  apx@astra:~$ uname -r
> >>  4.0.0-rc3-00150-g10640d3-dirty
> >>
> >
> > I test the commit f00bbd2 that include the patch 2b995f63987.
> > But cannot reproduce your issue. Pls double confirm it in your board.
> 
> Confirmed that net-next is buggy.
> 
> Client kernel version:
> apx@astra:~$ uname -r
> 4.0.0-rc3-00875-gf00bbd2-dirty
> 
> 100M file of zeros on client:
> apx@astra:~$ ls -l zeros
> -rw-r--r-- 1 apx apx 104857600 мар 10 22:33 zeros
> 
> File contents (ran on client host):
> apx@astra:~$ hd zeros
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 06400000
> 
> NFS server kernel version:
> apx@ct:~$ uname -r
> 4.0.0-rc2-00480-g29e70e6
> 
> 100M file of zeros on server:
> apx@ct:~$ ls -l zeros
> -rw-r--r-- 1 apx apx 104857600 мар 10 22:33 zeros
> 
> File contents (ran on server host):
> apx@ct:~$ hd zeros
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 001d4a80  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
> *
> 001d4aa0  1c f0 9f e5 74 fb 00 00  bc ff 93 00 c0 ff 93 00  |....t...........|
> 001d4ab0  c4 ff 93 00 c8 ff 93 00  cc ff 93 00 d0 ff 93 00  |................|
> 001d4ac0  d4 ff 93 00 d8 ff 93 00  13 00 00 00 28 63 29 20  |............(c) |
> 001d4ad0  43 6f 70 79 72 69 67 68  74 20 32 30 30 37 2d 32  |Copyright 2007-2|
> 001d4ae0  30 31 32 2c 20 46 72 65  65 73 63 61 6c 65 20 53  |012, Freescale S|
> 001d4af0  65 6d 69 63 6f 6e 64 75  63 74 6f 72 2e 20 41 6c  |emiconductor. Al|
> 001d4b00  6c 20 72 69 67 68 74 73  20 72 65 73 65 72 76 65  |l rights reserve|
> 001d4b10  64 2e 00 00 dd 00 2c 41  11 73 00 00 d3 74 00 00  |d.....,A.s...t..|
> 001d4b20  3d 75 00 00 a9 78 00 00  4f 78 00 00 75 77 00 00  |=u...x..Ox..uw..|
> 001d4b30  07 76 00 00 c3 79 00 00  09 7a 00 00 75 7a 00 00  |.v...y...z..uz..|
> 
> Any other client works fine with this server.
> And note that corruption is not random, it looks like a zero page
> (vectors at beginning), so somewhere pointer to data is corrupted (and
> set to zero(?)).
> This is repeated many times.
> 
> I use an Embedsky E9 board:
> http://en.embedsky.com/product_info.php?cateid=169&id=169
> It is a SabreSD-like board with fewer peripherals, and on the network side
> it has a different PHY: instead of an Atheros it has a Realtek RTL8211E, and
> I have to initialize it this way (board early fixup patch):
> 
> --- a/arch/arm/mach-imx/mach-imx6q.c
> +++ b/arch/arm/mach-imx/mach-imx6q.c
> @@ -166,6 +166,19 @@ static int ar8035_phy_fixup(struct phy_device *dev)
> 
>  #define PHY_ID_AR8035 0x004dd072
> 
> +static int rtl8211e_phy_fixup(struct phy_device *dev)
> +{
> +       phy_write(dev, 0x00, 0x3140);
> +       msleep(10);
> +       phy_write(dev, 0x00, 0x3340);
> +       msleep(10);
> +
> +       return 0;
> +}
> +
> +#define PHY_ID_RTL8211E 0x001cc915
> +#define REALTEK_PHY_ID_MASK 0x001fffff
> +
>  static void __init imx6q_enet_phy_init(void)
>  {
>         if (IS_BUILTIN(CONFIG_PHYLIB)) {
> @@ -177,6 +190,8 @@ static void __init imx6q_enet_phy_init(void)
>                                 ar8031_phy_fixup);
>                 phy_register_fixup_for_uid(PHY_ID_AR8035, 0xffffffef,
>                                 ar8035_phy_fixup);
> +               phy_register_fixup_for_uid(PHY_ID_RTL8211E, REALTEK_PHY_ID_MASK,
> +                               rtl8211e_phy_fixup);
>         }
>  }
> 
> (LAKML folks told me this should go somewhere in the network driver, but
> for now it lives here.)
> 
> Without this there is no network at all, and I have run this code for a
> year without any glitch. The 3.19 kernel works fine.
> (This patch is what causes "-dirty" in the kernel version.)
> And a different PHY initialization cannot cause non-random stream
> corruption.
> 
> I found the commit that introduces the bug by looking at the changes
> between 3.19 (known working) and 4.0.0-rc3 (known broken) in the Freescale
> network driver.
> 
> Hope this helps.
> 
> --
Thank you for double-checking.
The issue cannot be reproduced on an i.MX6q sabresd board with the net-next 4.0.0-rc3-11071-gf00bbd2 kernel.
I don't have an Embedsky E9 board, so I cannot continue investigating this.
If you have an i.MX6q sabresd board, could you try it there?

Regards,
Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-16 19:09         ` Панов Андрей
@ 2015-03-21 20:53             ` Russell King - ARM Linux
  2015-03-21 20:53             ` Russell King - ARM Linux
  1 sibling, 0 replies; 23+ messages in thread
From: Russell King - ARM Linux @ 2015-03-21 20:53 UTC (permalink / raw)
  To: Панов
	Андрей
  Cc: fugang.duan, netdev, linux-arm-kernel

On Mon, Mar 16, 2015 at 10:09:04PM +0300, Панов Андрей wrote:
> apx@ct:~$ hd zeros    
> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 001d4a80  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
> *
> 001d4aa0  1c f0 9f e5 74 fb 00 00  bc ff 93 00 c0 ff 93 00  |....t...........|
> 001d4ab0  c4 ff 93 00 c8 ff 93 00  cc ff 93 00 d0 ff 93 00  |................|
> 001d4ac0  d4 ff 93 00 d8 ff 93 00  13 00 00 00 28 63 29 20  |............(c) |
> 001d4ad0  43 6f 70 79 72 69 67 68  74 20 32 30 30 37 2d 32  |Copyright 2007-2|
> 001d4ae0  30 31 32 2c 20 46 72 65  65 73 63 61 6c 65 20 53  |012, Freescale S|
> 001d4af0  65 6d 69 63 6f 6e 64 75  63 74 6f 72 2e 20 41 6c  |emiconductor. Al|
> 001d4b00  6c 20 72 69 67 68 74 73  20 72 65 73 65 72 76 65  |l rights reserve|
> 001d4b10  64 2e 00 00 dd 00 2c 41  11 73 00 00 d3 74 00 00  |d.....,A.s...t..|
> 001d4b20  3d 75 00 00 a9 78 00 00  4f 78 00 00 75 77 00 00  |=u...x..Ox..uw..|
> 001d4b30  07 76 00 00 c3 79 00 00  09 7a 00 00 75 7a 00 00  |.v...y...z..uz..|

I'm seeing this too with 4.0-rc4 _without_ net-next:

0000ba00  2d 20 d4 e5 80 00 52 e3  01 00 52 13 00 20 a0 03  |- ....R...R.. ..|
0000ba10  01 20 a0 13 a4 00 00 0a  1c f0 9f e5 1c f0 9f e5  |. ..............|
0000ba20  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 1c f0 9f e5  |................|
0000ba30  1c f0 9f e5 1c f0 9f e5  1c f0 9f e5 74 fb 00 00  |............t...|
0000ba40  bc ff 93 00 c0 ff 93 00  c4 ff 93 00 c8 ff 93 00  |................|
0000ba50  cc ff 93 00 d0 ff 93 00  d4 ff 93 00 d8 ff 93 00  |................|
0000ba60  13 00 00 00 28 63 29 20  43 6f 70 79 72 69 67 68  |....(c) Copyrigh|
0000ba70  74 20 32 30 30 37 2d 32  30 31 32 2c 20 46 72 65  |t 2007-2012, Fre|
0000ba80  65 73 63 61 6c 65 20 53  65 6d 69 63 6f 6e 64 75  |escale Semicondu|
0000ba90  63 74 6f 72 2e 20 41 6c  6c 20 72 69 67 68 74 73  |ctor. All rights|
0000baa0  20 72 65 73 65 72 76 65  64 2e 00 00 dd 00 2c 41  | reserved.....,A|
0000bab0  11 73 00 00 d3 74 00 00  3d 75 00 00 a9 78 00 00  |.s...t..=u...x..|
0000bac0  4f 78 00 00 75 77 00 00  07 76 00 00 c3 79 00 00  |Ox..uw...v...y..|
0000bad0  09 7a 00 00 75 7a 00 00  97 22 00 00 49 1f 00 00  |.z..uz..."..I...|
0000bae0  b9 21 00 00 ff 70 00 00  90 21 90 00 23 4a 52 f8  |.!...p...!..#JR.|
0000baf0  20 00 00 68 c8 40 00 f0  01 00 70 47 10 b5 01 23  | ..h.@....pG...#|
0000bb00  1e 4c 1c 3c 8b 40 54 f8  20 00 01 68 0a b1 19 43  |.L.<.@T. ..h...C|

In my case, this appeared in the middle of the bluetooth.ko module.

If we look at physical memory at zero (I'll omit the devmem2 commands):

Value at address 0x00000000: 0xe59ff01c
Value at address 0x00000004: 0xe59ff01c
Value at address 0x00000008: 0xe59ff01c
Value at address 0x0000000c: 0xe59ff01c
Value at address 0x00000010: 0xe59ff01c
Value at address 0x00000014: 0xe59ff01c
Value at address 0x00000018: 0xe59ff01c
Value at address 0x0000001c: 0xe59ff01c
Value at address 0x00000020: 0xe59ff01c
Value at address 0x00000024: 0x0000fb74
Value at address 0x00000028: 0x0093ffbc
Value at address 0x0000002c: 0x0093ffc0
Value at address 0x00000030: 0x0093ffc4
Value at address 0x00000034: 0x0093ffc8
Value at address 0x00000038: 0x0093ffcc
Value at address 0x0000003c: 0x0093ffd0
Value at address 0x00000040: 0x0093ffd4
Value at address 0x00000044: 0x0093ffd8
Value at address 0x00000048: 0x00000013
Value at address 0x0000004c: 0x20296328 <== start of (c) string.
Value at address 0x00000050: 0x79706f43
Value at address 0x00000054: 0x68676972
Value at address 0x00000058: 0x30322074
Value at address 0x0000005c: 0x322d3730
Value at address 0x00000060: 0x2c323130
Value at address 0x00000064: 0x65724620
Value at address 0x00000068: 0x61637365
Value at address 0x0000006c: 0x5320656c
Value at address 0x00000070: 0x63696d65
Value at address 0x00000074: 0x75646e6f
Value at address 0x00000078: 0x726f7463
Value at address 0x0000007c: 0x6c41202e
Value at address 0x00000080: 0x6972206c

This matches the "corruption".  So, the FEC driver is DMAing from
physical address zero.  There's only two ways this can happen - either
if dma_map_single() returns zero, or if the ring already contains a
zero address.
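
For anyone comparing the two dumps by eye: devmem2 prints 32-bit words while
hd prints individual bytes, and the i.MX6 is running little-endian here, so
the word 0xe59ff01c shows up in the file as 1c f0 9f e5.  A minimal sketch of
that conversion follows (the helper name is made up for illustration, it is
not part of any tool used above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split a 32-bit word, as printed by devmem2, into the byte order that a
 * byte-wise dump of little-endian memory would show.
 * Hypothetical helper, for illustration only.
 */
static void word_to_le_bytes(uint32_t word, uint8_t out[4])
{
	assert(out != NULL);
	out[0] = (uint8_t)(word & 0xff);         /* lowest address */
	out[1] = (uint8_t)((word >> 8) & 0xff);
	out[2] = (uint8_t)((word >> 16) & 0xff);
	out[3] = (uint8_t)((word >> 24) & 0xff); /* highest address */
}
```

Feeding it 0x20296328 (the word at 0x4c) gives 28 63 29 20, which is the
"(c) " that starts the copyright string in the corrupted file.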

I've thrown into the FEC driver a load of WARN_ON_ONCE(addr == 0) after
_every_ dma_map_single(), and I also have pre-standing detection of
highmem pages in fec_enet_txq_submit_frag_skb().  None of this is firing.

We know that the ARM architecture can write to memory (which includes
memory allocated via dma_alloc_coherent()) with weak ordering, so to
rule that out, I tried adding a barrier between writing the address
and size, and writing the status field everywhere where we touch the
transmit ring.  That had no effect; I still see the corruption.
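
To illustrate the ordering that barrier is meant to enforce, here is a
user-space model only, not the driver code.  The struct and function names
are invented, and C11 release ordering stands in for the kernel barrier
(something like wmb()) that a real driver would need, since C11 atomics say
nothing about what a DMA engine observes:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TX_READY 0x8000u	/* bit 15: descriptor is owned by the device */

/* Simplified stand-in for a transmit buffer descriptor (assumed layout). */
struct model_bd {
	_Atomic uint16_t sc;	/* status/control word, polled by the device */
	uint16_t datlen;
	uint32_t bufaddr;
};

/*
 * Publish a descriptor: fill in address and length first, and only then
 * hand ownership over.  The release store keeps the two plain writes from
 * being reordered after the status update.
 */
static void model_submit(struct model_bd *bd, uint32_t addr, uint16_t len,
			 uint16_t flags)
{
	assert(bd != NULL);
	bd->bufaddr = addr;
	bd->datlen = len;
	atomic_store_explicit(&bd->sc, (uint16_t)(flags | TX_READY),
			      memory_order_release);
}
```

As said, getting this ordering right did not make the corruption go away,
so the ordering of the descriptor writes is not the cause.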

Next, I've tried setting free descriptors to have a physical address
of 0x40 - which should change the pattern of the corruption (in that
the "(c)" string should appear earlier in the corruption.)  The
reasoning is to prove whether the FEC TX DMA is reading from zero
because it's being instructed to in the transmit ring, or whether
there's something else going on.  The result is (in a different module):

00001000  61 6c 5f 72 65 67 69 73  74 65 72 5f 64 72 69 76  |al_register_driv|
00001010  65 72 73 00 00 00 00 00  00 00 00 00 00 00 00 00  |ers.............|
00001020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001030  00 00 00 00 1d 84 2e 2b  75 73 62 5f 73 65 72 69  |.......+usb_seri|
00001040  61 6c 5f 67 65 6e 65 72  69 63 5f 6f 70 65 6e 00  |al_generic_open.|
00001050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00001060  00 00 00 00 00 00 00 00  d4 ff 93 00 d8 ff 93 00  |................|
00001070  13 00 00 00 28 63 29 20  43 6f 70 79 72 69 67 68  |....(c) Copyrigh|
00001080  74 20 32 30 30 37 2d 32  30 31 32 2c 20 46 72 65  |t 2007-2012, Fre|
00001090  65 73 63 61 6c 65 20 53  65 6d 69 63 6f 6e 64 75  |escale Semicondu|
000010a0  63 74 6f 72 2e 20 41 6c  6c 20 72 69 67 68 74 73  |ctor. All rights|
000010b0  20 72 65 73 65 72 76 65  64 2e 00 00 dd 00 2c 41  | reserved.....,A|
000010c0  11 73 00 00 d3 74 00 00  3d 75 00 00 a9 78 00 00  |.s...t..=u...x..|
000010d0  4f 78 00 00 75 77 00 00  07 76 00 00 c3 79 00 00  |Ox..uw...v...y..|

Here, the bytes which start at 0x1068 match what's in the iMX6 at address
0x40, but the bytes which would've come before are missing.  So, the FEC
is transmitting from a descriptor in the ring which contains an address
which has been freed - and that's with my additional barriers in place
which should ensure that the descriptor gets the address and size updates
_before_ it sees that it owns the descriptor.
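
The offsets in both dumps can be cross-checked with simple arithmetic: if
the ROM contents start at some offset in the file, and they were read
starting at some ROM address, then the "(c)" string at ROM address 0x4c
must land at a predictable file offset.  A small sketch (the function name
is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the file offset where the bogus data starts and the source address
 * it was read from, return the file offset at which a given source address
 * should appear.  Hypothetical helper, for illustration only.
 */
static uint32_t expected_file_offset(uint32_t file_start, uint32_t src_start,
				     uint32_t src_addr)
{
	assert(src_addr >= src_start);
	return file_start + (src_addr - src_start);
}
```

With DMA from address 0 starting at file offset 0xba18 this gives 0xba64
for the "(c)" string, and with DMA from 0x40 starting at 0x1068 it gives
0x1074.  Both agree with the hex dumps above.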

The question is how, and this can be revealed by this bit of debugging in
fec_enet_tx_queue():

                for (i = 0; i < bdnum; i++) {
                        if (WARN_ON_ONCE(bdp->cbd_sc & BD_ENET_TX_READY))
                                fec_dump(ndev);
                        if (!IS_TSO_HEADER(txq, bdp->cbd_bufaddr))
                                dma_unmap_single(&fep->pdev->dev, bdp->cbd_bufaddr,
                                                 bdp->cbd_datlen, DMA_TO_DEVICE);
                        bdp->cbd_bufaddr = 0x40;
                        if (i < bdnum - 1)
                                bdp = fec_enet_get_nextdesc(bdp, fep, queue_id);
                }
                txq->tx_skbuff[index] = NULL;

Sure enough, that WARN_ON_ONCE() triggers - and we get _soo_ many dumps
of the transmit ring from fec_dump() that it shows that the driver is very
buggy here.  Here's an illustration of the ring state (un-wrapped to show
the order):

 460  H 0x1c00 0x00000040 1448   (null)
 461    0x1400 0x00000040   66   (null)
 462    0x1c00 0x391c0110 1448   (null)
 463    0x1400 0x8005e780   66   (null)
 464    0x1c00 0x3d120000 1448   (null)
 465    0x1400 0x8005e880   66   (null)
 466    0x1c00 0x391c0c60 1448   (null)
 467    0x1400 0x8005e980   66   (null)
 468    0x1c00 0x3d122000 1448   (null)
...
 508    0x1c00 0x3d136000 1448   (null)
 509    0x1400 0x8005fe80   66   (null)
 510    0x1c00 0x391c88d0 1448   (null)
 511    0x3400 0x8005ff80   66   (null)
   0    0x9c00 0x3d038000 1448   (null)
   1    0x9400 0x80050080   66   (null)
   2    0x9c00 0x391c9420 1448   (null)
   3    0x9400 0x80050180   66   (null)
...
  17    0x9400 0x80050880   66   (null)
  18    0x9c00 0x391cc160 1448   (null)
  19    0x9400 0x80050980   66   (null)
  20    0x9c00 0x3d042000  704 ec6106b0
  21 S  0x1c00 0x00000040 1448   (null)

We hit entry 462, and found that it was still owned by the FEC, so we
triggered the dump - during the time it took to produce the dump,
packets from 462 up to 511 were transmitted, clearing their ownership
bit (bit 15 of the first word.)
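
Reading the dump is just a matter of testing bit 15 of each status word.
A minimal sketch that counts the entries still owned by the FEC in a list
of status words (the names are invented for illustration; only the meaning
of bit 15 is taken from the description above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 15 of the status word: set while the FEC still owns the descriptor. */
#define TX_READY 0x8000u

/* Count how many status words in a ring dump are still device-owned. */
static size_t count_device_owned(const uint16_t *sc, size_t n)
{
	size_t i, owned = 0;

	assert(sc != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (sc[i] & TX_READY)
			owned++;
	return owned;
}
```

Applied to the excerpt above, 0x1c00, 0x1400 and 0x3400 count as
completed, while 0x9c00 and 0x9400 are still waiting to be transmitted.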

How can this happen?  Well, if we're submitted a _huge_ unfragmented
TSO skbuff, then:

                index = fec_enet_get_bd_index(txq->tx_bd_base, bdp_t, fep);
                skb = txq->tx_skbuff[index];
                while (!skb) {
                        bdp_t = fec_enet_get_nextdesc(bdp_t, fep, queue_id);
                        index = fec_enet_get_bd_index(txq->tx_bd_base, bdp_t, fep);
                        skb = txq->tx_skbuff[index];
                        bdnum++;
                }
                if (skb_shinfo(skb)->nr_frags &&
                    (status = bdp_t->cbd_sc) & BD_ENET_TX_READY)
                        break;

skb_shinfo(skb)->nr_frags will be zero, and so we won't test whether the
last entry in the submitted group has finished being transmitted - instead,
we will blindly continue on to the loop which frees all the descriptors,
trampling over those which haven't yet been transmitted.

I can see no reason to test for skb_shinfo(skb)->nr_frags here.  If the
last buffer descriptor for the submitted skb is marked as still being
busy, then the skb *can't* be reaped.  There's no question whether it's
a fragmented skb or not - that's completely irrelevant.
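
The difference between the two conditions can be modelled in user space.
This is a simplified sketch and not the driver: the helper names are
invented, and the "ring" is just an array of status words belonging to one
skb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_READY 0x8000u	/* descriptor still owned by the FEC */

/* Existing test: READY is only consulted when the skb has fragments. */
static bool must_wait_buggy(unsigned int nr_frags, uint16_t last_sc)
{
	return nr_frags && (last_sc & TX_READY);
}

/* Test without the nr_frags check: READY on the last descriptor decides. */
static bool must_wait_fixed(uint16_t last_sc)
{
	return (last_sc & TX_READY) != 0;
}

/*
 * Number of descriptors a reap would overwrite while the device still owns
 * them.  If we wait, nothing is touched.
 */
static size_t clobbered(const uint16_t *sc, size_t bdnum, bool wait)
{
	size_t i, n = 0;

	assert(sc != NULL || bdnum == 0);
	if (wait)
		return 0;
	for (i = 0; i < bdnum; i++)
		if (sc[i] & TX_READY)
			n++;
	return n;
}
```

For an unfragmented TSO skb (nr_frags == 0) whose descriptors are all
still READY, the existing test does not wait and every descriptor gets
clobbered; with the nr_frags check dropped, the reap stops as it should.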

In fact, with that bogus test removed, the corruption goes away.  Patch
below for others to test - once I've finished removing the rest of my
debugging, I'll send it properly.

(It should be noted that calling fec_dump() as per above is enough to make
the bug go away - because it delays overwriting the buffer address long
enough that the FEC can transmit the packets before we stomp over the
still-to-be-transmitted entries.)

Given that this bug can seriously screw data up in undetectable ways (TCP
checksums don't save you, because the FEC generates them on the data which
it read from memory, even if it happened to read the data from the SoC's
boot ROM) we do need to get this fixed ASAP.
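
To make the checksum point concrete, here is a sketch using the plain
Internet checksum (RFC 1071 style).  It is a simplification: real TCP
checksums also cover a pseudo-header, and the function name is made up.
The point is only that a checksum computed over the wrong bytes verifies
perfectly well at the receiver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Ones' complement checksum over a buffer, big-endian 16-bit words.
 * 'sum' is an initial value; pass 0 to generate a checksum, or pass the
 * received checksum to verify (a result of 0 means "valid").
 */
static uint16_t inet_csum(const uint8_t *p, size_t len, uint32_t sum)
{
	assert(p != NULL || len == 0);
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

If the stack meant to send eight zero bytes but the hardware read
1c f0 9f e5 1c f0 9f e5 instead, the checksum is generated over the latter,
and the receiver's verification over the same bytes succeeds.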

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index f9c0baea12ed..8bb2a811df3e 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1227,8 +1227,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id)
 			skb = txq->tx_skbuff[index];
 			bdnum++;
 		}
-		if (skb_shinfo(skb)->nr_frags &&
-		    (status = bdp_t->cbd_sc) & BD_ENET_TX_READY)
+		if ((status = bdp_t->cbd_sc) & BD_ENET_TX_READY)
 			break;
 
 		for (i = 0; i < bdnum; i++) {


-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-21 20:53             ` Russell King - ARM Linux
@ 2015-03-21 22:35               ` Fabio Estevam
  -1 siblings, 0 replies; 23+ messages in thread
From: Fabio Estevam @ 2015-03-21 22:35 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Панов
	Андрей,
	fugang.duan, netdev, linux-arm-kernel

Hi Russell,

On Sat, Mar 21, 2015 at 5:53 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:

> Given that this bug can seriously screw data up in undetectable ways (TCP
> checksums don't save you, because the FEC generates them on the data which
> it read from memory, even if it happened to read the data from the SoC's
> boot ROM) we do need to get this fixed ASAP.

Current mainline has 2b995f63987013 reverted, so 4.0-rc5 will not have
this corruption problem.

Regards,

Fabio Estevam

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-21 22:35               ` Fabio Estevam
@ 2015-03-22 20:08                 ` Панов Андрей
  -1 siblings, 0 replies; 23+ messages in thread
From: Панов Андрей @ 2015-03-22 20:08 UTC (permalink / raw)
  To: Fabio Estevam, Russell King - ARM Linux
  Cc: fugang.duan, netdev, linux-arm-kernel

Hi!

22.03.2015, 01:35, "Fabio Estevam" <festevam@gmail.com>:
> Hi Russell,
>
> On Sat, Mar 21, 2015 at 5:53 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
>>  Given that this bug can seriously screw data up in undetectable ways (TCP
>>  checksums don't save you, because the FEC generates them on the data which
>>  it read from memory, even if it happened to read the data from the SoC's
>>  boot ROM) we do need to get this fixed ASAP.
>
> Current mainline has 2b995f63987013 reverted, so 4.0-rc5 will not have
> this corruption problem.

I've tested both current mainline and mainline plus commit 2b995f63987013 with Russell's
fix; both work fine, without corruption.

--
 Андрей

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-21 22:35               ` Fabio Estevam
  (?)
  (?)
@ 2015-03-23  2:42               ` fugang.duan
  2015-03-23  8:22                   ` Панов Андрей
  -1 siblings, 1 reply; 23+ messages in thread
From: fugang.duan @ 2015-03-23  2:42 UTC (permalink / raw)
  To: Fabio Estevam, Russell King - ARM Linux
  Cc: Панов Андрей, netdev

From: Fabio Estevam <festevam@gmail.com> Sent: Sunday, March 22, 2015 6:36 AM
> To: Russell King - ARM Linux
> Cc: Панов Андрей; Duan Fugang-B38611; netdev@vger.kernel.org; linux-arm-kernel
> Subject: Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is
> broken. In 4.0.0-rc3
> 
> Hi Russell,
> 
> On Sat, Mar 21, 2015 at 5:53 PM, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
> 
> > Given that this bug can seriously screw data up in undetectable ways
> > (TCP checksums don't save you, because the FEC generates them on the
> > data which it read from memory, even if it happened to read the data
> > from the SoC's boot ROM) we do need to get this fixed ASAP.
> 
> Current mainline has 2b995f63987013 reverted, so 4.0-rc5 will not have
> this corruption problem.
> 
> Regards,
> 
> Fabio Estevam

We cannot simply revert commit 2b995f63987013; otherwise another issue is introduced. The correct fix is the one Russell King posted in the previous mail.
It is strange that I cannot reproduce the issue on the i.MX6Q SabreSD board. Anyway, we must also consider the TSO case, in which the skb is not fragmented.

Regards,
Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3
  2015-03-23  2:42               ` fugang.duan
@ 2015-03-23  8:22                   ` Панов Андрей
  0 siblings, 0 replies; 23+ messages in thread
From: Панов Андрей @ 2015-03-23  8:22 UTC (permalink / raw)
  To: fugang.duan, Fabio Estevam, Russell King - ARM Linux
  Cc: netdev, linux-arm-kernel



23.03.2015, 05:42, "fugang.duan@freescale.com" <fugang.duan@freescale.com>:
> From: Fabio Estevam <festevam@gmail.com> Sent: Sunday, March 22, 2015 6:36 AM
>>  To: Russell King - ARM Linux
>>  Cc: Панов Андрей; Duan Fugang-B38611; netdev@vger.kernel.org; linux-arm-kernel
>>  Subject: Re: Bug in drivers/net/ethernet/freescale/fec_main.c, TX is
>>  broken. In 4.0.0-rc3
>>
>>  Hi Russell,
>>
>>  On Sat, Mar 21, 2015 at 5:53 PM, Russell King - ARM Linux
>>  <linux@arm.linux.org.uk> wrote:
>>>  Given that this bug can seriously screw data up in undetectable ways
>>>  (TCP checksums don't save you, because the FEC generates them on the
>>>  data which it read from memory, even if it happened to read the data
>>>  from the SoC's boot ROM) we do need to get this fixed ASAP.
>>  Current mainline has 2b995f63987013 reverted, so 4.0-rc5 will not have
>>  this corruption problem.
>>
>>  Regards,
>>
>>  Fabio Estevam
>
> We cannot simply revert commit 2b995f63987013; otherwise another issue is introduced. The correct fix is the one Russell King posted in the previous mail.
> It is strange that I cannot reproduce the issue on the i.MX6Q SabreSD board. Anyway, we must also consider the TSO case, in which the skb is not fragmented.

It is just a DMA_API_DEBUG=y error versus a severe data corruption error. DMA_API_DEBUG can be wrong too.
Did you run your check with that option enabled? It can slow the kernel down enough that the data is actually written to the network before the code in that commit frees the not-yet-sent data blocks.
I always have it disabled.

You can also check it by compiling a kernel over NFS, doing big git merges over NFS, doing a big FTP transfer, etc.

--
 Андрей
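
The check described in this thread (transfer a large all-zero file, then compare it on the receiving host) can be sketched in Python as below. This is only an illustrative sketch, not something posted by any participant: the helper names `write_zeros`, `md5_of` and `first_nonzero_offset` are hypothetical, and the 20 MiB default simply mirrors the `dd ... bs=1M count=20` command from the original report. Run `write_zeros` on the sending board, copy the file over NFS or FTP, then run the other two helpers on the receiving host.

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB, mirroring bs=1M from the original report


def write_zeros(path, mib=20):
    """Create an all-zero file of `mib` MiB (like the dd command in the report)."""
    with open(path, "wb") as f:
        for _ in range(mib):
            f.write(bytes(CHUNK))


def md5_of(path):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def first_nonzero_offset(path):
    """Return the byte offset of the first non-zero byte, or None if all zero."""
    offset = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            stripped = block.lstrip(b"\x00")
            if stripped:
                # Leading zeros removed == position of first bad byte in block.
                return offset + len(block) - len(stripped)
            offset += len(block)
    return None
```

Comparing `md5_of()` on both hosts shows whether the copy was corrupted at all; `first_nonzero_offset()` on the received copy shows where the first corrupted byte landed, which may help correlate the damage with packet or page boundaries.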

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-03-23  8:22 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-09 19:50 Bug in drivers/net/ethernet/freescale/fec_main.c, TX is broken. In 4.0.0-rc3 Панов Андрей
2015-03-10 20:12 ` Панов Андрей
2015-03-10 20:12   ` Панов Андрей
2015-03-16  9:21   ` fugang.duan
2015-03-16  9:21     ` fugang.duan at freescale.com
2015-03-16 13:37     ` Панов Андрей
2015-03-16 13:37       ` Панов Андрей
2015-03-16 14:01       ` fugang.duan
2015-03-16 14:01         ` fugang.duan at freescale.com
2015-03-16 19:09         ` Панов Андрей
2015-03-17  1:49           ` fugang.duan
2015-03-21 20:53           ` Russell King - ARM Linux
2015-03-21 20:53             ` Russell King - ARM Linux
2015-03-21 22:35             ` Fabio Estevam
2015-03-21 22:35               ` Fabio Estevam
2015-03-22 20:08               ` Панов Андрей
2015-03-22 20:08                 ` Панов Андрей
2015-03-23  2:42               ` fugang.duan
2015-03-23  8:22                 ` Панов Андрей
2015-03-23  8:22                   ` Панов Андрей
2015-03-16 19:09         ` Панов Андрей
2015-03-13 13:48 ` Ben Hutchings
2015-03-13 16:41   ` David Miller
