* MPC5200B, many FEC_IEVENT_RFIFO_ERRORs until link down
@ 2010-03-30 10:00 Roman Fietze
  2010-03-30 10:46 ` Wolfgang Grandegger
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Fietze @ 2010-03-30 10:00 UTC (permalink / raw)
  To: linuxppc-dev

Hello,

I think this is a never ending story. This error still happens under
higher load every few seconds, until I get a "PHY: f0003000:00 - Link
is Down", easily reproducible on my box after maybe 15 to 30 seconds.
I can recover using "ip link set down/up dev eth0".
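
In case it helps anybody: the crude userspace watchdog I use to
automate that recovery looks roughly like this (eth0, the sysfs path
and the poll interval are from my setup; a sketch, not a fix):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char state[32];

	for (;;) {
		/* operstate reads "up" or "down" depending on the link */
		FILE *f = fopen("/sys/class/net/eth0/operstate", "r");

		if (f && fgets(state, sizeof(state), f) &&
		    strncmp(state, "down", 4) == 0) {
			/* same recovery as done by hand */
			system("ip link set dev eth0 down");
			system("ip link set dev eth0 up");
			sleep(5);	/* give autonegotiation some time */
		}
		if (f)
			fclose(f);
		sleep(1);
	}
	return 0;
}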

I double checked that I'm using the most recent version of this driver
(checked against DENX, benh master/next, using Wolfgang Denk's version
of 2.6.33). This includes the locking patches from Asier Llano and, of
course, the hard setting of mii_speed in the PHY mdio transfer routine.
I tried all 8 combinations of PLDIS, BSDIS and SE, with and without
CONFIG_NOT_COHERENT_CACHE.
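
For reference, the "hard setting" of mii_speed boils down to
rewriting the MII speed register on every MDIO transaction, because a
FEC reset clears it. From memory the transfer routine looks roughly
like this (struct mpc52xx_fec, struct mpc52xx_fec_mdio_priv and the
FEC_MII_* masks come from the driver sources; priv->mdio_speed is
precomputed from the IPB clock at probe time; a sketch, not the exact
driver code):

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/phy.h>
#include <asm/io.h>

static int mpc52xx_fec_mdio_transfer(struct mii_bus *bus, int phy_id,
				     int reg, u32 value)
{
	struct mpc52xx_fec_mdio_priv *priv = bus->priv;
	struct mpc52xx_fec __iomem *fec = priv->regs;
	int tries = 3;

	value |= (phy_id << FEC_MII_DATA_PA_SHIFT) & FEC_MII_DATA_PA_MSK;
	value |= (reg << FEC_MII_DATA_RA_SHIFT) & FEC_MII_DATA_RA_MSK;

	out_be32(&fec->ievent, FEC_IEVENT_MII);
	/* the "hard set": a FEC reset may have cleared mii_speed, so
	 * write it again before every transfer */
	out_be32(&fec->mii_speed, priv->mdio_speed);
	out_be32(&fec->mii_data, value);

	/* wait for the MII interrupt flag, with a short timeout */
	while (!(in_be32(&fec->ievent) & FEC_IEVENT_MII) && --tries)
		msleep(1);

	if (!tries)
		return -ETIMEDOUT;

	return (value & FEC_MII_DATA_OP_RD) ?
		in_be32(&fec->mii_data) & FEC_MII_DATA_DATA_MSK : 0;
}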

As some of you probably remember, I'm running this controller under
high load on FEC, ATA and LPC. As soon as the load goes above a
certain level I get those FEC RFIFO errors, sometimes ATA errors
(MWDMA2) and sometimes even lost SDMA interrupts using BestComm with
the SCLPC (now switched back to simple PIO). I'm quite sure almost all
of this is BestComm's fault.

Has somebody already tried the latest NAPI patches, which might give
me a slight chance of a workaround? Any idea or upcoming patch to
address this problem, even if it just recovers, e.g. within
mpc52xx_fec_mdio_transfer's timeout, using some other dirty
workaround?


Roman

-- 
Roman Fietze                Telemotive AG Büro Mühlhausen
Breitwiesen                              73347 Mühlhausen
Tel.: +49(0)7335/18493-45        http://www.telemotive.de


* Re: MPC5200B, many FEC_IEVENT_RFIFO_ERRORs until link down
  2010-03-30 10:00 MPC5200B, many FEC_IEVENT_RFIFO_ERRORs until link down Roman Fietze
@ 2010-03-30 10:46 ` Wolfgang Grandegger
  2010-03-31 10:15   ` Wolfgang Grandegger
  0 siblings, 1 reply; 5+ messages in thread
From: Wolfgang Grandegger @ 2010-03-30 10:46 UTC (permalink / raw)
  To: Roman Fietze; +Cc: linuxppc-dev

Roman Fietze wrote:
> Hello,
> 
> I think this is a never ending story. This error still happens under
> higher load every few seconds, until I get a "PHY: f0003000:00 - Link
> is Down", easily reproducible on my box after maybe 15 to 30 seconds.
> I can recover using "ip link set down/up dev eth0".
> 
> I double checked that I'm using the most recent version of this driver
> (checked against DENX, benh master/next, using Wolfgang Denk's version
> of 2.6.33). This includes the locking patches from Asier Llano and, of
> course, the hard setting of mii_speed in the PHY mdio transfer routine.
> I tried all 8 combinations of PLDIS, BSDIS and SE, with and without
> CONFIG_NOT_COHERENT_CACHE.
> 
> As some of you probably remember, I'm running this controller under
> high load on FEC, ATA and LPC. As soon as the load goes above a
> certain level I get those FEC RFIFO errors, sometimes ATA errors
> (MWDMA2) and sometimes even lost SDMA interrupts using BestComm with
> the SCLPC (now switched back to simple PIO). I'm quite sure almost all
> of this is BestComm's fault.

This problem shows up quickly with NAPI, but I have never observed it
with the current version. The error occurs when the software is not
able to read out the messages in time. Unfortunately, dealing with
BestComm is a pain.
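
To recap what the NAPI conversion changes: the RX interrupt only
masks itself and schedules a poll, and the receive ring is then
drained with a budget from softirq context. The skeleton is the usual
one (a sketch with made-up helper names, not the actual patch):

#include <linux/interrupt.h>
#include <linux/netdevice.h>

#define FEC_NAPI_WEIGHT	64	/* RX budget per poll round */

struct fec_priv {
	struct napi_struct napi;
	/* ... registers, BestComm task handles, etc. ... */
};

/* hypothetical helpers standing in for the real register accesses */
static void fec_rx_irq_disable(struct fec_priv *priv);
static void fec_rx_irq_enable(struct fec_priv *priv);
static int fec_rx_process(struct fec_priv *priv, int budget);

static irqreturn_t fec_rx_interrupt(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct fec_priv *priv = netdev_priv(dev);

	/* stop RX interrupts; the poll routine drains the ring */
	fec_rx_irq_disable(priv);
	napi_schedule(&priv->napi);
	return IRQ_HANDLED;
}

static int fec_napi_poll(struct napi_struct *napi, int budget)
{
	struct fec_priv *priv = container_of(napi, struct fec_priv, napi);
	int done = fec_rx_process(priv, budget);

	if (done < budget) {
		/* ring is empty: back to interrupt mode */
		napi_complete(napi);
		fec_rx_irq_enable(priv);
	}
	return done;
}

/* probe time: netif_napi_add(dev, &priv->napi, fec_napi_poll,
 * FEC_NAPI_WEIGHT); open time: napi_enable(&priv->napi) */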

> Has somebody already tried the latest NAPI patches, which might give
> me a slight chance of a workaround? Any idea or upcoming patch to
> address this problem, even if it just recovers, e.g. within
> mpc52xx_fec_mdio_transfer's timeout, using some other dirty
> workaround?

Yes, I have a NAPI version ready for testing. I will roll it out as RFC
today or tomorrow.

Wolfgang.


* Re: MPC5200B, many FEC_IEVENT_RFIFO_ERRORs until link down
  2010-03-30 10:46 ` Wolfgang Grandegger
@ 2010-03-31 10:15   ` Wolfgang Grandegger
  2010-04-01 10:04     ` Roman Fietze
  0 siblings, 1 reply; 5+ messages in thread
From: Wolfgang Grandegger @ 2010-03-31 10:15 UTC (permalink / raw)
  To: Roman Fietze; +Cc: linuxppc-dev

Hi Roman,

Wolfgang Grandegger wrote:
> Roman Fietze wrote:
>> Hello,
>>
>> I think this is a never ending story. This error still happens under
>> higher load every few seconds, until I get a "PHY: f0003000:00 - Link
>> is Down", easily reproducible on my box after maybe 15 to 30 seconds.
>> I can recover using "ip link set down/up dev eth0".
>>
>> I double checked that I'm using the most recent version of this driver
>> (checked against DENX, benh master/next, using Wolfgang Denk's version
>> of 2.6.33). This includes the locking patches from Asier Llano and, of
>> course, the hard setting of mii_speed in the PHY mdio transfer routine.
>> I tried all 8 combinations of PLDIS, BSDIS and SE, with and without
>> CONFIG_NOT_COHERENT_CACHE.
>>
>> As some of you probably remember, I'm running this controller under
>> high load on FEC, ATA and LPC. As soon as the load goes above a
>> certain level I get those FEC RFIFO errors, sometimes ATA errors
>> (MWDMA2) and sometimes even lost SDMA interrupts using BestComm with
>> the SCLPC (now switched back to simple PIO). I'm quite sure almost all
>> of this is BestComm's fault.
> 
> This problem shows up quickly with NAPI, but I have never observed it
> with the current version. The error occurs when the software is not
> able to read out the messages in time. Unfortunately, dealing with
> BestComm is a pain.
> 
>> Has somebody already tried the latest NAPI patches, which might give
>> me a slight chance of a workaround? Any idea or upcoming patch to
>> address this problem, even if it just recovers, e.g. within
>> mpc52xx_fec_mdio_transfer's timeout, using some other dirty
>> workaround?
> 
> Yes, I have a NAPI version ready for testing. I will roll it out as RFC
> today or tomorrow.

I just sent out the patch. Would be nice if you, or somebody else, could
do some testing and provide some feedback. FYI, I will be out of office
next week.

Thanks,

Wolfgang.


* Re: MPC5200B, many FEC_IEVENT_RFIFO_ERRORs until link down
  2010-03-31 10:15   ` Wolfgang Grandegger
@ 2010-04-01 10:04     ` Roman Fietze
  2010-04-05 12:21       ` Wolfgang Grandegger
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Fietze @ 2010-04-01 10:04 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: linuxppc-dev

Hello Wolfgang,

On Wednesday 31 March 2010 12:15:47 Wolfgang Grandegger wrote:

> I just sent out the patch.

Thanks a lot.

> Would be nice if you, or somebody else, could do some testing and
> provide some feedback.

I tested the patches with the following setup:

- DENX 2.6.33 plus NAPI patch, kernel config with and w/o NAPI enabled

- Own Icecube based board using MPC5200B

- Two different hard drives (because the Toshiba gave me headaches),
  ext3 default settings of mkfs.ext3, MWDMA2

- FPGA on LPC receiving high bandwidth MOST150 data in PIO mode (for
  the test: generating them internally), small app writing the data to
  disk. Why PIO? The SCLPC FIFO gave me the lost SDMA interrupts
  mentioned earlier in this thread.

- netcat receiving data, optionally writing the data to HD; the
  sender is a Gigabit Intel NIC fed using netcat (and /dev/zero) as
  well, via a 100MBit/s switch

And now the first and preliminary results of the tests (see legend and
description of the results below the table):

NAPI	MOST	HD		   load		bw	rx_irq		rfifo
------+-------+---------------+---------------+-------+---------------+-------
				nc	most
======+=======+===============+===============+=======+===============+=======

on	off	MK4036GA	93		5.15	32000-35000
		-		99		10.5	72000-74000

	on	MK4036GA	49	46	crash	15000-17500	none seen
	on	HEJ421010G9AT00	48	47		15000-17500	~100-500, recovers

------+-------+---------------+-----------------------+-------+---------------+-------

off	off	MK4036GA	90		5.15	34000-36000
		-		99		10.5	76000-77000

	on	MK4036GA	48	47	crash	17500-19000	~200, network down

Legend:
-------

MOST:		PIO mode access to FPGA receiving generated MOST150 data,
		very high data rates possible
HD:		used disk type
load/nc:	CPU load of netcat, %
load/most:	CPU load of MOST receiver app, %
load/idle:	idle load, was always 0%
bw:		netcat network bandwidth, MB/s
rx_irq:		FEC RX IRQ rate, Hz
rfifo:		RX FIFO errors, interval between errors in seconds

Results:
--------

Using the MK4036GA HD always crashes IDE after a few seconds. A reboot
does not recover the disk; I always need a power cycle. That's why I
switched to a HEJ421010G9AT00.

NAPI reduces the FEC RX interrupt rate (/proc/interrupts) "somewhat".
I could not detect an increase in the maximum bandwidth, but that's
not NAPI's job anyway.
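
FWIW, the rx_irq numbers were taken with a trivial sampler along
these lines (the "fec" match string depends on how the IRQ shows up
in /proc/interrupts on your box; a sketch, not a polished tool):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* return the counter of the first /proc/interrupts line whose name
 * contains "fec" (first CPU column only; the MPC5200B is UP anyway) */
static unsigned long fec_irq_count(void)
{
	char line[256];
	unsigned long count = 0;
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "fec")) {
			sscanf(line, " %*[^:]: %lu", &count);
			break;
		}
	}
	fclose(f);
	return count;
}

int main(void)
{
	unsigned long prev = fec_irq_count();

	for (;;) {
		unsigned long now;

		sleep(1);
		now = fec_irq_count();
		printf("FEC IRQ rate: %lu Hz\n", now - prev);
		prev = now;
	}
	return 0;
}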

NAPI recovers more or less gracefully from link down (link down to up
takes about 1 second); without NAPI I have to do that manually (e.g.
ip link set down/up). That's something I have been looking for since
the switch to the modular PHY drivers.

Some network applications (e.g. our Car Head Unit GN Protocol Logger)
drop their connections when the link goes down (due to internal
timeouts? probably fixable). Ssh and netcat connections stay up.

Transferred many GiB of data to the MPC w/o any problems except those
recoverable FEC_IEVENT_RFIFO_ERRORs.

This patch really looks good to me.

I will run some additional tests e.g. with mixed RX and TX, different
and varying data rates, etc.


> FYI, I will be out of office next week.

Lucky Guy


Roman

-- 
Roman Fietze                Telemotive AG Büro Mühlhausen
Breitwiesen                              73347 Mühlhausen
Tel.: +49(0)7335/18493-45        http://www.telemotive.de


* Re: MPC5200B, many FEC_IEVENT_RFIFO_ERRORs until link down
  2010-04-01 10:04     ` Roman Fietze
@ 2010-04-05 12:21       ` Wolfgang Grandegger
  0 siblings, 0 replies; 5+ messages in thread
From: Wolfgang Grandegger @ 2010-04-05 12:21 UTC (permalink / raw)
  To: Roman Fietze; +Cc: linuxppc-dev

Hello Roman,

Roman Fietze wrote:
> Hello Wolfgang,
> 
> On Wednesday 31 March 2010 12:15:47 Wolfgang Grandegger wrote:
> 
>> I just sent out the patch.
> 
> Thanks a lot.
> 
>> Would be nice if you, or somebody else, could do some testing and
>> provide some feedback.
> 
> I tested the patches with the following setup:
> 
> - DENX 2.6.33 plus NAPI patch, kernel config with and w/o NAPI enabled
> 
> - Own Icecube based board using MPC5200B
> 
> - Two different hard drives (because the Toshiba gave me headaches),
>   ext3 default settings of mkfs.ext3, MWDMA2
> 
> - FPGA on LPC receiving high bandwidth MOST150 data in PIO mode (for
>   the test: generating them internally), small app writing the data to
>   disk. Why PIO? The SCLPC FIFO gave me the lost SDMA interrupts
>   mentioned earlier in this thread.
> 
> - netcat receiving data, optionally writing the data to HD; the
>   sender is a Gigabit Intel NIC fed using netcat (and /dev/zero) as
>   well, via a 100MBit/s switch
> 
> 
> And now the first and preliminary results of the tests (see legend and
> description of the results below the table):
> 
> NAPI	MOST	HD		   load		bw	rx_irq		rfifo
> ------+-------+---------------+---------------+-------+---------------+-------
> 				nc	most
> ======+=======+===============+===============+=======+===============+=======
> 
> on	off	MK4036GA	93		5.15	32000-35000
> 		-		99		10.5	72000-74000
> 
> 	on	MK4036GA	49	46	crash	15000-17500	none seen
> 	on	HEJ421010G9AT00	48	47		15000-17500	~100-500, recovers
> 
> ------+-------+---------------+-----------------------+-------+---------------+-------
> 
> off	off	MK4036GA	90		5.15	34000-36000
> 		-		99		10.5	76000-77000
> 
> 	on	MK4036GA	48	47	crash	17500-19000	~200, network down
> 
> Legend:
> -------
> 
> MOST:		PIO mode access to FPGA receiving generated MOST150 data,
> 		very high data rates possible
> HD:		used disk type
> load/nc:	CPU load of netcat, %
> load/most:	CPU load of MOST receiver app, %
> load/idle:	idle load, was always 0%
> bw:		netcat network bandwidth, MB/s
> rx_irq:		FEC RX IRQ rate, Hz
> rfifo:		RX FIFO errors, interval between errors in seconds
> 
> Results:
> --------
> 
> Using the MK4036GA HD always crashes IDE after a few seconds. A reboot
> does not recover the disk; I always need a power cycle. That's why I
> switched to a HEJ421010G9AT00.

That might be a different issue.

> NAPI reduces the FEC RX interrupt rate (/proc/interrupts) "somewhat".
> I could not detect an increase in the maximum bandwidth, but that's
> not NAPI's job anyway.

I observed the same behavior.

> NAPI recovers more or less gracefully from link down (link down to up
> takes about 1 second); without NAPI I have to do that manually (e.g.
> ip link set down/up). That's something I have been looking for since
> the switch to the modular PHY drivers.

Recovering from link down takes a while, unfortunately. I also do not
yet have an explanation why the link goes down at all. The rather long
recovery time does no harm in my use case of "heavy packet storms",
but it is rather annoying at high but not yet critical traffic levels.

> Some network applications (e.g. our Car Head Unit GN Protocol Logger)
> drop their connections when the link goes down (due to internal
> timeouts? probably fixable). Ssh and netcat connections stay up.

That's a problem due to the overloaded network.

> Transferred many GiB of data to the MPC w/o any problems except those
> recoverable FEC_IEVENT_RFIFO_ERRORs.
> 
> This patch really looks good to me.

Doesn't sound too bad...

> I will run some additional tests e.g. with mixed RX and TX, different
> and varying data rates, etc.

OK.

Wolfgang.

