linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
       [not found] ` <20020518040143.GA9318@k3.hellgate.ch>
@ 2002-05-17 23:13   ` Ivan G.
  2002-05-18 19:11     ` 'Roger Luethi'
  0 siblings, 1 reply; 15+ messages in thread
From: Ivan G. @ 2002-05-17 23:13 UTC (permalink / raw)
  To: 'Roger Luethi'; +Cc: LKML


> [1] "aborted due to excessive collisions" according to the doc, but it also
>     mentions that for "excessive collisions", bit 13 would have to be set.
>     It isn't.
>     (I have seen it on my VT6102, though, with interrupt status of 0x2008
>     for instance)

bit 13 = IntrTxAborted
I don't see it below.

/* Enable interrupts by setting the interrupt mask. */
writew(IntrRxDone | IntrRxErr | IntrRxEmpty| IntrRxOverflow| IntrRxDropped|
   IntrTxDone | IntrTxAbort | IntrTxUnderrun | IntrPCIErr | IntrStatsMax | 
IntrLinkChange | IntrMIIChange, ioaddr + IntrEnable);

Interrupts referenced in the driver and not listed here are: IntrRxNoBuf,
IntrRxWakeUp, IntrTxAborted

Interrupts included here but not used in the driver are:
IntrRxOverflow, IntrRxDropped.

I've known about this for a while but I wasn't sure how to fix everything...
I wasn't sure if every interrupt was handled correctly.
For example what exactly is the difference between IntrTxAbort and 
IntrTxAborted. 

I was also puzzled as to why the docs say:
Transmit Descriptor Underflow for IntrMIIChange

I am talking about the newest VT86C100A docs.
I believe Urban Widmark had a patch that redefined that interrupt but it was 
never included.








^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-17 23:13   ` [PATCH] #2 VIA Rhine stalls: TxAbort handling Ivan G.
@ 2002-05-18 19:11     ` 'Roger Luethi'
  0 siblings, 0 replies; 15+ messages in thread
From: 'Roger Luethi' @ 2002-05-18 19:11 UTC (permalink / raw)
  To: Ivan G.; +Cc: LKML

On Fri, 17 May 2002 17:13:33 -0600, Ivan G. wrote:
> 
> > [1] "aborted due to excessive collisions" according to the doc, but it also
> >     mentions that for "excessive collisions", bit 13 would have to be set.
> >     It isn't.
> >     (I have seen it on my VT6102, though, with interrupt status of 0x2008
> >     for instance)
> 
> bit 13 = IntrTxAborted
> I don't see it below.
> 
> /* Enable interrupts by setting the interrupt mask. */
> writew(IntrRxDone | IntrRxErr | IntrRxEmpty| IntrRxOverflow| IntrRxDropped|
>    IntrTxDone | IntrTxAbort | IntrTxUnderrun | IntrPCIErr | IntrStatsMax | 
> IntrLinkChange | IntrMIIChange, ioaddr + IntrEnable);

Hey, somebody's paying attention here! I'm impressed ;-).

We don't need to enable an interrupt in order to have the status bit set by
the chip. Enabling an interrupt basically means that the ISR gets called
when the the status flag goes up.

To put it differently: if you don't enable both TxAbort or TxAborted, you
get to handle those events whenever the ISR is active anyway. So either you
get, say 0x2009 (errors + rxdone) or you enter the ISR with 0x0001 (rxdone)
and while you're doing your first round the chip updates the status
register and next time you hit

while ((intr_status = readw(ioaddr + IntrStatus))) {

you find it's 0x2008 now. No matter what interrupts you enable: as long as
you receive a steady flow of interrupts of any kind, the driver should
still work. Sort of.

Not enabling the TxAborted interrupt is easy to justify: the chip just
failed to send out a frame due to massive collisions. Therefore, we are not
in a hurry to be notified and restart Tx, and as we learned recently it
also takes the chip some time to halt the Tx engine.

Of course that's all a moot point if TxAborted always implies TxAbort:
_that_ interrupt is enabled. I think I'm going to enable TxAborted
regardless. It shouldn't make a difference, and if it did, we're better off
having it enabled.

The contrast between interrupts enabled and handled in the current LK
driver is not pretty, but for most of them it's probably rather a clean-up
than a bug fixing issue if anything. I am currently focusing on what I
suspect to be a problem, so my comments below are mostly based on reading
src and docs, not actual experiments.

> Interrupts referenced in the driver and not listed here are: IntrRxNoBuf,
> IntrRxWakeUp, IntrTxAborted

IMO IntrRxNoBuf should be enabled. If there are no Rx buffers left, we want
to know ASAP and free them.

IntrRxWakeUp has been used to indicate that a magic wake up packet was
received (hence the name). On the VT6102, the same bit is now a "general
purpose interrupt". Whatever that means. Proper support for magic packets
might be desirable later on. Not a problem now.

> Interrupts included here but not used in the driver are:
> IntrRxOverflow, IntrRxDropped.

Since we have problems with Tx, I don't care much about Rx right now. May
be a mistake ;-). For IntrRxDropped, though, we call via_rhine_rx(). We do
use it.

> For example what exactly is the difference between IntrTxAbort and 
> IntrTxAborted. 

Depends on the docs you read. Currently I tend to think that IntrTxAbort is
a misnomer and I have renamed it IntrTxError in my code.

> I was also puzzled as to why the docs say:
> Transmit Descriptor Underflow for IntrMIIChange
> 
> I am talking about the newest VT86C100A docs.

I agree, and I am talking about three different docs. I'd be very curious
where this is coming from, I can't imagine that somebody just pulled it out
of thin air. The name IntrMIIChange and its use in the driver suggest that
the 0x0200 interrupt announces changes in the MII status register.
According to all docs I've checked 0x0200 means a Tx underflow related to a
descriptor (as opposed to 0x0010 which is a Tx underflow related to a
frame).

But even if these suspicions are correct, the impact seems rather small.
The driver takes action based on the MII status register anyway, so worst
case is probably we ignore Tx descriptor underflows (which I don't expect
to be very common).

I'm not sure if it was a good idea to take this back to LKML at this point.
I doubt that the greater public is interested in the gory details of VIA
Rhine programming. I suggest we keep the traffic among the interested
parties and send a patch along with a summary once we have come to a
conclusion or a patch needs wider testing. Comments?

Roger

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-17 18:46 Manfred Spraul
  2002-05-17 19:56 ` 'Roger Luethi'
@ 2002-05-18 10:08 ` David Woodhouse
  1 sibling, 0 replies; 15+ messages in thread
From: David Woodhouse @ 2002-05-18 10:08 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: 'Roger Luethi', linux-kernel, Shing Chuang


manfred@colorfullife.com said:
> IIRC all register reads from the addresses that belong to a pulled out
>  PCMCIA card return 0xFFFFFFFF ;-) 

Not necessarily. It can lock up the bus and require a hard reset.

That's why they make some pins longer than others -- so you can get an
interrupt which lets you know it's going away a split second before it's
actually gone, and you have time to remove power from the socket and call
some kind of abort method on the driver so it knows it must never touch
the hardware again.

--
dwmw2



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-17 18:46 Manfred Spraul
@ 2002-05-17 19:56 ` 'Roger Luethi'
  2002-05-18 10:08 ` David Woodhouse
  1 sibling, 0 replies; 15+ messages in thread
From: 'Roger Luethi' @ 2002-05-17 19:56 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: linux-kernel, Shing Chuang

> >>do {} while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));
> >
> >The driver "waits a little" in the interrupt handler? How long can that
> >take, worst case? I don't know of many places where the kernel stops to
> >wait for an external device to change some value.
> 
> It's not that uncommon: Most network drivers busy-wait after stopping 
> the tx process during netif_close().

Yeah, but they don't depend on the chip to behave. They will break out at
some point. That's the issue several people pointed out.

> Shing, I don't like the empty body of the while loop. It's not a bug, 
> but doesn't that generate a large load on the pci bus?
> 
> I've always added an udelay(1), i.e. wait one microsecond, into such loops.

Sounds reasonable. I will add that later. Currently I have no delay,
either, because I want to collect some numbers about how long we expect to
wait.

Roger

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
@ 2002-05-17 18:46 Manfred Spraul
  2002-05-17 19:56 ` 'Roger Luethi'
  2002-05-18 10:08 ` David Woodhouse
  0 siblings, 2 replies; 15+ messages in thread
From: Manfred Spraul @ 2002-05-17 18:46 UTC (permalink / raw)
  To: 'Roger Luethi', linux-kernel, Shing Chuang

>> All the three conditions caused the TXON bit of CR1 went off. the
>> driver must wait  a little while until the bit go off, reset the pointer of
>> [...]
>> do {} while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));
> 
> The driver "waits a little" in the interrupt handler? How long can that
> take, worst case? I don't know of many places where the kernel stops to
> wait for an external device to change some value.
> 

It's not that uncommon: Most network drivers busy-wait after stopping 
the tx process during netif_close().

But I would add a maximum timeout and a printk - just to avoid 
unexplainable system hangs. One example would be natsemi_stop_rxtx() in 
drivers/net/natsemi.c.
IIRC all register reads from the addresses that belong to a pulled out 
PCMCIA card return 0xFFFFFFFF ;-)

Shing, I don't like the empty body of the while loop. It's not a bug, 
but doesn't that generate a large load on the pci bus?

I've always added an udelay(1), i.e. wait one microsecond, into such loops.

--
	Manfred


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-17 12:51           ` Richard B. Johnson
@ 2002-05-17 16:25             ` 'Roger Luethi'
  0 siblings, 0 replies; 15+ messages in thread
From: 'Roger Luethi' @ 2002-05-17 16:25 UTC (permalink / raw)
  To: Richard B. Johnson; +Cc: linux-kernel

> Well, maybe the fashion of the day. Do `grep karound *.c` in
> ../linux/drivers/net and see all the 'workarounds' that exist for
> chip problems. Some of the problems are induced by the coding and
> most are real hardware problems.

Nobody's debating the need for workarounds. I just prefer to look for a
more subtle method before taking out the sledge-hammer several times a
second(!) to reprogram the chip from scratch.

Roger

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-17  0:16         ` 'Roger Luethi'
@ 2002-05-17 12:51           ` Richard B. Johnson
  2002-05-17 16:25             ` 'Roger Luethi'
  0 siblings, 1 reply; 15+ messages in thread
From: Richard B. Johnson @ 2002-05-17 12:51 UTC (permalink / raw)
  To: 'Roger Luethi'
  Cc: Shing Chuang, Urban Widmark, Ivan G.,
	Jeff Garzik, linux-kernel, AJ Jiang

On Fri, 17 May 2002, 'Roger Luethi' wrote:

> > I think one has to <somehow> find that the chip has halted besides
> > the current way (noticing that it can't transmit anymore). I don't
> 
> There seems to be a misunderstanding. We already get an interrupt and a
> status to indicate what kind of problem occured. Thanks to Shing's recent
> posting we even have confirmed information about what events stop the Tx
> engine. _Plus_ there is a bit flag TXON in a chip status register which
> indicates whether the Tx engine is active.
>
[SNIPPED...]
> 
> > In the chip-halted work-around that everybody seems to use now,
> > reprogram it from scratch. The last program operation being to remove
> > loop-back. I don't even know if this chip can be set to loop-back,
> > though, so the whole idea may be moot.
> 
> It can be set to loopback, but I'm not keen on having my chip reprogrammed
> by every traffic burst (excessive collisions -> abort). Is that really the
> fashion of the year now?

Well, maybe the fashion of the day. Do `grep karound *.c` in
../linux/drivers/net and see all the 'workarounds' that exist for
chip problems. Some of the problems are induced by the coding and
most are real hardware problems.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-16 21:05       ` Richard B. Johnson
@ 2002-05-17  0:16         ` 'Roger Luethi'
  2002-05-17 12:51           ` Richard B. Johnson
  0 siblings, 1 reply; 15+ messages in thread
From: 'Roger Luethi' @ 2002-05-17  0:16 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Shing Chuang, Urban Widmark, Ivan G.,
	Jeff Garzik, linux-kernel, AJ Jiang

> I think one has to <somehow> find that the chip has halted besides
> the current way (noticing that it can't transmit anymore). I don't

There seems to be a misunderstanding. We already get an interrupt and a
status to indicate what kind of problem occured. Thanks to Shing's recent
posting we even have confirmed information about what events stop the Tx
engine. _Plus_ there is a bit flag TXON in a chip status register which
indicates whether the Tx engine is active.

So what's left as a (potential) problem? -- The code snippet that Shing
shared with us suggests that there is potential for a race between the chip
and an ISR which is already scavenging Tx buffers: the chip has updated the
buffer descriptors and set the interrupt status to reflect the error, but
is not yet done halting the Tx engine (if it had only failed to update the
TXON status bit, there would be no special handling required, since we are
writing that bit anyway in a next step, so the issue has to be that the
chip is in a transitional state and restarting the Tx engine at this point
would be premature). Of course this description assumes that the VIA coders
made that particular recent change in their driver for a reason.

> In the chip-halted work-around that everybody seems to use now,
> reprogram it from scratch. The last program operation being to remove
> loop-back. I don't even know if this chip can be set to loop-back,
> though, so the whole idea may be moot.

It can be set to loopback, but I'm not keen on having my chip reprogrammed
by every traffic burst (excessive collisions -> abort). Is that really the
fashion of the year now?

Roger

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-16 20:31     ` 'Roger Luethi'
  2002-05-16 16:39       ` Ivan G.
@ 2002-05-16 21:05       ` Richard B. Johnson
  2002-05-17  0:16         ` 'Roger Luethi'
  1 sibling, 1 reply; 15+ messages in thread
From: Richard B. Johnson @ 2002-05-16 21:05 UTC (permalink / raw)
  To: 'Roger Luethi'
  Cc: Shing Chuang, Urban Widmark, Ivan G.,
	Jeff Garzik, linux-kernel, AJ Jiang

On Thu, 16 May 2002, 'Roger Luethi' wrote:

> > > The driver "waits a little" in the interrupt handler? How long can that
> > > take, worst case?
> > 
> > Forever..........^;)
> 
> We should assume that this is indeed the case, but it often helps to know
> what the expected values and their distribution are.
> 
> It's a weird situation anyway: both the buffer descriptor and the interrupt
> status have been updated by the chip to reflect the abort, but by the time
> we handle the error it may still be busy coming to a halt.
> 
> What tickles my curiosity is that my previous patch didn't fix the stalling
> for Ivan G. on his VT86C100A. Maybe the chip just wasn't ready to be
> restarted.
> 
> > Even if the chip never breaks, you end up with reports like..
> > "Strange, I make frisbees when buring CDs while M$ machines do
> > backups over the network..."
> 
> Not if the chip is guaranteed to have its thing done after one or two
> iterations. We make some inb and outb calls in the ISR either way.
> 
> That was hypothetically speaking of course, I'm not suggesting we rely on
> such a "guarantee".
> 
> Roger

I think one has to <somehow> find that the chip has halted besides
the current way (noticing that it can't transmit anymore). I don't
know how to do this, of course, but; if you could know that the
chip is hung, the first thing to do is to turn off its interrupt
request(s) (the chip, not the interrupt controller). Some older
(National) devices needed to have the chip then set to loopback
mode because they couldn't be programmed properly if data kept
coming in on the wire. The internal buffer pointers kept changing
in response to incoming data while the chip was being programmed.
By the time you got the chip programmed, it was hung by pointer-wrap.

In the chip-halted work-around that everybody seems to use now,
reprogram it from scratch. The last program operation being to remove
loop-back. I don't even know if this chip can be set to loop-back,
though, so the whole idea may be moot.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-16 18:25   ` Richard B. Johnson
@ 2002-05-16 20:31     ` 'Roger Luethi'
  2002-05-16 16:39       ` Ivan G.
  2002-05-16 21:05       ` Richard B. Johnson
  0 siblings, 2 replies; 15+ messages in thread
From: 'Roger Luethi' @ 2002-05-16 20:31 UTC (permalink / raw)
  To: Richard B. Johnson
  Cc: Shing Chuang, Urban Widmark, Ivan G.,
	Jeff Garzik, linux-kernel, AJ Jiang

> > The driver "waits a little" in the interrupt handler? How long can that
> > take, worst case?
> 
> Forever..........^;)

We should assume that this is indeed the case, but it often helps to know
what the expected values and their distribution are.

It's a weird situation anyway: both the buffer descriptor and the interrupt
status have been updated by the chip to reflect the abort, but by the time
we handle the error it may still be busy coming to a halt.

What tickles my curiosity is that my previous patch didn't fix the stalling
for Ivan G. on his VT86C100A. Maybe the chip just wasn't ready to be
restarted.

> Even if the chip never breaks, you end up with reports like..
> "Strange, I make frisbees when buring CDs while M$ machines do
> backups over the network..."

Not if the chip is guaranteed to have its thing done after one or two
iterations. We make some inb and outb calls in the ISR either way.

That was hypothetically speaking of course, I'm not suggesting we rely on
such a "guarantee".

Roger

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-16 18:03 ` 'Roger Luethi'
@ 2002-05-16 18:25   ` Richard B. Johnson
  2002-05-16 20:31     ` 'Roger Luethi'
  0 siblings, 1 reply; 15+ messages in thread
From: Richard B. Johnson @ 2002-05-16 18:25 UTC (permalink / raw)
  To: 'Roger Luethi'
  Cc: Shing Chuang, Urban Widmark, Ivan G.,
	Jeff Garzik, linux-kernel, AJ Jiang

On Thu, 16 May 2002, 'Roger Luethi' wrote:

> On Thu, 16 May 2002 18:03:25 +0800, Shing Chuang wrote:
> >      As following three error conditions occurred,   the VT6102 & VT86C100A
> > chip are designed to shutdown TX for driver to examine the error frame.
> > 
> >      1. Tx fifo underrun.                   
> > 
> >      2. Tx Abort (Too many collisions occurred).
> > 
> >      3. TxDescriptors status write back  error. (Only on VT6102 chip)
> 
> Hey, thanks! That's exactly the piece of information I've been looking for.
> 
> >      All the three conditions caused the TXON bit of CR1 went off. the
> > driver must wait  a little while until the bit go off, reset the pointer of
> > [...]
> >           do {} while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));
> 
> The driver "waits a little" in the interrupt handler? How long can that
> take, worst case?

Forever..........^;)

> I don't know of many places where the kernel stops to
> wait for an external device to change some value.
> 

Yep... should never wait inside an ISR, to say nothing about
the potential wait-forever.

Even if the chip never breaks, you end up with reports like..
"Strange, I make frisbees when buring CDs while M$ machines do
backups over the network..."

Or, I can't play ".wav" files anymore unless I unplug from the
network...

Stuff has to play together. Sometimes this means you can't get
the maximum-theoretical-possible through-put from your connected
devices.

The worse-case driver is where the programmer decided to turn
interrupts back on in the ISR.... to let higher-priority interrupts
occur...  FYI, there are always higher-priority interrupts that
will take the CPU away... you lose big-time.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

                 Windows-2000/Professional isn't.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-16 10:03 Shing Chuang
@ 2002-05-16 18:03 ` 'Roger Luethi'
  2002-05-16 18:25   ` Richard B. Johnson
  0 siblings, 1 reply; 15+ messages in thread
From: 'Roger Luethi' @ 2002-05-16 18:03 UTC (permalink / raw)
  To: Shing Chuang; +Cc: Urban Widmark, Ivan G., Jeff Garzik, linux-kernel, AJ Jiang

On Thu, 16 May 2002 18:03:25 +0800, Shing Chuang wrote:
>      As following three error conditions occurred,   the VT6102 & VT86C100A
> chip are designed to shutdown TX for driver to examine the error frame.
> 
>      1. Tx fifo underrun.                   
> 
>      2. Tx Abort (Too many collisions occurred).
> 
>      3. TxDescriptors status write back  error. (Only on VT6102 chip)

Hey, thanks! That's exactly the piece of information I've been looking for.

>      All the three conditions caused the TXON bit of CR1 went off. the
> driver must wait  a little while until the bit go off, reset the pointer of
> [...]
>           do {} while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));

The driver "waits a little" in the interrupt handler? How long can that
take, worst case? I don't know of many places where the kernel stops to
wait for an external device to change some value.

I have no numbers on the expected number of iterations, but I'd rather drop
out of the handler after a few failed checks and try again later (or just
reset the chip and log an error, if dropping out is rare enough :-)).

> current Tx descriptor, and  then turn on TXON bit of CR1 again. Those may be

ITYM the TXON bit of CR0. TDMD1 is the one you are setting in CR1. Which
takes me to the next question:

According to my docs, internal registers are like this:

VT86C100A
Byte Bit
0x08 (CR0) 5   TDMD
0x08 (CR0) 6   RDMD
0x09 (CR1) 5   Reserved
0x09 (CR1) 6   Reserved

VT6102
Byte Bit
0x08 (CR0) 5   TDMD
0x08 (CR0) 6   RDMD
0x09 (CR1) 5   TDMD1
0x09 (CR1) 6   RDMD1

The descriptions in both data sheets are somewhat unclear, so maybe you
could enlighten me about why you chose to set TDMD1 instead of TDMD (which
is what the LK driver does)?

Roger

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] #2 VIA Rhine stalls: TxAbort handling
  2002-05-16 20:31     ` 'Roger Luethi'
@ 2002-05-16 16:39       ` Ivan G.
  2002-05-16 21:05       ` Richard B. Johnson
  1 sibling, 0 replies; 15+ messages in thread
From: Ivan G. @ 2002-05-16 16:39 UTC (permalink / raw)
  To: LKML


> What tickles my curiosity is that my previous patch didn't fix the stalling
> for Ivan G. on his VT86C100A. Maybe the chip just wasn't ready to be
> restarted.
>

With your patch #2, the chip would actually "wait forever"
in some cases...it didn't timeout and recover 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH] #2 VIA Rhine stalls: TxAbort handling
@ 2002-05-16 10:03 Shing Chuang
  2002-05-16 18:03 ` 'Roger Luethi'
  0 siblings, 1 reply; 15+ messages in thread
From: Shing Chuang @ 2002-05-16 10:03 UTC (permalink / raw)
  To: 'Roger Luethi', Urban Widmark, Ivan G., Jeff Garzik
  Cc: linux-kernel, AJ Jiang

Hi, 

     As following three error conditions occurred,   the VT6102 & VT86C100A
chip are designed to shutdown TX for driver to examine the error frame.

     1. Tx fifo underrun.                   

     2. Tx Abort (Too many collisions occurred).

     3. TxDescriptors status write back  error. (Only on VT6102 chip)

     All the three conditions caused the TXON bit of CR1 went off. the
driver must wait  a little while until the bit go off, reset the pointer of
current Tx descriptor, and  then turn on TXON bit of CR1 again. Those may be
the reasons why the  via-rhine sometimes hangs, or the watchdog timeout. 
     
     The following are codes of our new driver to handle those errors. 
      (The new driver is under testing now, and will be release very sooner)

     1. For Tx fifo underrun.

          Increase_tx_threshold();

          do {} while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));
          writel(cpu_to_le32(pTD->pInfo->curr_desc),
		&pMacRegs->CurrTxDescAddr);
           //Re-send the packet
           pTD->tdesc0.f1Owner=OWNED_BY_NIC;
           BYTE_REG_BITS_ON(CR0_TXON,&pMacRegs->byCR0); 
           BYTE_REG_BITS_ON(CR1_TDMD1,&pMacRegs->byCR1);

     2. For Tx Abort (Too many collisions occurred).

          do {}  while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));

           //Drop the frame
           pCurrTD=pCurrTD->next;	
           writel(cpu_to_le32(pCurrTD->pInfo->curr_desc),
		&pMacRegs->CurrTxDescAddr);
			
           BYTE_REG_BITS_ON(CR0_TXON,&pMacRegs->byCR0);           	

           BYTE_REG_BITS_ON(CR1_TDMD1,&pMacRegs->byCR1);

     3. For Tx Descripts status write back  error.        
          do {}
          while (BYTE_REG_BITS_IS_ON(CR0_TXON,&pMacRegs->byCR0));

         // As Tx descriptors status write back error occurred, the status
of transmited Tx descriptor is incorrect. 
         //So, all frame must be droped.
          drop_all_transmited_frame();

          BYTE_REG_BITS_ON(CR0_TXON,&pMacRegs->byCR0);           	

          BYTE_REG_BITS_ON(CR1_TDMD1,&pMacRegs->byCR1);

         

Shing,

> -----Original Message-----
> From: Roger Luethi [mailto:rl@hellgate.ch]
> Sent: Thursday, May 16, 2002 11:14 AM
> To: Urban Widmark; Ivan G.; Jeff Garzik
> Cc: linux-kernel@vger.kernel.org
> Subject: [PATCH] #2 VIA Rhine stalls: TxAbort handling
> 
> 
> This patch is against 2.4.19-pre8.
> 
> Patch description (changes over previous patch marked *):
> * Recover gracefully from TxAbort (the actual fix, new version)
> * Be more quiet about aborts, no need to fill log files or 
> scare people
> - Explicitly pick a backoff algorithm (alternative "fix")
> * Rename backoff bits
> - Remove full_duplex, duplex_lock, and advertising from netdev_private
> - Make use of MII register names somewhat more consistent
> - Update comment regarding config information at 0x78
> - Move comment on *_desc_status where it belongs
> * Remove DescEndPacket, DescIntr; unused and hardly correct
> - More comment details
> * Add TXDESC plus comments for desc_length
> * Reverted comment change "Tune configuration???". I am 
> genuinely puzzled
>   by that line. It sets "store & forward" in a way that 
> according to the
>   documentation is bound to be overriden by the threshold values set
>   elsewhere.
> 
> Note that the abort recovery piece is down to one additional 
> line of code
> compared to vanilla 2.4.19-pre8.
> 
> The summary of what happens: when an abort occurs at frame n, 
> the TxRingPtr
> has already been upped to n+2. Frame n will have a status 
> indicating an
> abort, whereas frame n+1 and following are still owned by the NIC. The
> problem is that the NIC forgets about that. When the driver issues a
> TxDemand after an abort, the NIC won't go back to update the 
> status for
> frame n+1. It will happily continue and send all the remaining frames
> starting with n+2. The driver will receive a bunch of 
> interrupts for sent
> frames, but it will never again scavenge another slot because the chip
> skipped one. Until a time out resets the chip, that is.
> 
> With the new patch, we don't break out to retransmit an aborted frame.
> Instead we scavenge all finished slots after the aborted one 
> (not that I
> think there will be any, but it doesn't hurt to be safe). So 
> once we enter
> the error handler, the aborted frame is reaped, and we _know_ 
> what the next
> frame we need to have sent is -- we just failed to scavenge 
> it because it's
> still owned by the NIC. And that's what we hammer into TxRingPtr. Now
> either the NIC was right (hypothetically speaking, it seems 
> to be wrong
> always), then the writel() is a nop. Or the NIC was confused, 
> then it's
> back on track again.
> 
> While TxAbort is the only error I have encountered 
> frequently, it is still
> tempting to think that the same problem hits us with other 
> errors as well,
> TxUnderrun being the most obvious candidate.
> 
> If this patch brings no improvement for the VT86C100A, you may want to
> watch the state of the rx/tx descriptors and the associated pointers.
> 
> The numbers haven't changed, by the way: I am still seeing 
> about 20% higher
> troughput with what is now called BackModify, despite the aborts it
> produces. Abort handling and resends are cheap compared to 
> the benefits of
> flooding the network with traffic, it seems. YMMV, as always.
> 
> Roger
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] #2 VIA Rhine stalls: TxAbort handling
@ 2002-05-16  3:13 Roger Luethi
  0 siblings, 0 replies; 15+ messages in thread
From: Roger Luethi @ 2002-05-16  3:13 UTC (permalink / raw)
  To: Urban Widmark, Ivan G., Jeff Garzik; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2871 bytes --]

This patch is against 2.4.19-pre8.

Patch description (changes over previous patch marked *):
* Recover gracefully from TxAbort (the actual fix, new version)
* Be more quiet about aborts, no need to fill log files or scare people
- Explicitly pick a backoff algorithm (alternative "fix")
* Rename backoff bits
- Remove full_duplex, duplex_lock, and advertising from netdev_private
- Make use of MII register names somewhat more consistent
- Update comment regarding config information at 0x78
- Move comment on *_desc_status where it belongs
* Remove DescEndPacket, DescIntr; unused and hardly correct
- More comment details
* Add TXDESC plus comments for desc_length
* Reverted comment change "Tune configuration???". I am genuinely puzzled
  by that line. It sets "store & forward" in a way that according to the
  documentation is bound to be overriden by the threshold values set
  elsewhere.

Note that the abort recovery piece is down to one additional line of code
compared to vanilla 2.4.19-pre8.

The summary of what happens: when an abort occurs at frame n, the TxRingPtr
has already been upped to n+2. Frame n will have a status indicating an
abort, whereas frame n+1 and following are still owned by the NIC. The
problem is that the NIC forgets about that. When the driver issues a
TxDemand after an abort, the NIC won't go back to update the status for
frame n+1. It will happily continue and send all the remaining frames
starting with n+2. The driver will receive a bunch of interrupts for sent
frames, but it will never again scavenge another slot because the chip
skipped one. Until a time out resets the chip, that is.

With the new patch, we don't break out to retransmit an aborted frame.
Instead we scavenge all finished slots after the aborted one (not that I
think there will be any, but it doesn't hurt to be safe). So once we enter
the error handler, the aborted frame is reaped, and we _know_ what the next
frame we need to have sent is -- we just failed to scavenge it because it's
still owned by the NIC. And that's what we hammer into TxRingPtr. Now
either the NIC was right (hypothetically speaking, it seems to be wrong
always), then the writel() is a nop. Or the NIC was confused, then it's
back on track again.

While TxAbort is the only error I have encountered frequently, it is still
tempting to think that the same problem hits us with other errors as well,
TxUnderrun being the most obvious candidate.

If this patch brings no improvement for the VT86C100A, you may want to
watch the state of the rx/tx descriptors and the associated pointers.

The numbers haven't changed, by the way: I am still seeing about 20% higher
troughput with what is now called BackModify, despite the aborts it
produces. Abort handling and resends are cheap compared to the benefits of
flooding the network with traffic, it seems. YMMV, as always.

Roger

[-- Attachment #2: via-rhine.c.2.patch --]
[-- Type: text/plain, Size: 5660 bytes --]

--- via-rhine.c.org	Sun May 12 21:53:41 2002
+++ via-rhine.c	Thu May 16 05:04:49 2002
@@ -224,7 +224,8 @@ II. Board-specific settings
 Boards with this chip are functional only in a bus-master PCI slot.
 
 Many operational settings are loaded from the EEPROM to the Config word at
-offset 0x78.  This driver assumes that they are correct.
+offset 0x78. For most of these settings, this driver assumes that they are
+correct.
 If this driver is compiled to use PCI memory space operations the EEPROM
 must be configured to enable memory ops.
 
@@ -376,6 +377,14 @@ enum register_offsets {
 	StickyHW=0x83, WOLcrClr=0xA4, WOLcgClr=0xA7, PwrcsrClr=0xAC,
 };
 
+/* Bits in ConfigD (select backoff algorithm (Ethernet capture effect)) */
+enum backoff_bits {
+	BackOptional=0x01,
+	BackModify=0x02,
+	BackCaptureEffect=0x04,
+	BackRandom=0x08
+};
+
 #ifdef USE_MEM
 /* Registers we check that mmio and reg are the same. */
 int mmio_verify_registers[] = {
@@ -413,24 +422,27 @@ enum mii_status_bits {
 /* The Rx and Tx buffer descriptors. */
 struct rx_desc {
 	s32 rx_status;
-	u32 desc_length;
+	u32 desc_length; /* Chain flag, Buffer/frame length */
 	u32 addr;
 	u32 next_desc;
 };
 struct tx_desc {
 	s32 tx_status;
-	u32 desc_length;
+	u32 desc_length; /* Chain flag, Tx Config, Frame length */
 	u32 addr;
 	u32 next_desc;
 };
 
-/* Bits in *_desc.status */
+/* Initial value for tx_desc.desc_length, Buffer size goes to bits 0-10 */
+#define TXDESC 0x00e08000
+
 enum rx_status_bits {
 	RxOK=0x8000, RxWholePkt=0x0300, RxErr=0x008F
 };
 
+/* Bits in *_desc.*_status */
 enum desc_status_bits {
-	DescOwn=0x80000000, DescEndPacket=0x4000, DescIntr=0x1000,
+	DescOwn=0x80000000
 };
 
 /* Bits in ChipCmd. */
@@ -476,13 +488,10 @@ struct netdev_private {
 	u16 chip_cmd;						/* Current setting for ChipCmd */
 
 	/* These values are keep track of the transceiver/media in use. */
-	unsigned int full_duplex:1;			/* Full-duplex operation requested. */
-	unsigned int duplex_lock:1;
 	unsigned int default_port:4;		/* Last dev->if_port value. */
 	u8 tx_thresh, rx_thresh;
 
 	/* MII transceiver section. */
-	u16 advertising;					/* NWay media advertisement */
 	unsigned char phys[MAX_MII_CNT];			/* MII device addresses. */
 	unsigned int mii_cnt;			/* number of MIIs found, but only the first one is used */
 	u16 mii_status;						/* last read MII status */
@@ -693,6 +702,9 @@ static int __devinit via_rhine_init_one 
 		writeb(readb(ioaddr + ConfigA) & 0xFE, ioaddr + ConfigA);
 	}
 
+	/* Select backoff algorithm */
+	writeb(readb(ioaddr + ConfigD) & (0xF0 | BackModify), ioaddr + ConfigD);
+
 	dev->irq = pdev->irq;
 
 	np = dev->priv;
@@ -784,7 +796,7 @@ static int __devinit via_rhine_init_one 
 				   (option & 0x300 ? 100 : 10),
 				   (option & 0x220 ? "full" : "half"));
 			if (np->mii_cnt)
-				mdio_write(dev, np->phys[0], 0,
+				mdio_write(dev, np->phys[0], MII_BMCR,
 						   ((option & 0x300) ? 0x2000 : 0) |  /* 100mbps? */
 						   ((option & 0x220) ? 0x0100 : 0));  /* Full duplex? */
 		}
@@ -927,7 +939,7 @@ static void alloc_tbufs(struct net_devic
 	for (i = 0; i < TX_RING_SIZE; i++) {
 		np->tx_skbuff[i] = 0;
 		np->tx_ring[i].tx_status = 0;
-		np->tx_ring[i].desc_length = cpu_to_le32(0x00e08000);
+		np->tx_ring[i].desc_length = cpu_to_le32(TXDESC);
 		next += sizeof(struct tx_desc);
 		np->tx_ring[i].next_desc = cpu_to_le32(next);
 		np->tx_buf[i] = &np->tx_bufs[i * PKT_BUF_SZ];
@@ -943,7 +955,7 @@ static void free_tbufs(struct net_device
 
 	for (i = 0; i < TX_RING_SIZE; i++) {
 		np->tx_ring[i].tx_status = 0;
-		np->tx_ring[i].desc_length = cpu_to_le32(0x00e08000);
+		np->tx_ring[i].desc_length = cpu_to_le32(TXDESC);
 		np->tx_ring[i].addr = cpu_to_le32(0xBADF00D0); /* An invalid address. */
 		if (np->tx_skbuff[i]) {
 			if (np->tx_skbuff_dma[i]) {
@@ -969,8 +981,8 @@ static void init_registers(struct net_de
 
 	/* Initialize other registers. */
 	writew(0x0006, ioaddr + PCIBusConfig);	/* Tune configuration??? */
-	/* Configure the FIFO thresholds. */
-	writeb(0x20, ioaddr + TxConfig);	/* Initial threshold 32 bytes */
+	/* Configure initial FIFO thresholds. */
+	writeb(0x20, ioaddr + TxConfig);
 	np->tx_thresh = 0x20;
 	np->rx_thresh = 0x60;			/* Written in via_rhine_set_rx_mode(). */
 
@@ -1029,13 +1041,13 @@ static void mdio_write(struct net_device
 
 	if (phy_id == np->phys[0]) {
 		switch (regnum) {
-		case 0:							/* Is user forcing speed/duplex? */
+		case MII_BMCR:					/* Is user forcing speed/duplex? */
 			if (value & 0x9000)			/* Autonegotiation. */
 				np->mii_if.duplex_lock = 0;
 			else
 				np->mii_if.full_duplex = (value & 0x0100) ? 1 : 0;
 			break;
-		case 4:
+		case MII_ADVERTISE:
 			np->mii_if.advertising = value;
 			break;
 		}
@@ -1227,7 +1239,7 @@ static int via_rhine_start_tx(struct sk_
 	}
 
 	np->tx_ring[entry].desc_length = 
-		cpu_to_le32(0x00E08000 | (skb->len >= ETH_ZLEN ? skb->len : ETH_ZLEN));
+		cpu_to_le32(TXDESC | (skb->len >= ETH_ZLEN ? skb->len : ETH_ZLEN));
 
 	/* lock eth irq */
 	spin_lock_irq (&np->lock);
@@ -1492,8 +1504,14 @@ static void via_rhine_error(struct net_d
 		clear_tally_counters(ioaddr);
 	}
 	if (intr_status & IntrTxAbort) {
-		/* Stats counted in Tx-done handler, just restart Tx. */
+		/* No skipping frames we have no results for! Bad chip! */
+		writel(virt_to_bus(&np->tx_ring[np->dirty_tx % TX_RING_SIZE]),
+			   ioaddr + TxRingPtr);
+		/* Prevent hanging on a full queue */
 		writew(CmdTxDemand | np->chip_cmd, dev->base_addr + ChipCmd);
+		if (debug > 1)
+			printk(KERN_INFO "%s: Abort %4.4x, frame dropped.\n",
+				   dev->name, intr_status);
 	}
 	if (intr_status & IntrTxUnderrun) {
 		if (np->tx_thresh < 0xE0)

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2002-05-18 19:11 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <369B0912E1F5D511ACA5003048222B75A3C06E@EXCHANGE2>
     [not found] ` <20020518040143.GA9318@k3.hellgate.ch>
2002-05-17 23:13   ` [PATCH] #2 VIA Rhine stalls: TxAbort handling Ivan G.
2002-05-18 19:11     ` 'Roger Luethi'
2002-05-17 18:46 Manfred Spraul
2002-05-17 19:56 ` 'Roger Luethi'
2002-05-18 10:08 ` David Woodhouse
  -- strict thread matches above, loose matches on Subject: below --
2002-05-16 10:03 Shing Chuang
2002-05-16 18:03 ` 'Roger Luethi'
2002-05-16 18:25   ` Richard B. Johnson
2002-05-16 20:31     ` 'Roger Luethi'
2002-05-16 16:39       ` Ivan G.
2002-05-16 21:05       ` Richard B. Johnson
2002-05-17  0:16         ` 'Roger Luethi'
2002-05-17 12:51           ` Richard B. Johnson
2002-05-17 16:25             ` 'Roger Luethi'
2002-05-16  3:13 Roger Luethi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).