linux-kernel.vger.kernel.org archive mirror
* eepro100 (PCI ID 82820) lockups/failure
@ 2001-08-04  6:06 Colin Walters
  2001-08-06  9:27 ` Andrey Savochkin
  0 siblings, 1 reply; 11+ messages in thread
From: Colin Walters @ 2001-08-04  6:06 UTC
  To: linux-kernel

I have an ia32 motherboard (MSI 815EM Pro) with an integrated Intel
ethernet controller, about which lspci -v has to say:

01:08.0 Ethernet controller: Intel Corporation 82820 820 (Camino 2) Chipset Ethernet (rev 01)
        Subsystem: Intel Corporation: Unknown device 3013
        Flags: bus master, medium devsel, latency 32, IRQ 10
        Memory at d5001000 (32-bit, non-prefetchable) [size=4K]
        I/O ports at ac00 [size=64]
        Capabilities: [dc] Power Management version 2

And /proc/pci says:

  Bus  1, device   8, function  0:
    Ethernet controller: Intel Corporation 82801BA(M) Ethernet (rev 1).
      IRQ 10.
      Master Capable.  Latency=32.  Min Gnt=8.Max Lat=56.
      Non-prefetchable 32 bit memory at 0xd5001000 [0xd5001fff].
      I/O at 0xac00 [0xac3f].

I'm using the 2.4.7 eepro100 driver, and the machine consistently
locks up under any kind of heavy network load.  I've tried
2.4.8-pre{1,2,3} with the same results.  A message sometimes printed
to syslog before the machine locks completely is:

Aug  3 20:56:17 debian kernel: eepro100: wait_for_cmd_done timeout!
Aug  3 21:01:22 debian kernel: eepro100: wait_for_cmd_done timeout!
Aug  3 21:01:29 debian kernel: eepro100: wait_for_cmd_done timeout!

Sometimes it's just the network that goes down, but usually the
machine will lock not long thereafter.

I noticed a patch posted to this mailing list:

<URL:http://mailman.real-time.com/pipermail/linux-kernel/Week-of-Mon-20010618/041187.html>

But it doesn't seem to have been applied.

Anyone have any advice?


* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-04  6:06 eepro100 (PCI ID 82820) lockups/failure Colin Walters
@ 2001-08-06  9:27 ` Andrey Savochkin
  2001-08-06 15:48   ` Joseph Cheek
  2001-08-06 18:39   ` Colin Walters
  0 siblings, 2 replies; 11+ messages in thread
From: Andrey Savochkin @ 2001-08-06  9:27 UTC
  To: Colin Walters, linux-kernel

Hi,

On Sat, Aug 04, 2001 at 02:06:10AM -0400, Colin Walters wrote:
> I have an ia32 motherboard (MSI 815EM Pro) with an integrated Intel
> ethernet controller, about which lspci -v has to say:
[snip]
> I'm using the 2.4.7 eepro100 driver, and the machine consistently
> locks up under any kind of heavy network load.  I've tried
> 2.4.8-pre{1,2,3} with the same results.  A message sometimes printed
> to syslog before the machine locks completely is:
> 
> Aug  3 20:56:17 debian kernel: eepro100: wait_for_cmd_done timeout!
> Aug  3 21:01:22 debian kernel: eepro100: wait_for_cmd_done timeout!
> Aug  3 21:01:29 debian kernel: eepro100: wait_for_cmd_done timeout!
> 
> Sometimes it's just the network that goes down, but usually the
> machine will lock not long thereafter.
> 
> I noticed a patch posted to this mailing list:
> 
> <URL:http://mailman.real-time.com/pipermail/linux-kernel/Week-of-Mon-20010618/041187.html>
> 
> But it doesn't seem to have been applied.

Someone who experiences such timeouts needs to figure out how much time it
really can take before a command is accepted.
Some time ago the timeout was increased by a factor of 10, and now the
current timeout looks insufficient.
It might be a problem with the timing of specific commands or specific chip
revisions.
Or some hardware is too clever and somehow optimizes those repeated read
operations, so that they no longer take a fixed number of bus cycles.

In short, that patch isn't a real solution.
If someone provides me with information on which commands time out and how
much time they really need, we could have a real solution.

Best regards
		Andrey


* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06  9:27 ` Andrey Savochkin
@ 2001-08-06 15:48   ` Joseph Cheek
  2001-08-07 10:48     ` Andrey Savochkin
  2001-08-06 18:39   ` Colin Walters
  1 sibling, 1 reply; 11+ messages in thread
From: Joseph Cheek @ 2001-08-06 15:48 UTC
  To: Andrey Savochkin; +Cc: Colin Walters, linux-kernel

i applied the udelay(1) patch and i still get lockups on 2.4.7-ac5.  not
sure how i could get you the info you need, but i would certainly be
willing to help.

my machine locks hard before anything gets to syslog.

thanks!

joe

--
Joseph Cheek, CTO, Redmond Linux Corp.
joseph@redmondlinux.org, www.redmondlinux.org
Redmond Linux.  Linux is for everyone.

On Mon, 6 Aug 2001, Andrey Savochkin wrote:

> Someone who experiences such timeouts needs to figure out how much time it
> really can take before a command is accepted.
> Some time ago the timeout was increased by a factor of 10, and now the
> current timeout looks insufficient.
> It might be a problem with the timing of specific commands or specific chip
> revisions.
> Or some hardware is too clever and somehow optimizes those repeated read
> operations, so that they no longer take a fixed number of bus cycles.
> 
> In short, that patch isn't a real solution.
> If someone provides me with information on which commands time out and how
> much time they really need, we could have a real solution.



* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06  9:27 ` Andrey Savochkin
  2001-08-06 15:48   ` Joseph Cheek
@ 2001-08-06 18:39   ` Colin Walters
  2001-08-06 19:00     ` Richard B. Johnson
                       ` (2 more replies)
  1 sibling, 3 replies; 11+ messages in thread
From: Colin Walters @ 2001-08-06 18:39 UTC
  To: linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> writes:

> Someone who experiences such timeouts needs to figure out how much
> time it really can take before a command is accepted.  Some time ago
> the timeout was increased by a factor of 10, and now the current
> timeout looks insufficient.  It might be a problem with the
> timing of specific commands or specific chip revisions.  Or some
> hardware is too clever and somehow optimizes those repeated read
> operations, so that they no longer take a fixed number of bus
> cycles.

Shouldn't a udelay(1) always take one microsecond, regardless of
hardware optimizations?

> In short, that patch isn't a real solution.  If someone provides me
> with information on which commands time out and how much time they
> really need, we could have a real solution.

How can I help?  Instrument the code by hand with printk statements?
Or is there a better way?


* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06 18:39   ` Colin Walters
@ 2001-08-06 19:00     ` Richard B. Johnson
  2001-08-07 10:24       ` Andrey Savochkin
  2001-08-06 19:14     ` Alan Cox
  2001-08-07 10:46     ` Andrey Savochkin
  2 siblings, 1 reply; 11+ messages in thread
From: Richard B. Johnson @ 2001-08-06 19:00 UTC
  To: Colin Walters; +Cc: linux-kernel

On Mon, 6 Aug 2001, Colin Walters wrote:

> Andrey Savochkin <saw@saw.sw.com.sg> writes:
> 
> > Someone who experiences such timeouts needs to figure out how much
> > time it really can take before a command is accepted.  Some time ago
> > the timeout was increased by a factor of 10, and now the current
> > timeout looks insufficient.  It might be a problem with the
> > timing of specific commands or specific chip revisions.  Or some
> > hardware is too clever and somehow optimizes those repeated read
> > operations, so that they no longer take a fixed number of bus
> > cycles.
> 
[SNIPPED...]

This may not be a timing problem, but rather a problem that someone
attempted to fix with a timing change.

Possible problem (and solution). Given:

	writel(value, pci_reg);
	status = readl(pci_reg);

The second readl() may (read will) complete before the writel().
This is because writes to the PCI bus may be posted (queued). The
first read will force all writes to complete, however the value
read may be something that was not yet affected by the write.

	writel(value, pci_reg);
	status = readl(pci_reg);
	status = readl(pci_reg);

Would fix it, but gcc may "optimize" one of these away, therefore I
suggest reading something within the board's address space that
is never used, i.e., some offset that gives the model number or
something. It must actually respond to a read because otherwise
performance will degrade while waiting for the PCI bus error.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

    I was going to compile a list of innovations that could be
    attributed to Microsoft. Once I realized that Ctrl-Alt-Del
    was handled in the BIOS, I found that there aren't any.




* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06 18:39   ` Colin Walters
  2001-08-06 19:00     ` Richard B. Johnson
@ 2001-08-06 19:14     ` Alan Cox
  2001-08-07 10:46     ` Andrey Savochkin
  2 siblings, 0 replies; 11+ messages in thread
From: Alan Cox @ 2001-08-06 19:14 UTC
  To: Colin Walters; +Cc: linux-kernel

> Shouldn't a udelay(1) always take one microsecond, regardless of
> hardware optimizations?

A udelay(1) should always take 1 microsecond or a bit longer. There are some
funnies with PCI posting to beware of - notably

	writel(0x1, foo->reg);
	udelay(1);
	writel(0x0, foo->reg);

does _not_ guarantee the two writes hit the PCI device with a 1 us delay
between them. Where it's PCI access timing that matters you need to do

	writel(0x1, foo->reg);
	readl(foo->somethingthatdoesnothing);
	udelay(1);
	writel(0x0, foo->reg);
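
To illustrate why the dummy read matters, here is a toy userspace model of
posting, not real PCI code: the bridge queues writes and only a read forces
them out to the device. The names toy_bridge, toy_writel and toy_readl are
made up for this example, and the model ignores everything about a real
bridge except that ordering rule.

```c
#include <assert.h>
#include <stdint.h>

#define POST_DEPTH 4

/* Toy model (an assumption, not real hardware): a bridge that queues
 * ("posts") writes, and a device register sitting behind it. */
struct toy_bridge {
    uint32_t posted[POST_DEPTH];  /* writes accepted but not yet delivered */
    int n_posted;
    uint32_t dev_reg;             /* value the device has actually seen */
    int dev_writes;               /* number of writes delivered so far */
};

/* Deliver every queued write to the device, oldest first. */
static void toy_flush(struct toy_bridge *b)
{
    for (int i = 0; i < b->n_posted; i++) {
        b->dev_reg = b->posted[i];
        b->dev_writes++;
    }
    b->n_posted = 0;
}

/* A write returns as soon as the bridge has queued it. */
static void toy_writel(struct toy_bridge *b, uint32_t val)
{
    if (b->n_posted == POST_DEPTH)
        toy_flush(b);             /* queue full: drain before accepting more */
    assert(b->n_posted < POST_DEPTH);
    b->posted[b->n_posted++] = val;
}

/* A read may not pass posted writes, so it drains the queue first. */
static uint32_t toy_readl(struct toy_bridge *b)
{
    toy_flush(b);
    return b->dev_reg;
}
```

In this model a toy_writel() followed by a delay leaves dev_writes unchanged;
only after a toy_readl() has the device seen the write, which is the point at
which a delay starts to mean something to the device.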



* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06 19:00     ` Richard B. Johnson
@ 2001-08-07 10:24       ` Andrey Savochkin
  2001-08-07 12:11         ` Richard B. Johnson
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Savochkin @ 2001-08-07 10:24 UTC
  To: root; +Cc: Colin Walters, linux-kernel

Hi,

On Mon, Aug 06, 2001 at 03:00:07PM -0400, Richard B. Johnson wrote:
[snip]
> This may not be a timing problem, but rather a problem that someone
> attempted to fix with a timing change.
> 
> Possible problem (and solution). Given:
> 
> 	writel(value, pci_reg);
> 	status = readl(pci_reg);
> 
> The second readl() may (read will) complete before the writel().
> This is because writes to the PCI bus may be posted (queued). The
> first read will force all writes to complete, however the value
> read may be something that was not yet affected by the write.
> 
> 	writel(value, pci_reg);
> 	status = readl(pci_reg);
> 	status = readl(pci_reg);

Thanks for the note, I'll keep it in mind.

However, for this particular case I'm interested in a loop like
	while((a = readb(reg)) && --count >= 0);
I wonder if there are circumstances in which the repeated reads can return
"cached" values or whatever, so that the loop results in significantly
fewer bus cycles than it's supposed to?
My understanding is that there shouldn't be any.

Best regards
		Andrey


* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06 18:39   ` Colin Walters
  2001-08-06 19:00     ` Richard B. Johnson
  2001-08-06 19:14     ` Alan Cox
@ 2001-08-07 10:46     ` Andrey Savochkin
  2 siblings, 0 replies; 11+ messages in thread
From: Andrey Savochkin @ 2001-08-07 10:46 UTC
  To: Colin Walters, linux-kernel

On Mon, Aug 06, 2001 at 02:39:14PM -0400, Colin Walters wrote:
> 
> > In short, that patch isn't a real solution.  If someone provides me
> > with information on which commands time out and how much time they
> > really need, we could have a real solution.
> 
> How can I help?  Instrument the code by hand with printk statements?
> Or is there a better way?

I would do it with just printk.
The first round is to check how many `udelay(1)' loops are necessary to get
an ack for the longest commands (and which commands those are).
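
A minimal userspace sketch of that first round, not the eepro100 code itself:
it polls a simulated command register, counts the busy polls, and remembers
the worst case seen for each command. The names fake_cmd_reg, fake_readb,
wait_for_cmd_done_counted and worst_polls are hypothetical; in the driver the
read would be the real register access, each iteration would be followed by
udelay(1), and the numbers would be reported with printk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the chip: the command byte reads back
 * non-zero for `busy_reads` polls, then reads as 0 (accepted). */
struct fake_cmd_reg {
    uint8_t cmd;       /* last command written, 0 once accepted */
    int busy_reads;    /* busy polls remaining before acceptance */
};

static uint8_t fake_readb(struct fake_cmd_reg *reg)
{
    if (reg->cmd != 0 && reg->busy_reads-- <= 0)
        reg->cmd = 0;  /* the device has accepted the command */
    return reg->cmd;
}

#define NUM_CMDS 256
static int worst_polls[NUM_CMDS];  /* longest wait seen, indexed by command */

/* Returns the number of busy polls, or -1 if max_polls was exhausted. */
static int wait_for_cmd_done_counted(struct fake_cmd_reg *reg, int max_polls)
{
    uint8_t cmd = reg->cmd;        /* remember which command we wait for */
    int polls = 0;

    assert(max_polls > 0);
    while (fake_readb(reg) != 0) {
        if (++polls >= max_polls)
            return -1;             /* a driver would log `cmd` here */
        /* a driver would call udelay(1) here */
    }
    if (polls > worst_polls[cmd])
        worst_polls[cmd] = polls;
    return polls;
}
```

Dumping worst_polls after a stress run would show which commands come close
to the limit, which is the information asked for above.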

Then it's interesting to know how long the wait_for_cmd_done loop has been
executing when it times out.
Not as a loop count, of course, but in clock time.
It can be measured with the CPU cycle counter.
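
A sketch of the clock-time measurement under stated assumptions: the clock
source is passed in as a function pointer so that the logic can be exercised
outside the kernel, and timed_wait, wait_result, fake_now and fake_busy are
illustrative names only. In a driver the clock would be a read of the CPU
cycle counter and the busy check would be the command register read.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*cycles_fn)(void);  /* injected clock source */
typedef int (*busy_fn)(void *ctx);    /* non-zero while a command is pending */

struct wait_result {
    int timed_out;     /* 1 if the poll budget ran out */
    int polls;         /* busy polls seen */
    uint64_t cycles;   /* clock ticks between loop entry and exit */
};

/* Poll until not busy or until max_polls busy polls, timing the whole loop. */
static struct wait_result timed_wait(busy_fn busy, void *ctx,
                                     cycles_fn now, int max_polls)
{
    struct wait_result r = { 0, 0, 0 };
    uint64_t start = now();

    assert(max_polls > 0);
    while (busy(ctx)) {
        if (++r.polls >= max_polls) {
            r.timed_out = 1;
            break;
        }
    }
    r.cycles = now() - start;
    return r;
}

/* Stand-ins for experimenting: every poll "costs" 30 ticks, and the
 * device stays busy for *ctx more polls. Both are assumptions. */
static uint64_t fake_ticks;

static uint64_t fake_now(void)
{
    return fake_ticks;
}

static int fake_busy(void *ctx)
{
    int *left = ctx;

    fake_ticks += 30;
    return (*left)-- > 0;
}
```

Comparing cycles on a timeout against the intended budget shows whether the
loop really waited as long as it was supposed to.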

This way we check how much time a command may need and whether the timeout in
the loop works as expected.

Another possibility is that the new chip revisions have some unknown timing
constraints, like requirements for delays between certain commands or register
accesses.  Those `udelay(1)', executed after every command, may provide such
delays as a side effect.
It's not clear what the easiest way to check it is.

Best regards
		Andrey


* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-06 15:48   ` Joseph Cheek
@ 2001-08-07 10:48     ` Andrey Savochkin
  2001-08-07 16:35       ` Joseph Cheek
  0 siblings, 1 reply; 11+ messages in thread
From: Andrey Savochkin @ 2001-08-07 10:48 UTC
  To: Joseph Cheek; +Cc: linux-kernel

On Mon, Aug 06, 2001 at 08:48:22AM -0700, Joseph Cheek wrote:
> i applied the udelay(1) patch and i still get lockups on 2.4.7-ac5.  not
> sure how i could get you the info you need, but i would certainly be
> willing to help.
> 
> my machine locks hard before anything gets to syslog.

Are you able to check the screen?
Did the driver print anything before the lockup?

	Andrey


* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-07 10:24       ` Andrey Savochkin
@ 2001-08-07 12:11         ` Richard B. Johnson
  0 siblings, 0 replies; 11+ messages in thread
From: Richard B. Johnson @ 2001-08-07 12:11 UTC
  To: Andrey Savochkin; +Cc: Colin Walters, linux-kernel

On Tue, 7 Aug 2001, Andrey Savochkin wrote:
[SNIPPED...]


> 
> However, for this particular case I'm interested in a loop like
> 	while((a = readb(reg)) && --count >= 0);
> I wonder if there are circumstances in which the repeated reads can return
> "cached" values or whatever, so that the loop results in significantly
> fewer bus cycles than it's supposed to?
> My understanding is that there shouldn't be any.
> 
> Best regards
> 		Andrey

You should have obtained access to the PCI address space by using
ioremap_nocache(). If so, every read will go to the bus. However,
what is actually read will be long-words so, if the device assumes
something else, it's broken. FYI, some ISA word-addressed boards
have been "converted" to PCI by gluing on a PCI interface chip.
They require some strange hacks to work, like reading a byte register
by using the lowest long-word-aligned address that will contain that
byte and shifting the result.
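
A rough illustration of that hack, assuming little-endian byte lanes and a
device that only answers aligned 32-bit reads. The register file here is a
plain array, and read_long and read_byte_via_long are hypothetical helpers,
so this shows only the address arithmetic and not a real bus access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit-only register file standing in for the device. */
static uint32_t read_long(const uint32_t *regs, size_t byte_off)
{
    assert((byte_off & 3) == 0);   /* only aligned long-words are decoded */
    return regs[byte_off / 4];
}

/* Fetch one byte register by reading the long-word that contains it. */
static uint8_t read_byte_via_long(const uint32_t *regs, size_t byte_off)
{
    size_t aligned = byte_off & ~(size_t)3;         /* containing long-word */
    unsigned shift = (unsigned)(byte_off & 3) * 8;  /* little-endian lane */

    return (uint8_t)(read_long(regs, aligned) >> shift);
}
```

With the long-word at byte offset 4 holding 0xAABBCCDD, offset 4 yields 0xDD
and offset 7 yields 0xAA under the little-endian assumption.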


Cheers,
Dick Johnson





* Re: eepro100 (PCI ID 82820) lockups/failure
  2001-08-07 10:48     ` Andrey Savochkin
@ 2001-08-07 16:35       ` Joseph Cheek
  0 siblings, 0 replies; 11+ messages in thread
From: Joseph Cheek @ 2001-08-07 16:35 UTC
  To: Andrey Savochkin; +Cc: linux-kernel

once i changed the udelay() to 10 i was able to see the wait_for_cmd_done
timeout errors others have reported.  with the udelay at 1 i saw nothing on
the screen before the lockup.

thanks!

joe

Andrey Savochkin wrote:

>On Mon, Aug 06, 2001 at 08:48:22AM -0700, Joseph Cheek wrote:
>
>>i applied the udelay(1) patch and i still get lockups on 2.4.7-ac5.  not
>>sure how i could get you the info you need, but i would certainly be
>>willing to help.
>>
>>my machine locks hard before anything gets to syslog.
>>
>
>Are you able to check the screen?
>Did the driver print anything before the lockup?
>
>	Andrey
>


