linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
@ 2004-04-03 11:30 Manfred Spraul
  2004-04-03 12:45 ` Francois Romieu
  2004-04-04  0:55 ` Malvineous
  0 siblings, 2 replies; 13+ messages in thread
From: Manfred Spraul @ 2004-04-03 11:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Adam Nielsen

Andrew wrote:

>The logic is faulty, or at least very odd.
>
>	tx_left = tp->cur_tx - dirty_tx;
>
>	while (tx_left > 0) {
>		int entry = dirty_tx % NUM_TX_DESC;
>
>		if (!(le32_to_cpu(tp->TxDescArray[entry].status) & OWNbit)) {
>			...
>		}
>	}
>
>Why is that `if' test there at all?  If it ever returns false, the box
>locks up.  A BUG_ON(le32_to_cpu(tp->TxDescArray[entry].status) & OWNbit)
>might make more sense.
>  
>
tx_left counts packets submitted by hard_xmit_start to the hardware. 
Initially OWNbit is set, the packet is owned by the nic. The OWNbit is 
cleared by the hardware after the packet was sent. A packet with OWNbit 
set means that the nic didn't send it yet to the wire. I think the "else 
break;" patch is correct, but someone with docs should confirm that.

Adam: did you see deadlocks that disappeared after applying your patch? 
It shouldn't deadlock - it should loop until the nic sends the packet to 
the wire. It might take a few msecs, but then it should continue. 
Perhaps gcc optimized away the reload from memory and loops on a 
register. Or there is another bug that is hidden by your patch.

--
    Manfred


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-03 11:30 [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver Manfred Spraul
@ 2004-04-03 12:45 ` Francois Romieu
  2004-04-04  0:55 ` Malvineous
  1 sibling, 0 replies; 13+ messages in thread
From: Francois Romieu @ 2004-04-03 12:45 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Andrew Morton, linux-kernel, Adam Nielsen

Manfred Spraul <manfred@colorfullife.com> :
[...]
> tx_left counts packets submitted by hard_xmit_start to the hardware. 
> Initially OWNbit is set, the packet is owned by the nic. The OWNbit is 
> cleared by the hardware after the packet was sent. A packet with OWNbit 
> set means that the nic didn't send it yet to the wire. I think the "else 
> break;" patch is correct, but someone with docs should confirm that.

Realtek Gigabit Ethernet Media Access Controller with power management R8169
Rev.1.21, p.54
[...]
Ownership: This bit, when set, indicates that the descriptor is owned by
the NIC. When cleared, it indicates that the descriptor is owned by the host
system. NIC clears this bit when the relative buffer data is already
transmitted. In this case, OWN=0.

[...]
> Perhaps gcc optimized away the reload from memory and loops on a 

Point taken.

--
Ueimor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-03 11:30 [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver Manfred Spraul
  2004-04-03 12:45 ` Francois Romieu
@ 2004-04-04  0:55 ` Malvineous
  2004-04-04  9:15   ` Francois Romieu
  1 sibling, 1 reply; 13+ messages in thread
From: Malvineous @ 2004-04-04  0:55 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: akpm, linux-kernel, a.nielsen

> Adam: did you see deadlocks that disappeared after applying your
> patch? It shouldn't deadlock - it should loop until the nic sends the
> packet to the wire. It might take a few msecs, but then it should
> continue. Perhaps gcc optimized away the reload from memory and loops
> on a register. Or there is another bug that is hidden by your patch.

Yes, it used to deadlock within a second after starting a large transfer
across the network (e.g. copying a 1GB+ file over NFS) however smaller
transfers (e.g. background gkrellmd traffic, web browsing, etc.) was
less likely to cause problems (I could go for a few hours before it
would deadlock.)

I initially tried to move the counters outside the loop, as I thought it
was just the one entry in the array causing the problem, however this
slowed network traffic down to approx 5kB/sec.  Upon looking at the
RTL8139 code it looked like "else break" was the correct action, and
this brought network traffic back up to full speed and it's now been 4.5
days since I booted the kernel with the patch and it's all working
perfectly.

When it did deadlock it was more or less permanent, as any programs
accessing the NIC (or indeed the hard drive) would immediately deadlock
- so no program could send network data, thus it would loop forever
waiting for more traffic.

I did add a 'printk' line in to see what the variables were just to make
sure this was the location of the bug, and as I expected none of
the values changed at all.

Cheers,
Adam.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-04  0:55 ` Malvineous
@ 2004-04-04  9:15   ` Francois Romieu
  2004-04-05 21:51     ` Adam Nielsen
  0 siblings, 1 reply; 13+ messages in thread
From: Francois Romieu @ 2004-04-04  9:15 UTC (permalink / raw)
  To: Malvineous; +Cc: Manfred Spraul, akpm, linux-kernel, a.nielsen

Malvineous <malvineous@optushome.com.au> :
[...]
> I did add a 'printk' line in to see what the variables were just to make
> sure this was the location of the bug, and as I expected none of
> the values changed at all.

Can you send the unpatched r8169.o module ?

Thanks in advance.

--
Ueimor


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-04  9:15   ` Francois Romieu
@ 2004-04-05 21:51     ` Adam Nielsen
  2004-04-06 17:11       ` Manfred Spraul
  0 siblings, 1 reply; 13+ messages in thread
From: Adam Nielsen @ 2004-04-05 21:51 UTC (permalink / raw)
  To: Francois Romieu; +Cc: manfred, akpm, linux-kernel, a.nielsen

> Can you send the unpatched r8169.o module ?

Here is the compiled source file (r8169.c -> r8169.o) from 2.6.5-rc1
(which had the same problem and has an identical .c file) but I'm not
sure if it's different to the final module so please let me know if you
wanted something else:

http://members.optushome.com.au/a.nielsen/r8169.o.bz2  (66 kB)

Cheers,
Adam.

$ gcc -v
Reading specs from /usr/lib/gcc-lib/i486-slackware-linux/3.2.3/specs
Configured with: ../gcc-3.2.3/configure --prefix=/usr --enable-shared
--enable-threads=posix --enable-__cxa_atexit --disable-checking
--with-gnu-ld --verbose --target=i486-slackware-linux
--host=i486-slackware-linux
Thread model: posix
gcc version 3.2.3

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-05 21:51     ` Adam Nielsen
@ 2004-04-06 17:11       ` Manfred Spraul
  2004-04-06 20:50         ` Francois Romieu
  0 siblings, 1 reply; 13+ messages in thread
From: Manfred Spraul @ 2004-04-06 17:11 UTC (permalink / raw)
  To: Adam Nielsen; +Cc: Francois Romieu, akpm, linux-kernel

Adam Nielsen wrote:

>>Can you send the unpatched r8169.o module ?
>>    
>>
>
>Here is the compiled source file (r8169.c -> r8169.o) from 2.6.5-rc1
>(which had the same problem and has an identical .c file) but I'm not
>sure if it's different to the final module so please let me know if you
>wanted something else:
>
>http://members.optushome.com.au/a.nielsen/r8169.o.bz2  (66 kB)
>  
>
Thanks. The code reloads the tx ring value from memory, thus I don't 
understand why it deadlocks.

--
    Manfred


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-06 17:11       ` Manfred Spraul
@ 2004-04-06 20:50         ` Francois Romieu
  0 siblings, 0 replies; 13+ messages in thread
From: Francois Romieu @ 2004-04-06 20:50 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Adam Nielsen, akpm, linux-kernel

Manfred Spraul <manfred@colorfullife.com> :
[...]
> Thanks. The code reloads the tx ring value from memory, thus I don't 
> understand why it deadlocks.

Well...
- rtl8169_interrupt() acks all events before rtl8169_tx_interrupt() is called
- the count of descriptors handled in rtl8169_tx_interrupt() is only limited
  by the number of packets submitted for TX at the time rtl8169_tx_interrupt()
  is called

-> if there is a stream of Tx events, it is possible that Tx descriptors are
   processed before the relevant event is notified to the host by the network
   adapter.

--
Ueimor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-05 23:08 Dieter Nützel
@ 2004-04-06  7:47 ` Francois Romieu
  0 siblings, 0 replies; 13+ messages in thread
From: Francois Romieu @ 2004-04-06  7:47 UTC (permalink / raw)
  To: =?unknown-8bit?Q?Dieter_N=FCtzel?=
  Cc: Linux-Kernel List, Andrew Morton, Manfred Spraul, netdev, jgarzik

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=unknown-8bit, Size: 618 bytes --]

Dieter Nützel <Dieter.Nuetzel@hamburg.de> :
[...]
> Isn't the "current" RTL8169s (32/64) driver much outdated?
> 
> I got a version # 1.6 with all my cards.

This version has been merged in -mm since late 2003. Complete history is
available on the netdev mailing list.

>From memory, there should be a Tx descriptor overflow bug in # 1.6 (see
tx_interrupt). You may want to change some bits in your #1.6 version.

There is a regression related to link removal in the current -mm/jg-netdev
version (-mm stands behind jg-netdev where I hope to push what is posted
on the netdev mailing list). Stay tuned. :o)

--
Ueimor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
@ 2004-04-05 23:08 Dieter Nützel
  2004-04-06  7:47 ` Francois Romieu
  0 siblings, 1 reply; 13+ messages in thread
From: Dieter Nützel @ 2004-04-05 23:08 UTC (permalink / raw)
  To: Linux-Kernel List
  Cc: Francois Romieu, Andrew Morton, Manfred Spraul, netdev, jgarzik

Francois Romieu <romieu@fr.zoreil.com> wrote:
> Manfred Spraul <manfred@colorfullife.com> :
> [...]
> > tx_left counts packets submitted by hard_xmit_start to the hardware. 
> > Initially OWNbit is set, the packet is owned by the nic. The OWNbit is 
> > cleared by the hardware after the packet was sent. A packet with OWNbit 
> > set means that the nic didn't send it yet to the wire. I think the "else 
> > break;" patch is correct, but someone with docs should confirm that.
> 
> Realtek Gigabit Ethernet Media Access Controller with power management R8169
> Rev.1.21, p.54
> [...]
> Ownership: This bit, when set, indicates that the descriptor is owned by
> the NIC. When cleared, it indicates that the descriptor is owned by the host
> system. NIC clears this bit when the relative buffer data is already
> transmitted. In this case, OWN=0.
> 
> [...]
> > Perhaps gcc optimized away the reload from memory and loops on a 
> 
> Point taken.

Isn't the "current" RTL8169s (32/64) driver much outdated?

I got a version # 1.6 with all my cards.
Anyone what a copy?

/*
=========================================================================
 r8169.c: A RealTek RTL8169s/8110s Gigabit Ethernet driver for Linux kernel 
2.4.x.
 --------------------------------------------------------------------

 History:
 Feb  4 2002	- created initially by ShuChen <shuchen@realtek.com.tw>.
 May 20 2002	- Add link status force-mode and TBI mode support.
=========================================================================
  1. The media can be forced in 5 modes.
	 Command: 'insmod r8169 media = SET_MEDIA'
	 Ex:	  'insmod r8169 media = 0x04' will force PHY to operate in 100Mpbs 
Half-duplex.

	 SET_MEDIA can be:
 		_10_Half	= 0x01
 		_10_Full	= 0x02
 		_100_Half	= 0x04
 		_100_Full	= 0x08
 		_1000_Full	= 0x10

  2. Support TBI mode.
//=========================================================================
RTL8169_VERSION "1.1"	<2002/10/4>

	The bit4:0 of MII register 4 is called "selector field", and have to be
	00001b to indicate support of IEEE std 802.3 during NWay process of
	exchanging Link Code Word (FLP).

RTL8169_VERSION "1.2"	<2003/6/17>
	Update driver module name.
	Modify ISR.
        Add chip mac_version.

RTL8169_VERSION "1.3"	<2003/6/20>
        Add chip phy_version.
	Add priv->phy_timer_t, rtl8169_phy_timer_t_handler()
	Add rtl8169_hw_PHY_config()
	Add rtl8169_hw_PHY_reset()

RTL8169_VERSION "1.4"	<2003/7/14>
	Add tx_bytes, rx_bytes.

RTL8169_VERSION "1.5"	<2003/7/18>
	Set 0x0000 to PHY at offset 0x0b.
	Modify chip mac_version, phy_version
	Force media for multiple card.
RTL8169_VERSION "1.6"	<2003/8/25>
	Modify receive data buffer.
*/

Greetings,
	Dieter


-- 
Dieter Nützel
@home: <Dieter.Nuetzel () hamburg ! de>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-03 10:07   ` Andrew Morton
@ 2004-04-03 12:13     ` Francois Romieu
  0 siblings, 0 replies; 13+ messages in thread
From: Francois Romieu @ 2004-04-03 12:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: a.nielsen, linux-kernel, torvalds, degger

Andrew Morton <akpm@osdl.org> :
[...]
> The logic is faulty, or at least very odd.
> 
> 	tx_left = tp->cur_tx - dirty_tx;
> 
> 	while (tx_left > 0) {
> 		int entry = dirty_tx % NUM_TX_DESC;
> 
> 		if (!(le32_to_cpu(tp->TxDescArray[entry].status) & OWNbit)) {
> 			...
> 		}
> 	}
> 
> Why is that `if' test there at all?  If it ever returns false, the box

It checks that the network card has returned the buffer to the host computer.
The code will loop in the irq handler until every buffer submitted
for Tx is processed
- it is gross;
- it is not robust;
- I assume that's why the driver sucks cpu like hell on high Tx traffic;
- ... but it is not faulty. 

> locks up.  A BUG_ON(le32_to_cpu(tp->TxDescArray[entry].status) & OWNbit)
> might make more sense.

In Linus's current tree, I doubt it: assume the host has submitted a few
packets for Tx and the network card issue an interrupt for the first one 
whereas the second descriptor still belongs to the network card -> boom.

--
Ueimor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-03  9:27 ` Francois Romieu
@ 2004-04-03 10:07   ` Andrew Morton
  2004-04-03 12:13     ` Francois Romieu
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2004-04-03 10:07 UTC (permalink / raw)
  To: Francois Romieu; +Cc: a.nielsen, linux-kernel, torvalds, degger

Francois Romieu <romieu@fr.zoreil.com> wrote:
>
> Adam Nielsen <a.nielsen@optushome.com.au> :
> [...]
> > in the Realtek 8169 gigabit ethernet driver.  Due to a logic error,
> > there is a loop in an interrupt handler that often goes infinite, thus
> > locking up the entire computer.  The attached patch fixes the problem.
> 
> - until there is an explanation on _why_ this condition happens, this is a
>   band-aid for an unexplained condition, not a fix for a "logic error" (it
>   may have interesting performance effects though);
> 

The logic is faulty, or at least very odd.

	tx_left = tp->cur_tx - dirty_tx;

	while (tx_left > 0) {
		int entry = dirty_tx % NUM_TX_DESC;

		if (!(le32_to_cpu(tp->TxDescArray[entry].status) & OWNbit)) {
			...
		}
	}

Why is that `if' test there at all?  If it ever returns false, the box
locks up.  A BUG_ON(le32_to_cpu(tp->TxDescArray[entry].status) & OWNbit)
might make more sense.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
  2004-04-03  5:02 Adam Nielsen
@ 2004-04-03  9:27 ` Francois Romieu
  2004-04-03 10:07   ` Andrew Morton
  0 siblings, 1 reply; 13+ messages in thread
From: Francois Romieu @ 2004-04-03  9:27 UTC (permalink / raw)
  To: Adam Nielsen; +Cc: linux-kernel, Linus Torvalds, degger

Adam Nielsen <a.nielsen@optushome.com.au> :
[...]
> in the Realtek 8169 gigabit ethernet driver.  Due to a logic error,
> there is a loop in an interrupt handler that often goes infinite, thus
> locking up the entire computer.  The attached patch fixes the problem.

- until there is an explanation on _why_ this condition happens, this is a
  band-aid for an unexplained condition, not a fix for a "logic error" (it
  may have interesting performance effects though);

- the included comments may be ok in the bk repository but they do not really
  add anything to the driver itself;

- please avoid the "else break" in just one line;

- last change to this file found its way through the bk thing on 02/26/2004
  but a set of changes for this driver is available 1) in -mm tree 2) in
  Jeff Garzik's -netdev patches 3) near my fridge. A patch addressing the
  same issue has been posted on l-k the 03/29/2004 as an answer to a remark
  made by Daniel Egger (feedback anyone ?). If you can wait until the whole
  thing is included, it will make my life easier;

- please Cc: netdev@oss.sgi.com for network related patches as well as
  jgarzik@pobox.com for network drivers related patches

--
Ueimor

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver
@ 2004-04-03  5:02 Adam Nielsen
  2004-04-03  9:27 ` Francois Romieu
  0 siblings, 1 reply; 13+ messages in thread
From: Adam Nielsen @ 2004-04-03  5:02 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds

Hi everyone,

This is a tiny patch that fixes a particularly annoying bug
in the Realtek 8169 gigabit ethernet driver.  Due to a logic error,
there is a loop in an interrupt handler that often goes infinite, thus
locking up the entire computer.  The attached patch fixes the problem.

I have patched against linux-2.6.5-rc2-bk6, however the source file in
question hasn't been modified for a long time, so the patch should apply
cleanly to any recent kernel version.

Cheers,
Adam.

--- linux-2.6.5-rc2-bk6/drivers/net/r8169.c	2004-03-27 17:38:03.000000000 +1000
+++ linux-2.6.5-rc2-bk6a/drivers/net/r8169.c	2004-03-31 18:45:10.000000000 +1000
@@ -33,6 +33,12 @@
 	- Copy mc_filter setup code from 8139cp
 	  (includes an optimization, and avoids set_bit use)
 
+VERSION 1.2a <2004/03/31> Adam Nielsen (a.nielsen@optushome.com.au)
+
+	"else break;" added to the if-statement in rtl8169_tx_interrupt() to prevent
+	an infinite loop and the resulting kernel lockup when the interrupt is called
+	with a dirty buffer (perhaps when there's nothing to transmit?)
+
 */
 
 #include <linux/module.h>
@@ -892,7 +898,7 @@
 			tp->Tx_skbuff[entry] = NULL;
 			dirty_tx++;
 			tx_left--;
-		}
+		} else break;
 	}
 
 	if (tp->dirty_tx != dirty_tx) {

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2004-04-06 20:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-03 11:30 [PATCH] Fix kernel lockup in RTL-8169 gigabit ethernet driver Manfred Spraul
2004-04-03 12:45 ` Francois Romieu
2004-04-04  0:55 ` Malvineous
2004-04-04  9:15   ` Francois Romieu
2004-04-05 21:51     ` Adam Nielsen
2004-04-06 17:11       ` Manfred Spraul
2004-04-06 20:50         ` Francois Romieu
  -- strict thread matches above, loose matches on Subject: below --
2004-04-05 23:08 Dieter Nützel
2004-04-06  7:47 ` Francois Romieu
2004-04-03  5:02 Adam Nielsen
2004-04-03  9:27 ` Francois Romieu
2004-04-03 10:07   ` Andrew Morton
2004-04-03 12:13     ` Francois Romieu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).