All of lore.kernel.org
 help / color / mirror / Atom feed
* RE: watchdog timeout panic in e1000 driver
@ 2006-11-16 17:20 Brandeburg, Jesse
  2006-11-21 10:16 ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Brandeburg, Jesse @ 2006-11-16 17:20 UTC (permalink / raw)
  To: Kenzo Iwami, Kok, Auke-jan H; +Cc: Shaw Vrana, netdev, Ronciak, John

Kenzo Iwami wrote:
>   ethtool processing holding semaphore
>     INTERRUPT
>       e1000_watchdog waits for semaphore to be released
> 
> The semaphore e1000_watchdog is waiting for can only be released when
> ethtool resumes from interrupt after e1000_watchdog finishes
> (basically a deadlock)
> 
> In order to solve this problem, interrupts has to be disabled on the
> interrupted side (during ethtool processing) and not during
> e1000_watchdog within the interrupt handler.
> 
> e1000_get_hw_eeprom_semaphore is called from both the interrupt level
> and the normal level, and needs to be protected by irq code. The
> reason the patch disables interrupts when freeing the semaphore is
> because e1000_swfw_sync_release also calls
> e1000_get_hw_eeprom_semaphore. 

Doesn't this just mean that we need a spinlock or some other kind of
semaphore around acquiring, using, and releasing this resource?  We keep
going around and around about this but I'm pretty sure spinlocks are
meant to be able to solve exactly this issue.

The problem is going to get considerably more nasty if we need to hold a
spinlock with interrupts disabled for a significant amount of time, at
which point a semaphore of some kind with a spinlock around it would
seem to be more useful.

I'll work with Auke to see if we can come up with another try.

Jesse

^ permalink raw reply	[flat|nested] 29+ messages in thread
* watchdog timeout panic in e1000 driver
@ 2006-10-19 10:19 Kenzo Iwami
  2006-10-19 15:39 ` Auke Kok
  2007-02-20  9:26 ` Kenzo Iwami
  0 siblings, 2 replies; 29+ messages in thread
From: Kenzo Iwami @ 2006-10-19 10:19 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 3560 bytes --]

Hi,

A watchdog timeout panic occurred in e1000 driver (7.2.9-NAPI).
If e1000_watchdog is called when processing ioctl from ethtool, the system
could stop inside e1000_watchdog interrupt handler for about 16 seconds,
and the system panicked as a result of a watchdog timeout.

This problem only occurs on a server using ethernet controller inside
631xESB/632xESB, and NMI watchdog enabled.

Environment:
  OS     : RHEL4U3(x86_64)
  kernel : 2.6.9-34.ELsmp
  e1000  : 7.2.9-NAPI
  Ethernet controller : Intel Corporation 631x/632xESB DPT LAN Controller
                        Copper (rev 01)
  Watchdog timer should be enabled with a timeout period of less than 16
  seconds.

Steps to reproduce:
  Please apply the attached patch (ethtool.patch) to ethtool (VERSION 5) source
  code. Run make, and rename the freshly built ethtool as gsetloop.
  Put gsetloop and the attached shell script (gloop.sh) in the same directory,
  and execute gloop.sh. The problem should occur within about 5 minutes.

Cause:
  The problem occurs in the following steps.
   - ioctl is executed in ethtool.
      - e1000_read_phy_reg() is called from ioctl to read the value from phy
        register.
      - e1000_get_hw_eeprom_semaphore() is called from e1000_read_phy_reg() to
        acquire a semaphore.
      - E1000_SWSM_SWESMBI bit that is FW semaphore bit is set in
        e1000_get_hw_eeprom_semaphore().
      - When this bit was set, E1000_SWSM_SMBI bit that is driver's semaphore
        bit is also set.
   - e1000_watchdog() of interrupt handler is executed before the
     E1000_SWSM_SMBI bit is unset.
      - e1000_read_phy_reg() is called from e1000_watchdog() to read the value
        from phy register.
      - e1000_get_software_semaphore() is called from e1000_watchdog to confirm
        whether interruption handler can acquire a semaphore.
        This function confirms whether E1000_SWSM_SMBI bit is being set.
      - Therefore the process does loop for "hw->eeprom.word_size + 1" msec
        in e1000_get_software_semaphore().
        The value of "hw->eeprom.word_size + 1" was 16385 on my system.
        In other words it loops for 16.385 sec in
        e1000_get_software_semaphore().
        If NMI watchdog is enabled, the system will panic by NMI watchdog
        within this loop.

Fix:
  In kernels before 2.6.17, the e1000_watchdog() interrupt handler schedules
  e1000_watchdog_task(). The semaphore is acquired within this task, after
  ioctl processing for ethtool is finished, and this problem is avoided.

  e1000_watchdog_task() was remove by the following patch.

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2db10a081c5c1082d58809a1bcf1a6073f4db160
     e1000: rework driver hardware reset locking
     >After studying the driver mac reset code it was found that there
     >were multiple race conditions possible to reset the unit twice or
     >bring it e1000_up() double. This fixes all occurences where the
     >driver needs to reset the mac.
     >
     >We also remove irq requesting/releasing into _open and _close so
     >that while the device is _up we will never touch the irq's. This fixes
     >the double free irq bug that people saw.
     >
     >To make sure that the watchdog task doesn't cause another race we let
     >it run as a non-scheduled task.

  I'm not sure whether there was any reason to actively remove
  e1000_watchdog_task(). I think that removing e1000_watchdog_task() was a
  mistake, and it should be brought back in.
--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)



[-- Attachment #2: gloop.sh --]
[-- Type: text/plain, Size: 106 bytes --]

#!/bin/sh

n=64
i=1
[ $# -ge 1 ] && n=$1

while [ $i -le $n ]
do
	./gsetloop eth0 &
	i=`expr $i + 1`
done

[-- Attachment #3: ethtool.patch --]
[-- Type: text/plain, Size: 452 bytes --]

diff -urpN ethtool_git/ethtool.c my-ethtool/ethtool.c
--- ethtool_git/ethtool.c	2006-10-18 14:17:10.000000000 +0900
+++ my-ethtool/ethtool.c	2006-10-18 14:18:45.000000000 +0900
@@ -1576,7 +1576,9 @@ static int do_gset(int fd, struct ifreq 
 
 	ecmd.cmd = ETHTOOL_GSET;
 	ifr->ifr_data = (caddr_t)&ecmd;
-	err = ioctl(fd, SIOCETHTOOL, ifr);
+	for (;;) {
+		err = ioctl(fd, SIOCETHTOOL, ifr);
+	}
 	if (err == 0) {
 		err = dump_ecmd(&ecmd);
 		if (err)

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2007-02-21  5:18 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-16 17:20 watchdog timeout panic in e1000 driver Brandeburg, Jesse
2006-11-21 10:16 ` Kenzo Iwami
2006-12-04  9:14   ` Kenzo Iwami
2006-12-05  0:46     ` Auke Kok
2006-12-12  7:58       ` Kenzo Iwami
2006-12-19  0:13         ` Kenzo Iwami
2007-01-15  9:12           ` Kenzo Iwami
2007-01-15 16:14             ` Auke Kok
2007-01-16  8:42               ` Kenzo Iwami
2007-01-18  9:22                 ` Kenzo Iwami
  -- strict thread matches above, loose matches on Subject: below --
2006-10-19 10:19 Kenzo Iwami
2006-10-19 15:39 ` Auke Kok
     [not found]   ` <4538BFF2.2040207@cj.jp.nec.com>
2006-10-20 15:51     ` Auke Kok
2006-10-24  9:01       ` Kenzo Iwami
2006-10-24 16:15         ` Auke Kok
2006-10-25 13:41           ` Kenzo Iwami
2006-10-25 15:09             ` Auke Kok
2006-10-26 10:35               ` Kenzo Iwami
2006-10-26 14:34                 ` Auke Kok
2006-10-30 11:36                   ` Kenzo Iwami
2006-10-30 17:30                     ` Auke Kok
2006-10-31  3:22                       ` Shaw Vrana
2006-11-01 13:21                         ` Kenzo Iwami
2006-11-15 10:33                           ` Kenzo Iwami
2006-11-15 16:11                             ` Auke Kok
2006-11-16  9:23                               ` Kenzo Iwami
2007-02-20  9:26 ` Kenzo Iwami
2007-02-20 16:10   ` Auke Kok
2007-02-21  5:17     ` Kenzo Iwami

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.