From: Auke Kok <sofar@foo-projects.org>
To: Kenzo Iwami <k-iwami@cj.jp.nec.com>
Cc: netdev@vger.kernel.org,
	Jesse Brandeburg <jesse.brandeburg@intel.com>,
	"Ronciak, John" <john.ronciak@intel.com>
Subject: Re: watchdog timeout panic in e1000 driver
Date: Thu, 19 Oct 2006 08:39:00 -0700
Message-ID: <45379C14.5050901@foo-projects.org>
In-Reply-To: <45375135.5050206@cj.jp.nec.com>

Kenzo Iwami wrote:
> A watchdog timeout panic occurred in e1000 driver (7.2.9-NAPI).

where's the panic message ?

Please CC the maintainers of the driver at all times. Our e-mail addresses are widely 
visible everywhere.

> If e1000_watchdog is called when processing ioctl from ethtool, the system
> could stop inside e1000_watchdog interrupt handler for about 16 seconds
> and the system panicked as a result of a watchdog timeout.
> 
> This problem only occurs on a server using ethernet controller inside
> 631xESB/632xESB, and NMI watchdog enabled.

why only this system? have you seen/tried it on other machines?

> Environment:
>   OS     : RHEL4U3(x86_64)
>   kernel : 2.6.9-34.ELsmp
>   e1000  : 7.2.9-NAPI
>   Ethernet controller : Intel Corporation 631x/632xESB DPT LAN Controller
>                         Copper (rev 01)
>   Watchdog timer should be enabled with a timeout period of less than 16
>   seconds.
> Steps to reproduce:
>   Please apply the attached patch (ethtool.patch) to ethtool (VERSION 5) source
>   code. Run make, and rename the freshly built ethtool as gsetloop.
>   Put gsetloop and the attached shell script (gloop.sh) in the same directory,
>   and execute gloop.sh. The problem should occur within about 5 minutes.
> 
> Cause:
>   The problem occurs in the following steps.
>    - ioctl is executed in ethtool.
>       - e1000_read_phy_reg() is called from ioctl to read the value from phy
>         register.
>       - e1000_get_hw_eeprom_semaphore() is called from e1000_read_phy_reg() to
>         acquire a semaphore.
>       - E1000_SWSM_SWESMBI bit that is FW semaphore bit is set in
>         e1000_get_hw_eeprom_semaphore().
>       - When this bit was set, E1000_SWSM_SMBI bit that is driver's semaphore
>         bit is also set.
>    - e1000_watchdog() of interrupt handler is executed before the
>      E1000_SWSM_SMBI bit is unset.
>       - e1000_read_phy_reg() is called from e1000_watchdog() to read the value
>         from phy register.
>       - e1000_get_software_semaphore() is called from e1000_watchdog to confirm
>         whether interruption handler can acquire a semaphore.
>         This function confirms whether E1000_SWSM_SMBI bit is being set.
>       - Therefore the process does loop for "hw->eeprom.word_size + 1" msec
>         in e1000_get_software_semaphore().
>         The value of "hw->eeprom.word_size + 1" was 16385 on my system.
>         In other words it loops for 16.385 sec in
>         e1000_get_software_semaphore().
>         If NMI watchdog is enabled, the system will panic by NMI watchdog
>         within this loop.
> 
> Fix:
>   In kernels before 2.6.17, the e1000_watchdog() interrupt handler schedules
>   e1000_watchdog_task(). The semaphore is acquired within this task, after
>   ioctl processing for ethtool is finished, and this problem is avoided.
> 
>   e1000_watchdog_task() was removed by the following patch.
> 
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2db10a081c5c1082d58809a1bcf1a6073f4db160
>      e1000: rework driver hardware reset locking
>      >After studying the driver mac reset code it was found that there
>      >were multiple race conditions possible to reset the unit twice or
>      >bring it e1000_up() double. This fixes all occurences where the
>      >driver needs to reset the mac.
>      >
>      >We also remove irq requesting/releasing into _open and _close so
>      >that while the device is _up we will never touch the irq's. This fixes
>      >the double free irq bug that people saw.
>      >
>      >To make sure that the watchdog task doesn't cause another race we let
>      >it run as a non-scheduled task.
> 
>   I'm not sure whether there was any reason to actively remove
>   e1000_watchdog_task(). I think that removing e1000_watchdog_task() was a
>   mistake, and it should be brought back in.


Reverting this code would not be a fix, only a workaround that leaves the problem 
in the code, and as such it is not progress in the right direction.

This report looks like an extreme edge case to me, but I'll look into the fact that the 
driver attempts to sleep for 16384 + 1 msec, which seems overly long :)
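
To spell out that arithmetic, here is a rough userspace model of the polling loop as 
described in the report. It is only a sketch and not the driver source: struct sem_model, 
sem_poll_msec() and the smbi_set flag are made-up names, and each pass of the loop stands 
in for a 1 msec delay. It assumes the holder never releases the bit while we poll, which 
is the situation described for the watchdog running in interrupt context.

```c
#include <stdint.h>

/* Hypothetical model of the state the report describes; not e1000 code. */
struct sem_model {
	uint32_t word_size;	/* stands in for hw->eeprom.word_size */
	int smbi_set;		/* non-zero while another context holds the bit */
};

/*
 * Poll for the semaphore at most word_size + 1 times, counting one
 * millisecond per failed attempt. Returns the milliseconds spent waiting.
 */
int32_t sem_poll_msec(const struct sem_model *m)
{
	int32_t timeout = (int32_t)m->word_size + 1;
	int32_t waited = 0;

	while (timeout) {
		if (!m->smbi_set)
			break;		/* bit is clear, semaphore acquired */
		waited++;		/* models a 1 msec delay */
		timeout--;
	}
	return waited;
}
```

With word_size = 16384 and the bit held throughout, this returns 16385, i.e. the 
16.385 sec from the report. With the bit clear it returns 0.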

As a side note, most other e1000 NICs use hardcoded word_size numbers, but esb2 systems 
read it from a register/eeprom. Can you send me the output of `ethtool -e ethX`? 
Off-list is OK, it might be large.

Thanks,

Auke
