All of lore.kernel.org
 help / color / mirror / Atom feed
* watchdog timeout panic in e1000 driver
@ 2006-10-19 10:19 Kenzo Iwami
  2006-10-19 15:39 ` Auke Kok
  2007-02-20  9:26 ` Kenzo Iwami
  0 siblings, 2 replies; 29+ messages in thread
From: Kenzo Iwami @ 2006-10-19 10:19 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 3560 bytes --]

Hi,

A watchdog timeout panic occurred in e1000 driver (7.2.9-NAPI).
If e1000_watchdog is called when processing ioctl from ethtool, the system
could stop inside e1000_watchdog interrupt handler for about 16 seconds,
and the system panicked as a result of a watchdog timeout.

This problem only occurs on a server using ethernet controller inside
631xESB/632xESB, and NMI watchdog enabled.

Environment:
  OS     : RHEL4U3(x86_64)
  kernel : 2.6.9-34.ELsmp
  e1000  : 7.2.9-NAPI
  Ethernet controller : Intel Corporation 631x/632xESB DPT LAN Controller
                        Copper (rev 01)
  Watchdog timer should be enabled with a timeout period of less than 16
  seconds.

Steps to reproduce:
  Please apply the attached patch (ethtool.patch) to ethtool (VERSION 5) source
  code. Run make, and rename the freshly built ethtool as gsetloop.
  Put gsetloop and the attached shell script (gloop.sh) in the same directory,
  and execute gloop.sh. The problem should occur within about 5 minutes.

Cause:
  The problem occurs in the following steps.
   - ioctl is executed in ethtool.
      - e1000_read_phy_reg() is called from ioctl to read the value from phy
        register.
      - e1000_get_hw_eeprom_semaphore() is called from e1000_read_phy_reg() to
        acquire a semaphore.
      - E1000_SWSM_SWESMBI bit that is FW semaphore bit is set in
        e1000_get_hw_eeprom_semaphore().
      - When this bit was set, E1000_SWSM_SMBI bit that is driver's semaphore
        bit is also set.
   - e1000_watchdog() of interrupt handler is executed before the
     E1000_SWSM_SMBI bit is unset.
      - e1000_read_phy_reg() is called from e1000_watchdog() to read the value
        from phy register.
      - e1000_get_software_semaphore() is called from e1000_watchdog to confirm
        whether interruption handler can acquire a semaphore.
        This function confirms whether E1000_SWSM_SMBI bit is being set.
      - Therefore the process does loop for "hw->eeprom.word_size + 1" msec
        in e1000_get_software_semaphore().
        The value of "hw->eeprom.word_size + 1" was 16385 on my system.
        In other words it loops for 16.385 sec in
        e1000_get_software_semaphore().
        If NMI watchdog is enabled, the system will panic by NMI watchdog
        within this loop.

Fix:
  In kernels before 2.6.17, the e1000_watchdog() interrupt handler schedules
  e1000_watchdog_task(). The semaphore is acquired within this task, after
  ioctl processing for ethtool is finished, and this problem is avoided.

  e1000_watchdog_task() was remove by the following patch.

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2db10a081c5c1082d58809a1bcf1a6073f4db160
     e1000: rework driver hardware reset locking
     >After studying the driver mac reset code it was found that there
     >were multiple race conditions possible to reset the unit twice or
     >bring it e1000_up() double. This fixes all occurences where the
     >driver needs to reset the mac.
     >
     >We also remove irq requesting/releasing into _open and _close so
     >that while the device is _up we will never touch the irq's. This fixes
     >the double free irq bug that people saw.
     >
     >To make sure that the watchdog task doesn't cause another race we let
     >it run as a non-scheduled task.

  I'm not sure whether there was any reason to actively remove
  e1000_watchdog_task(). I think that removing e1000_watchdog_task() was a
  mistake, and it should be brought back in.
--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)



[-- Attachment #2: gloop.sh --]
[-- Type: text/plain, Size: 106 bytes --]

#!/bin/sh

n=64
i=1
[ $# -ge 1 ] && n=$1

while [ $i -le $n ]
do
	./gsetloop eth0 &
	i=`expr $i + 1`
done

[-- Attachment #3: ethtool.patch --]
[-- Type: text/plain, Size: 452 bytes --]

diff -urpN ethtool_git/ethtool.c my-ethtool/ethtool.c
--- ethtool_git/ethtool.c	2006-10-18 14:17:10.000000000 +0900
+++ my-ethtool/ethtool.c	2006-10-18 14:18:45.000000000 +0900
@@ -1576,7 +1576,9 @@ static int do_gset(int fd, struct ifreq 
 
 	ecmd.cmd = ETHTOOL_GSET;
 	ifr->ifr_data = (caddr_t)&ecmd;
-	err = ioctl(fd, SIOCETHTOOL, ifr);
+	for (;;) {
+		err = ioctl(fd, SIOCETHTOOL, ifr);
+	}
 	if (err == 0) {
 		err = dump_ecmd(&ecmd);
 		if (err)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-19 10:19 watchdog timeout panic in e1000 driver Kenzo Iwami
@ 2006-10-19 15:39 ` Auke Kok
       [not found]   ` <4538BFF2.2040207@cj.jp.nec.com>
  2007-02-20  9:26 ` Kenzo Iwami
  1 sibling, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-10-19 15:39 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Kenzo Iwami wrote:
> A watchdog timeout panic occurred in e1000 driver (7.2.9-NAPI).

where's the panic message ?

Please CC the maintainers of the driver at all times. Our e-mail addresses are widely 
visible everywhere.

> If e1000_watchdog is called when processing ioctl from ethtool, the system
> could stop inside e1000_watchdog interrupt handler for about 16 seconds
 >
> and the system panicked as a result of a watchdog timeout.
> 
> This problem only occurs on a server using ethernet controller inside
> 631xESB/632xESB, and NMI watchdog enabled.

why only this system? have you seen/tried it on other machines?

> Environment:
>   OS     : RHEL4U3(x86_64)
>   kernel : 2.6.9-34.ELsmp
>   e1000  : 7.2.9-NAPI
>   Ethernet controller : Intel Corporation 631x/632xESB DPT LAN Controller
>                         Copper (rev 01)
>   Watchdog timer should be enabled with a timeout period of less than 16
>   seconds.
> Steps to reproduce:
>   Please apply the attached patch (ethtool.patch) to ethtool (VERSION 5) source
>   code. Run make, and rename the freshly built ethtool as gsetloop.
>   Put gsetloop and the attached shell script (gloop.sh) in the same directory,
>   and execute gloop.sh. The problem should occur within about 5 minutes.
 >
> 
> Cause:
>   The problem occurs in the following steps.
>    - ioctl is executed in ethtool.
>       - e1000_read_phy_reg() is called from ioctl to read the value from phy
>         register.
>       - e1000_get_hw_eeprom_semaphore() is called from e1000_read_phy_reg() to
>         acquire a semaphore.
>       - E1000_SWSM_SWESMBI bit that is FW semaphore bit is set in
>         e1000_get_hw_eeprom_semaphore().
>       - When this bit was set, E1000_SWSM_SMBI bit that is driver's semaphore
>         bit is also set.
>    - e1000_watchdog() of interrupt handler is executed before the
>      E1000_SWSM_SMBI bit is unset.
>       - e1000_read_phy_reg() is called from e1000_watchdog() to read the value
>         from phy register.
>       - e1000_get_software_semaphore() is called from e1000_watchdog to confirm
>         whether interruption handler can acquire a semaphore.
>         This function confirms whether E1000_SWSM_SMBI bit is being set.
>       - Therefore the process does loop for "hw->eeprom.word_size + 1" msec
>         in e1000_get_software_semaphore().
>         The value of "hw->eeprom.word_size + 1" was 16385 on my system.
>         In other words it loops for 16.385 sec in
>         e1000_get_software_semaphore().
>         If NMI watchdog is enabled, the system will panic by NMI watchdog
>         within this loop.
> 
> Fix:
>   In kernels before 2.6.17, the e1000_watchdog() interrupt handler schedules
>   e1000_watchdog_task(). The semaphore is acquired within this task, after
>   ioctl processing for ethtool is finished, and this problem is avoided.
> 
>   e1000_watchdog_task() was remove by the following patch.
> 
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2db10a081c5c1082d58809a1bcf1a6073f4db160
>      e1000: rework driver hardware reset locking
>      >After studying the driver mac reset code it was found that there
>      >were multiple race conditions possible to reset the unit twice or
>      >bring it e1000_up() double. This fixes all occurences where the
>      >driver needs to reset the mac.
>      >
>      >We also remove irq requesting/releasing into _open and _close so
>      >that while the device is _up we will never touch the irq's. This fixes
>      >the double free irq bug that people saw.
>      >
>      >To make sure that the watchdog task doesn't cause another race we let
>      >it run as a non-scheduled task.
> 
>   I'm not sure whether there was any reason to actively remove
>   e1000_watchdog_task(). I think that removing e1000_watchdog_task() was a
>   mistake, and it should be brought back in.


Reverting this could would not be a fix, but only a workaround that leaves the problem 
still in the code, and as such not progress in the right direction.

I find this report extremely edgy, but I'll look into the fact that the driver attempts 
to sleep for 16384 + 1 msec, which seems overly long :)

As a side note, most other e1000 NIC's use hardcoded word_size numbers, but esb2 systems 
read it from a register/eeprom. Can you send me the output of `ethtool -e ethX` ? 
off-list is OK, it might be large.

Thanks,

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
       [not found]   ` <4538BFF2.2040207@cj.jp.nec.com>
@ 2006-10-20 15:51     ` Auke Kok
  2006-10-24  9:01       ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-10-20 15:51 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Kenzo Iwami wrote:
> Hi,
> 
> Thank you for your comment.
> 
>>> A watchdog timeout panic occurred in e1000 driver (7.2.9-NAPI).
>> where's the panic message ?
> 
> attached the panic message (e1000_panic).
> 
> [...]
>>> This problem only occurs on a server using ethernet controller inside
>>> 631xESB/632xESB, and NMI watchdog enabled.
>> why only this system? have you seen/tried it on other machines?
> 
> This problem is caused by e1000_get_software_semaphore() being called from
> within the interrupt handler, while the interrupted code is still holding
> this semaphore.  e1000_get_software_semaphore() is called from
> e1000_get_hw_eeprom_semaphore() only when hw->mac_type is e1000_80003es2lan.
> This condition is true only for MACs inside 631xESB/632xESB.
> 
> When this problem happens e1000_get_software_semaphore() will wait for
> 16 seconds (inside the interrupt handler) before it fails, thus causing
> the watchdog timeout.
> 
> I haven't actually tried it on other machines, but theoretically, it will
> only happen on MAC inside 631xESB/632xESB chip set.
> 
> [...]
>> Reverting this could would not be a fix, but only a workaround that leaves the problem 
>> still in the code, and as such not progress in the right direction.
>>
>> I find this report extremely edgy, but I'll look into the fact that the driver attempts 
>> to sleep for 16384 + 1 msec, which seems overly long :)
>>
>> As a side note, most other e1000 NIC's use hardcoded word_size numbers, but esb2 systems 
>> read it from a register/eeprom. Can you send me the output of `ethtool -e ethX` ? 
>> off-list is OK, it might be large.
> 
> attached is the output of "ethtool -e ethX" (eeprom_eth0).

thanks.

This panic report falls in the category "how hard can I break my system as root". 
Explicitly abusing the system performing restricted calls depletes resources and 
harasses the sw lock (in this case). The reason that the driver attempts to wait that 
long is that in the case of ESB2 systems, the SPI interface to the EEPROM can be slow, 
thus taking a long time to complete certain commands.

We're looking into making this theoretical lock time shorter in the mean time, thanks 
for reporting this.

Cheers,

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-20 15:51     ` Auke Kok
@ 2006-10-24  9:01       ` Kenzo Iwami
  2006-10-24 16:15         ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-10-24  9:01 UTC (permalink / raw)
  To: Auke Kok; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Hi,

Thank you for your comment.

> This panic report falls in the category "how hard can I break my system as root". 
> Explicitly abusing the system performing restricted calls depletes resources and 
> harasses the sw lock (in this case). The reason that the driver attempts to wait that 
> long is that in the case of ESB2 systems, the SPI interface to the EEPROM can be slow, 
> thus taking a long time to complete certain commands.

This problem originally occurred in a very large cluster system using snmp
for server management. About two servers panicked each day. The program I sent
is to reproduce this problem in a very short time. It does occur under normal
load when there is a lot of servers.

-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-24  9:01       ` Kenzo Iwami
@ 2006-10-24 16:15         ` Auke Kok
  2006-10-25 13:41           ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-10-24 16:15 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Auke Kok, netdev, Jesse Brandeburg, Ronciak, John

Kenzo Iwami wrote:
> Hi,
> 
> Thank you for your comment.
> 
>> This panic report falls in the category "how hard can I break my system as root". 
>> Explicitly abusing the system performing restricted calls depletes resources and 
>> harasses the sw lock (in this case). The reason that the driver attempts to wait that 
>> long is that in the case of ESB2 systems, the SPI interface to the EEPROM can be slow, 
>> thus taking a long time to complete certain commands.
> 
> This problem originally occurred in a very large cluster system using snmp
> for server management. About two servers panicked each day. The program I sent
> is to reproduce this problem in a very short time. It does occur under normal
> load when there is a lot of servers.

hmm, not good - does your snmp daemon use ethtool excessively? That would certainly be 
painful to the driver (any driver!).

Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
reasonable time. This will unfortunately take some time, as we need to change some major 
components in the driver to make sure this doesn't happen.

Cheers,

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-24 16:15         ` Auke Kok
@ 2006-10-25 13:41           ` Kenzo Iwami
  2006-10-25 15:09             ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-10-25 13:41 UTC (permalink / raw)
  To: Auke Kok; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Hi,

>> This problem originally occurred in a very large cluster system using snmp
>> for server management. About two servers panicked each day. The program I sent
>> is to reproduce this problem in a very short time. It does occur under normal
>> load when there is a lot of servers.
> 
> hmm, not good - does your snmp daemon use ethtool excessively? That would certainly be 
> painful to the driver (any driver!).

I only looked at the panic message after this problem occurred.
I could tell that the snmp daemon caused the panic while trying to process
the ethtool's ioctl, but I don't know how often this was called.
However, it shouldn't be excessively called because it occurred on a production
system while it was idle.

> Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
> reasonable time. This will unfortunately take some time, as we need to change some major 
> components in the driver to make sure this doesn't happen.

How about the following approach?
If acquiring semaphore fails inside the interrupt handler, acquiring semaphore
is abandoned immediately without waiting for timeout.
However, I don't know whether this method affects other processes.

-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-25 13:41           ` Kenzo Iwami
@ 2006-10-25 15:09             ` Auke Kok
  2006-10-26 10:35               ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-10-25 15:09 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Kenzo Iwami wrote:
> Hi,
> 
>>> This problem originally occurred in a very large cluster system using snmp
>>> for server management. About two servers panicked each day. The program I sent
>>> is to reproduce this problem in a very short time. It does occur under normal
>>> load when there is a lot of servers.
>> hmm, not good - does your snmp daemon use ethtool excessively? That would certainly be 
>> painful to the driver (any driver!).
> 
> I only looked at the panic message after this problem occurred.
> I could tell that the snmp daemon caused the panic while trying to process
> the ethtool's ioctl, but I don't know how often this was called.
> However, it shouldn't be excessively called because it occurred on a production
> system while it was idle.
> 
>> Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
>> reasonable time. This will unfortunately take some time, as we need to change some major 
>> components in the driver to make sure this doesn't happen.
> 
> How about the following approach?
> If acquiring semaphore fails inside the interrupt handler, acquiring semaphore
> is abandoned immediately without waiting for timeout.
> However, I don't know whether this method affects other processes.

with the current hardware being accessed simultaneously from several users in the 
kernel, that would lead to large problems - the watchdog task accesses it every 2 
seconds as it reads the PHY link status, so when one of those fails the driver would 
have no choice but to reset the entire device.

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-25 15:09             ` Auke Kok
@ 2006-10-26 10:35               ` Kenzo Iwami
  2006-10-26 14:34                 ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-10-26 10:35 UTC (permalink / raw)
  To: Auke Kok; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Hi,

Thank you for your comment.

>>> Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
>>> reasonable time. This will unfortunately take some time, as we need to change some major 
>>> components in the driver to make sure this doesn't happen.
>>
>> How about the following approach?
>> If acquiring semaphore fails inside the interrupt handler, acquiring semaphore
>> is abandoned immediately without waiting for timeout.
>> However, I don't know whether this method affects other processes.
> 
> with the current hardware being accessed simultaneously from several users in the 
> kernel, that would lead to large problems - the watchdog task accesses it every 2 
> seconds as it reads the PHY link status, so when one of those fails the driver would 
> have no choice but to reset the entire device.

This problem occurs because interrupt handler is executed while the
interrupted code is still holding the semaphore. Acquiring the semaphore
fails regardless of the timeout period.

I think the watchdog task will fail trying to read the PHY link status,
even if the lock timeout period has been reduced.

-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-26 10:35               ` Kenzo Iwami
@ 2006-10-26 14:34                 ` Auke Kok
  2006-10-30 11:36                   ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-10-26 14:34 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Kenzo Iwami wrote:
> Hi,
> 
> Thank you for your comment.
> 
>>>> Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
>>>> reasonable time. This will unfortunately take some time, as we need to change some major 
>>>> components in the driver to make sure this doesn't happen.
>>> How about the following approach?
>>> If acquiring semaphore fails inside the interrupt handler, acquiring semaphore
>>> is abandoned immediately without waiting for timeout.
>>> However, I don't know whether this method affects other processes.
>> with the current hardware being accessed simultaneously from several users in the 
>> kernel, that would lead to large problems - the watchdog task accesses it every 2 
>> seconds as it reads the PHY link status, so when one of those fails the driver would 
>> have no choice but to reset the entire device.
> 
> This problem occurs because interrupt handler is executed while the
> interrupted code is still holding the semaphore. Acquiring the semaphore
> fails regardless of the timeout period.
> 
> I think the watchdog task will fail trying to read the PHY link status,
> even if the lock timeout period has been reduced.

correct, we're not looking into reducing the lock timeout but towards reducing the total 
lock time. Once we have reduced that to something acceptable, we can reduce the timout 
accordingly.

Cheers,

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-26 14:34                 ` Auke Kok
@ 2006-10-30 11:36                   ` Kenzo Iwami
  2006-10-30 17:30                     ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-10-30 11:36 UTC (permalink / raw)
  To: Auke Kok; +Cc: netdev, Jesse Brandeburg, Ronciak, John

Hi,

Thank you for your comment.

>>>>> Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
>>>>> reasonable time. This will unfortunately take some time, as we need to change some major 
>>>>> components in the driver to make sure this doesn't happen.
>>>> How about the following approach?
>>>> If acquiring semaphore fails inside the interrupt handler, acquiring semaphore
>>>> is abandoned immediately without waiting for timeout.
>>>> However, I don't know whether this method affects other processes.
>>> with the current hardware being accessed simultaneously from several users in the 
>>> kernel, that would lead to large problems - the watchdog task accesses it every 2 
>>> seconds as it reads the PHY link status, so when one of those fails the driver would 
>>> have no choice but to reset the entire device.
>> This problem occurs because interrupt handler is executed while the
>> interrupted code is still holding the semaphore. Acquiring the semaphore
>> fails regardless of the timeout period.
>>
>> I think the watchdog task will fail trying to read the PHY link status,
>> even if the lock timeout period has been reduced.
> 
> correct, we're not looking into reducing the lock timeout but towards reducing the total 
> lock time. Once we have reduced that to something acceptable, we can reduce the timout 
> accordingly.

Even if the total lock time can be reduced, it's possible that interrupt
handler is executed while the interrupted code is still holding the semaphore.
I think your method only decrease the frequency of this problem.
Why does reducing the lock time solve this problem?

-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-30 11:36                   ` Kenzo Iwami
@ 2006-10-30 17:30                     ` Auke Kok
  2006-10-31  3:22                       ` Shaw Vrana
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-10-30 17:30 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Auke Kok, netdev, Jesse Brandeburg, Ronciak, John

Kenzo Iwami wrote:
> Hi,
> 
> Thank you for your comment.
> 
>>>>>> Anyway as I said in the same e-mail, we're working on reducing the lock timeout to a 
>>>>>> reasonable time. This will unfortunately take some time, as we need to change some major 
>>>>>> components in the driver to make sure this doesn't happen.
>>>>> How about the following approach?
 >>>>>
>>>>> If acquiring semaphore fails inside the interrupt handler, acquiring semaphore
>>>>> is abandoned immediately without waiting for timeout.
>>>>> However, I don't know whether this method affects other processes.
 >>>>
>>>> with the current hardware being accessed simultaneously from several users in the 
>>>> kernel, that would lead to large problems - the watchdog task accesses it every 2 
>>>> seconds as it reads the PHY link status, so when one of those fails the driver would 
>>>> have no choice but to reset the entire device.
 >>>
>>> This problem occurs because interrupt handler is executed while the
>>> interrupted code is still holding the semaphore. Acquiring the semaphore
>>> fails regardless of the timeout period.
>>>
>>> I think the watchdog task will fail trying to read the PHY link status,
>>> even if the lock timeout period has been reduced.
 >>
>> correct, we're not looking into reducing the lock timeout but towards reducing the total 
>> lock time. Once we have reduced that to something acceptable, we can reduce the timout 
>> accordingly.
> 
> Even if the total lock time can be reduced, it's possible that interrupt
> handler is executed while the interrupted code is still holding the semaphore.
> I think your method only decrease the frequency of this problem.
> Why does reducing the lock time solve this problem?

there are several problems here that need addressing. It's not acceptable for our driver 
to wait up to 15 seconds, and we can (presumably) reduce it to milliseconds, so that 
would help a lot. We should in no case at all hold it for any period longer than (give 
or take) half a second, so working towards that is a very good step in the right direction.

Adding the timer task back may also help, as we are no longer trying to aqcuire the 
sw_fw_semaphore in interrupt context, but we removed it for a reason, and I need to dig 
up what reason this exactly was before we can revert it. Jesse might know, so I'll talk 
to him. But this will not fix the fact that the semaphore is held for a long time :)

so, this needs two fixes. much to do.

Cheers,

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-30 17:30                     ` Auke Kok
@ 2006-10-31  3:22                       ` Shaw Vrana
  2006-11-01 13:21                         ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Shaw Vrana @ 2006-10-31  3:22 UTC (permalink / raw)
  To: Auke Kok; +Cc: Kenzo Iwami, netdev, Jesse Brandeburg, Ronciak, John

On Mon, Oct 30, 2006 at 09:30:24AM -0800, Auke Kok wrote:
> >Even if the total lock time can be reduced, it's possible that interrupt
> >handler is executed while the interrupted code is still holding the 
> >semaphore.
> >I think your method only decrease the frequency of this problem.
> >Why does reducing the lock time solve this problem?
> 
> there are several problems here that need addressing. It's not acceptable 
> for our driver to wait up to 15 seconds, and we can (presumably) reduce it 
> to milliseconds, so that would help a lot. We should in no case at all hold 
> it for any period longer than (give or take) half a second, so working 
> towards that is a very good step in the right direction.
> 
> Adding the timer task back may also help, as we are no longer trying to 
> aqcuire the sw_fw_semaphore in interrupt context, but we removed it for a 
> reason, and I need to dig up what reason this exactly was before we can 
> revert it. Jesse might know, so I'll talk to him. But this will not fix the 
> fact that the semaphore is held for a long time :)


Timer tasks that reschedule themselves are a pain.  The watchdog timer task
had a couple of race conditions that were thought to be better fixed by
removing it all together.  Please, let's not go down that road again!

Check out what you have to say about it, Auke. ;)
http://www.spinics.net/lists/netdev/msg03656.html


Shaw

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-31  3:22                       ` Shaw Vrana
@ 2006-11-01 13:21                         ` Kenzo Iwami
  2006-11-15 10:33                           ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-11-01 13:21 UTC (permalink / raw)
  To: Shaw Vrana; +Cc: Auke Kok, netdev, Jesse Brandeburg, Ronciak, John

Hi,

>>> Even if the total lock time can be reduced, it's possible that interrupt
>>> handler is executed while the interrupted code is still holding the 
>>> semaphore.
>>> I think your method only decrease the frequency of this problem.
>>> Why does reducing the lock time solve this problem?
>> there are several problems here that need addressing. It's not acceptable 
>> for our driver to wait up to 15 seconds, and we can (presumably) reduce it 
>> to milliseconds, so that would help a lot. We should in no case at all hold 
>> it for any period longer than (give or take) half a second, so working 
>> towards that is a very good step in the right direction.
>>
>> Adding the timer task back may also help, as we are no longer trying to 
>> aqcuire the sw_fw_semaphore in interrupt context, but we removed it for a 
>> reason, and I need to dig up what reason this exactly was before we can 
>> revert it. Jesse might know, so I'll talk to him. But this will not fix the 
>> fact that the semaphore is held for a long time :)
> 
> Timer tasks that reschedule themselves are a pain.  The watchdog timer task
> had a couple of race conditions that were thought to be better fixed by
> removing it all together.  Please, let's not go down that road again!

I understand that the watchdog_task could cause a race when the timer task
and e1000_down runs concurrently, resulting in memory double free.

I think this problem occurs because interrupt handler is executed in same
CPU as process that acquires semaphore.
How about disabling interrupt while the process is holding the semaphore?
I think this is possible, if the total lock time has been reduced.

-- 
  Kenzi Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-11-01 13:21                         ` Kenzo Iwami
@ 2006-11-15 10:33                           ` Kenzo Iwami
  2006-11-15 16:11                             ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-11-15 10:33 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Shaw Vrana, Auke Kok, netdev, Jesse Brandeburg, Ronciak, John

Hi,

>>>> Even if the total lock time can be reduced, it's possible that interrupt
>>>> handler is executed while the interrupted code is still holding the 
>>>> semaphore.
>>>> I think your method only decrease the frequency of this problem.
>>>> Why does reducing the lock time solve this problem?
>>> there are several problems here that need addressing. It's not acceptable 
>>> for our driver to wait up to 15 seconds, and we can (presumably) reduce it 
>>> to milliseconds, so that would help a lot. We should in no case at all hold 
>>> it for any period longer than (give or take) half a second, so working 
>>> towards that is a very good step in the right direction.
>>>
>>> Adding the timer task back may also help, as we are no longer trying to 
>>> aqcuire the sw_fw_semaphore in interrupt context, but we removed it for a 
>>> reason, and I need to dig up what reason this exactly was before we can 
>>> revert it. Jesse might know, so I'll talk to him. But this will not fix the 
>>> fact that the semaphore is held for a long time :)
[...]
> I think this problem occurs because interrupt handler is executed in same
> CPU as process that acquires semaphore.
> How about disabling interrupt while the process is holding the semaphore?
> I think this is possible, if the total lock time has been reduced.

I created the attached patch based on the method described above.
This patch disables interrupt while the process is holding the semaphore.

I measured how long interrupts are being disabled 10,000 times using the
following method. TSC was read by rdtscll when interrupt was disabled
and interrupt was enabled again, then I subtract these two value.

The longest period interrupt was disabled is under 10usec, which seems
acceptable. The evaluation environment is;
  CPU    : Intel Xeon CPU 3.73GHz
  kernel : 2.6.19-rc5

I also ran the reproduction TP I sent previously and confirmed that the
system didn't panic.
  http://marc.theaimsgroup.com/?l=linux-netdev&m=116125332319850&w=2

How about this method?

I welcome any comments.
-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

Signed-off-by: Kenzo Iwami <k-iwami@cj.jp.nec.com>

diff -urpN linux-2.6.19-rc5_org/drivers/net/e1000/e1000_hw.c
linux-2.6.19-rc5/drivers/net/e1000/e1000_hw.c
--- linux-2.6.19-rc5_org/drivers/net/e1000/e1000_hw.c	2006-11-14
17:47:28.000000000 +0900
+++ linux-2.6.19-rc5/drivers/net/e1000/e1000_hw.c	2006-11-15 17:49:39.000000000
+0900
@@ -3379,6 +3379,7 @@ e1000_swfw_sync_acquire(struct e1000_hw
     uint32_t swmask = mask;
     uint32_t fwmask = mask << 16;
     int32_t timeout = 200;
+    unsigned long flags;

     DEBUGFUNC("e1000_swfw_sync_acquire");

@@ -3389,8 +3390,11 @@ e1000_swfw_sync_acquire(struct e1000_hw
         return e1000_get_hw_eeprom_semaphore(hw);

     while (timeout) {
-            if (e1000_get_hw_eeprom_semaphore(hw))
+            local_irq_save(flags);
+            if (e1000_get_hw_eeprom_semaphore(hw)) {
+                local_irq_restore(flags);
                 return -E1000_ERR_SWFW_SYNC;
+            }

             swfw_sync = E1000_READ_REG(hw, SW_FW_SYNC);
             if (!(swfw_sync & (fwmask | swmask))) {
@@ -3400,6 +3404,7 @@ e1000_swfw_sync_acquire(struct e1000_hw
             /* firmware currently using resource (fwmask) */
             /* or other software thread currently using resource (swmask) */
             e1000_put_hw_eeprom_semaphore(hw);
+            local_irq_restore(flags);
             mdelay(5);
             timeout--;
     }
@@ -3413,6 +3418,7 @@ e1000_swfw_sync_acquire(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    local_irq_restore(flags);
     return E1000_SUCCESS;
 }

@@ -3421,6 +3427,7 @@ e1000_swfw_sync_release(struct e1000_hw
 {
     uint32_t swfw_sync;
     uint32_t swmask = mask;
+    unsigned long flags;

     DEBUGFUNC("e1000_swfw_sync_release");

@@ -3434,6 +3441,7 @@ e1000_swfw_sync_release(struct e1000_hw
         return;
     }

+    local_irq_save(flags);
     /* if (e1000_get_hw_eeprom_semaphore(hw))
      *    return -E1000_ERR_SWFW_SYNC; */
     while (e1000_get_hw_eeprom_semaphore(hw) != E1000_SUCCESS);
@@ -3444,6 +3452,7 @@ e1000_swfw_sync_release(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    local_irq_restore(flags);
 }

 /*****************************************************************************

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-11-15 10:33                           ` Kenzo Iwami
@ 2006-11-15 16:11                             ` Auke Kok
  2006-11-16  9:23                               ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-11-15 16:11 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Shaw Vrana, netdev, Jesse Brandeburg, Ronciak, John

[-- Attachment #1: Type: text/plain, Size: 2835 bytes --]

Kenzo Iwami wrote:
> Hi,
> 
>>>>> Even if the total lock time can be reduced, it's possible that interrupt
>>>>> handler is executed while the interrupted code is still holding the 
>>>>> semaphore.
>>>>> I think your method only decrease the frequency of this problem.
>>>>> Why does reducing the lock time solve this problem?
>>>> there are several problems here that need addressing. It's not acceptable 
>>>> for our driver to wait up to 15 seconds, and we can (presumably) reduce it 
>>>> to milliseconds, so that would help a lot. We should in no case at all hold 
>>>> it for any period longer than (give or take) half a second, so working 
>>>> towards that is a very good step in the right direction.
>>>>
>>>> Adding the timer task back may also help, as we are no longer trying to 
>>>> aqcuire the sw_fw_semaphore in interrupt context, but we removed it for a 
>>>> reason, and I need to dig up what reason this exactly was before we can 
>>>> revert it. Jesse might know, so I'll talk to him. But this will not fix the 
>>>> fact that the semaphore is held for a long time :)
> [...]
>> I think this problem occurs because interrupt handler is executed in same
>> CPU as process that acquires semaphore.
>> How about disabling interrupt while the process is holding the semaphore?
>> I think this is possible, if the total lock time has been reduced.
> 
> I created the attached patch based on the method described above.
> This patch disables interrupt while the process is holding the semaphore.
> 
> I measured how long interrupts are being disabled 10,000 times using the
> following method. TSC was read by rdtscll when interrupt was disabled
> and interrupt was enabled again, then I subtract these two value.
> 
> The longest period interrupt was disabled is under 10usec, which seems
> acceptable. The evaluation environment is;
>   CPU    : Intel Xeon CPU 3.73GHz
>   kernel : 2.6.19-rc5
> 
> I also ran the reproduction TP I sent previously and confirmed that the
> system didn't panic.
>   http://marc.theaimsgroup.com/?l=linux-netdev&m=116125332319850&w=2
> 
> How about this method?

I'm not sure why you would have to disable interrupts when freeing the semaphore, but 
more importantly I don't want to introduce irq code in the swfw handling functions.

Since the major (only) user running this piece of code in intterupt context is the 
watchdog, we might as well see if we can only disable interrupts for that code path, 
which would only be once per 2 seconds. We don't need to protect the ethtool path into 
this code as it doesn't run in irq context.

Would you mind giving attached patch a try and let me know if it works for you? It will 
disable irqs for a bit longer time than your patch, and it begs for a special 
check-link-in-watchdog function that doesn't take so damn long :(

Thanks,

Auke


[-- Attachment #2: e1000_git_disable_irqs_in_watchdog_link_status_check.patch --]
[-- Type: text/x-patch, Size: 673 bytes --]


Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index a2f1464..0b52ded 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2430,8 +2430,12 @@ e1000_watchdog(unsigned long data)
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
 	int32_t ret_val;
+	unsigned long flags;
 
+	local_irq_save(flags);
 	ret_val = e1000_check_for_link(&adapter->hw);
+	local_irq_restore(flags);
+
 	if ((ret_val == E1000_ERR_PHY) &&
 	    (adapter->hw.phy_type == e1000_phy_igp_3) &&
 	    (E1000_READ_REG(&adapter->hw, CTRL) & E1000_PHY_CTRL_GBE_DISABLE)) {

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-11-15 16:11                             ` Auke Kok
@ 2006-11-16  9:23                               ` Kenzo Iwami
  0 siblings, 0 replies; 29+ messages in thread
From: Kenzo Iwami @ 2006-11-16  9:23 UTC (permalink / raw)
  To: Auke Kok; +Cc: Shaw Vrana, netdev, Jesse Brandeburg, Ronciak, John

Hi,

Thank you for your comment.

>>> I think this problem occurs because interrupt handler is executed in same
>>> CPU as process that acquires semaphore.
>>> How about disabling interrupt while the process is holding the semaphore?
>>> I think this is possible, if the total lock time has been reduced.
>> I created the attached patch based on the method described above.
>> This patch disables interrupt while the process is holding the semaphore.
[...]
> I'm not sure why you would have to disable interrupts when freeing the semaphore, but 
> more importantly I don't want to introduce irq code in the swfw handling functions.
> 
> Since the major (only) user running this piece of code in intterupt context is the 
> watchdog, we might as well see if we can only disable interrupts for that code path, 
> which would only be once per 2 seconds. We don't need to protect the ethtool path into 
> this code as it doesn't run in irq context.
> 
> Would you mind giving attached patch a try and let me know if it works for you? It will 
> disable irqs for a bit longer time than your patch, and it begs for a special 
> check-link-in-watchdog function that doesn't take so damn long :(

I tried your patch, but the system panicked with the same symptom.

This problem occurs because e1000_watchdog is called from the interrupt
handler while ethtool processing is holding the semaphore.

  ethtool processing holding semaphore
    INTERRUPT
      e1000_watchdog waits for semaphore to be released

The semaphore e1000_watchdog is waiting for can only be released when
ethtool resumes from interrupt after e1000_watchdog finishes (basically
a deadlock)

In order to solve this problem, interrupts has to be disabled on the
interrupted side (during ethtool processing) and not during
e1000_watchdog within the interrupt handler.

e1000_get_hw_eeprom_semaphore is called from both the interrupt level
and the normal level, and needs to be protected by irq code. The reason
the patch disables interrupts when freeing the semaphore is because
e1000_swfw_sync_release also calls e1000_get_hw_eeprom_semaphore.

--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-10-19 10:19 watchdog timeout panic in e1000 driver Kenzo Iwami
  2006-10-19 15:39 ` Auke Kok
@ 2007-02-20  9:26 ` Kenzo Iwami
  2007-02-20 16:10   ` Auke Kok
  1 sibling, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2007-02-20  9:26 UTC (permalink / raw)
  To: Auke Kok, Jesse Brandeburg; +Cc: netdev, Ronciak, John

Hi,

I created a patch that uses watchdog_task but fixes the race condition
that occurred in old the e1000 driver.

I've obtained information about the panic caused by the old e1000 driver
using e1000_watchdog_task. According to the crash dump, the panic was
caused by a timer_list whose contents were NULLs. Further trace
information revealed that the function in the timer list was
e1000_watchdog().

This function is registered in timer_list during e1000_watchdog_task.
It seems that e1000_watchdog_task could be called after the adapter is
removed, and freed memory is registered to timer_list.

By looking at the source code, e1000_watchdog_task will be scheduled if
e1000_watchdog is invoked during e1000_remove, after flush_scheduled_work()
is called, but before del_timer_sync() is called in e1000_down().

The attached patch adds back the e1000_watchdog_task(), but it will
prevent the old race condition from happening by deleting e1000_watchdog
from timer_list before flush_scheduled_work() is called.

Please incorporate this patch to e1000 source code.
--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

Signed-off-by: Kenzo Iwami <k-iwami@cj.jp.nec.com>

diff -urpN linux-2.6.20_org/drivers/net/e1000/e1000.h linux-2.6.20_fix/drivers/net/e1000/e1000.h
--- linux-2.6.20_org/drivers/net/e1000/e1000.h	2007-02-05 03:44:54.000000000 +0900
+++ linux-2.6.20_fix/drivers/net/e1000/e1000.h	2007-02-19 09:52:22.000000000 +0900
@@ -268,6 +268,7 @@ struct e1000_adapter {
 	uint16_t tx_itr;
 	uint16_t rx_itr;

+	struct work_struct watchdog_task;
 	struct work_struct reset_task;
 	uint8_t fc_autoneg;

@@ -355,6 +356,8 @@ struct e1000_adapter {
 	boolean_t quad_port_a;
 	unsigned long flags;
 	uint32_t eeprom_wol;
+	/* Indicates that the adapter is being removed if remove_state==1.*/
+	int remove_state;
 };

 enum e1000_state_t {
diff -urpN linux-2.6.20_org/drivers/net/e1000/e1000_main.c linux-2.6.20_fix/drivers/net/e1000/e1000_main.c
--- linux-2.6.20_org/drivers/net/e1000/e1000_main.c	2007-02-05 03:44:54.000000000 +0900
+++ linux-2.6.20_fix/drivers/net/e1000/e1000_main.c	2007-02-20 15:14:40.000000000 +0900
@@ -152,6 +152,7 @@ static void e1000_clean_rx_ring(struct e
 static void e1000_set_multi(struct net_device *netdev);
 static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
+static void e1000_watchdog_task(struct work_struct *work);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
@@ -1044,10 +1045,14 @@ e1000_probe(struct pci_dev *pdev,
 	adapter->tx_fifo_stall_timer.function = &e1000_82547_tx_fifo_stall;
 	adapter->tx_fifo_stall_timer.data = (unsigned long) adapter;

+	adapter->remove_state = 0;
+
 	init_timer(&adapter->watchdog_timer);
 	adapter->watchdog_timer.function = &e1000_watchdog;
 	adapter->watchdog_timer.data = (unsigned long) adapter;

+	INIT_WORK(&adapter->watchdog_task, e1000_watchdog_task);
+
 	init_timer(&adapter->phy_info_timer);
 	adapter->phy_info_timer.function = &e1000_update_phy_info;
 	adapter->phy_info_timer.data = (unsigned long) adapter;
@@ -1219,7 +1224,15 @@ e1000_remove(struct pci_dev *pdev)
 #ifdef CONFIG_E1000_NAPI
 	int i;
 #endif
-
+	/* If e1000_watchdog runs between flush_scheduled_work and
+	 * del_timer_sync(in e1000_down), e1000_watchdog_task is
+	 * called after the adapter is removed. This will cause
+	 * freed memory to be registered to timer list and
+	 * result in system panic. Delete e1000_watchdog from
+	 * timer list to prevent this from happening.
+	 */
+	adapter->remove_state = 1;
+	del_timer_sync(&adapter->watchdog_timer);
 	flush_scheduled_work();

 	e1000_release_manageability(adapter);
@@ -2555,6 +2568,16 @@ static void
 e1000_watchdog(unsigned long data)
 {
 	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
+
+	/* Do the rest outside of interrupt context */
+	schedule_work(&adapter->watchdog_task);
+}
+
+static void
+e1000_watchdog_task(struct work_struct *work)
+{
+	struct e1000_adapter *adapter =
+		container_of(work, struct e1000_adapter, watchdog_task);
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
@@ -2724,7 +2747,8 @@ e1000_watchdog(unsigned long data)
 		e1000_rar_set(&adapter->hw, adapter->hw.mac_addr, 0);

 	/* Reset the timer */
-	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
+	if (!adapter->remove_state)
+		mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
 }

 enum latency_range {

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2007-02-20  9:26 ` Kenzo Iwami
@ 2007-02-20 16:10   ` Auke Kok
  2007-02-21  5:17     ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2007-02-20 16:10 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Jesse Brandeburg, netdev, Ronciak, John

[-- Attachment #1: Type: text/plain, Size: 1541 bytes --]

Kenzo Iwami wrote:
> Hi,
> 
> I created a patch that uses watchdog_task but fixes the race condition
> that occurred in old the e1000 driver.
> 
> I've obtained information about the panic caused by the old e1000 driver
> using e1000_watchdog_task. According to the crash dump, the panic was
> caused by a timer_list whose contents were NULLs. Further trace
> information revealed that the function in the timer list was
> e1000_watchdog().
> 
> This function is registered in timer_list during e1000_watchdog_task.
> It seems that e1000_watchdog_task could be called after the adapter is
> removed, and freed memory is registered to timer_list.
> 
> By looking at the source code, e1000_watchdog_task will be scheduled if
> e1000_watchdog is invoked during e1000_remove, after flush_scheduled_work()
> is called, but before del_timer_sync() is called in e1000_down().
> 
> The attached patch adds back the e1000_watchdog_task(), but it will
> prevent the old race condition from happening by deleting e1000_watchdog
> from timer_list before flush_scheduled_work() is called.


Kenzo,

this looks a lot better than the previous patch!! However, we already have a 
state marker for _down_ that we should probably reuse. Can you try the attached 
patch and see if it works for you? It's basically your patch without the added 
remove flag and instead using the already available atomic state trackers.

If this works for you then that is great news and I'll push this patch to the 
upstream kernel maintainers after testing.

Cheers,

Auke




[-- Attachment #2: e1000_git_kenzo_use_state_flags.patch --]
[-- Type: text/x-patch, Size: 2555 bytes --]

diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 689f158..bd4026d 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -264,6 +264,7 @@ struct e1000_adapter {
 	uint16_t rx_itr;
 
 	struct work_struct reset_task;
+	struct work_struct watchdog_task;
 	uint8_t fc_autoneg;
 
 	struct timer_list blink_timer;
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 619c892..0548e65 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -152,6 +152,7 @@ static void e1000_clean_rx_ring(struct e1000_adapter *adapter,
 static void e1000_set_multi(struct net_device *netdev);
 static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
+static void e1000_watchdog_task(struct work_struct *work);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
@@ -1049,6 +1050,7 @@ e1000_probe(struct pci_dev *pdev,
 	adapter->phy_info_timer.data = (unsigned long) adapter;
 
 	INIT_WORK(&adapter->reset_task, e1000_reset_task);
+	INIT_WORK(&adapter->watchdog_task, e1000_watchdog_task);
 
 	e1000_check_options(adapter);
 
@@ -1216,6 +1218,11 @@ e1000_remove(struct pci_dev *pdev)
 	int i;
 #endif
 
+	/* flush_scheduled work may reschedule our watchdog task, so
+	 * explicitly disable watchdog tasks from being rescheduled  */
+	set_bit(__E1000_DOWN, &adapter->flags);
+	del_timer_sync(&adapter->watchdog_timer);
+
 	flush_scheduled_work();
 
 	e1000_release_manageability(adapter);
@@ -2551,6 +2558,17 @@ static void
 e1000_watchdog(unsigned long data)
 {
 	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
+
+	/* Do the rest outside of interrupt context */
+	schedule_work(&adapter->watchdog_task);
+}
+
+static void
+e1000_watchdog_task(struct work_struct *work)
+{
+	struct e1000_adapter *adapter = container_of(work,
+	                                struct e1000_adapter, watchdog_task);
+
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
@@ -2721,7 +2739,8 @@ e1000_watchdog(unsigned long data)
 		e1000_rar_set(&adapter->hw, adapter->hw.mac_addr, 0);
 
 	/* Reset the timer */
-	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
+	if (!test_bit(__E1000_DOWN, &adapter->flags))
+		mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
 }
 
 enum latency_range {

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2007-02-20 16:10   ` Auke Kok
@ 2007-02-21  5:17     ` Kenzo Iwami
  0 siblings, 0 replies; 29+ messages in thread
From: Kenzo Iwami @ 2007-02-21  5:17 UTC (permalink / raw)
  To: Auke Kok; +Cc: Jesse Brandeburg, netdev, Ronciak, John

Hi,

Thank you for your comment.

> this looks a lot better than the previous patch!! However, we already have a 
> state marker for _down_ that we should probably reuse. Can you try the attached 
> patch and see if it works for you? It's basically your patch without the added 
> remove flag and instead using the already available atomic state trackers.
> 
> If this works for you then that is great news and I'll push this patch to the 
> upstream kernel maintainers after testing.

I tested your patch and confirmed that the system didn't panic.

Please push the patch to the upstream kernel maintainers.
--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2007-01-16  8:42               ` Kenzo Iwami
@ 2007-01-18  9:22                 ` Kenzo Iwami
  0 siblings, 0 replies; 29+ messages in thread
From: Kenzo Iwami @ 2007-01-18  9:22 UTC (permalink / raw)
  To: Auke Kok; +Cc: Jesse Brandeburg, Ronciak, John, Shaw Vrana, netdev

Hi,

> My patch may seem like a huge change, but in essence the change is
> pretty simple.
> 
> In my patch, the interrupt handler code will check whether the interrupted
> code is holding the swfw semaphore. If it is held, the watchdog function
> is deferred until swfw semaphore is released.
> The modification is for the interrupted code which is holding the
> semaphore, and the interrupt handler, so they are both directly related
> to this problem.
> 
> I will try to add some comments to my code to make it more readable.

I have rebased this patch for 2.6.20-rc5 and added some comments to this
patch to make it more readable.

Does this patch have problems? Or do you have any other ideas?

--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

Signed-off-by: Kenzo Iwami <k-iwami@cj.jp.nec.com>

diff -urpN linux-2.6.20-rc5_org/drivers/net/e1000/e1000_hw.c linux-2.6.20-rc5_fix/drivers/net/e1000/e1000_hw.c
--- linux-2.6.20-rc5_org/drivers/net/e1000/e1000_hw.c	2007-01-13 03:54:26.000000000 +0900
+++ linux-2.6.20-rc5_fix/drivers/net/e1000/e1000_hw.c	2007-01-18 18:04:04.000000000 +0900
@@ -35,6 +35,7 @@

 static int32_t e1000_swfw_sync_acquire(struct e1000_hw *hw, uint16_t mask);
 static void e1000_swfw_sync_release(struct e1000_hw *hw, uint16_t mask);
+static void e1000_check_watchdog_deferred(struct e1000_hw *hw);
 static int32_t e1000_read_kmrn_reg(struct e1000_hw *hw, uint32_t reg_addr, uint16_t *data);
 static int32_t e1000_write_kmrn_reg(struct e1000_hw *hw, uint32_t reg_addr, uint16_t data);
 static int32_t e1000_get_software_semaphore(struct e1000_hw *hw);
@@ -3396,6 +3397,29 @@ e1000_shift_in_mdi_bits(struct e1000_hw
     return data;
 }

+static void
+e1000_check_watchdog_deferred(struct e1000_hw *hw)
+{
+    /* If watchdog interrupt was deferred while this process was holding
+     * the swfw semaphore, process the watchdog now.
+     */
+    if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry:
+        e1000_do_watchdog(hw);
+    }
+    atomic_dec(&hw->swfw_sem_count);
+
+    /* Check watchdog_deferred once more just in case watchdog interrupt
+     * occurred between atomic_xchg and atomic_dec.
+     * By nature of the watchdog interrupt, we shouldn't go through this
+     * more than once.
+     */
+    if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+        atomic_inc(&hw->swfw_sem_count);
+        goto retry;
+    }
+}
+
 static int32_t
 e1000_swfw_sync_acquire(struct e1000_hw *hw, uint16_t mask)
 {
@@ -3413,8 +3437,11 @@ e1000_swfw_sync_acquire(struct e1000_hw
         return e1000_get_hw_eeprom_semaphore(hw);

     while (timeout) {
-            if (e1000_get_hw_eeprom_semaphore(hw))
+            atomic_inc(&hw->swfw_sem_count);
+            if (e1000_get_hw_eeprom_semaphore(hw)) {
+                e1000_check_watchdog_deferred(hw);
                 return -E1000_ERR_SWFW_SYNC;
+            }

             swfw_sync = E1000_READ_REG(hw, SW_FW_SYNC);
             if (!(swfw_sync & (fwmask | swmask))) {
@@ -3424,6 +3451,7 @@ e1000_swfw_sync_acquire(struct e1000_hw
             /* firmware currently using resource (fwmask) */
             /* or other software thread currently using resource (swmask) */
             e1000_put_hw_eeprom_semaphore(hw);
+            e1000_check_watchdog_deferred(hw);
             mdelay(5);
             timeout--;
     }
@@ -3437,6 +3465,8 @@ e1000_swfw_sync_acquire(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    e1000_check_watchdog_deferred(hw);
+
     return E1000_SUCCESS;
 }

@@ -3458,6 +3488,7 @@ e1000_swfw_sync_release(struct e1000_hw
         return;
     }

+    atomic_inc(&hw->swfw_sem_count);
     /* if (e1000_get_hw_eeprom_semaphore(hw))
      *    return -E1000_ERR_SWFW_SYNC; */
     while (e1000_get_hw_eeprom_semaphore(hw) != E1000_SUCCESS);
@@ -3468,6 +3499,7 @@ e1000_swfw_sync_release(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    e1000_check_watchdog_deferred(hw);
 }

 /*****************************************************************************
diff -urpN linux-2.6.20-rc5_org/drivers/net/e1000/e1000_hw.h linux-2.6.20-rc5_fix/drivers/net/e1000/e1000_hw.h
--- linux-2.6.20-rc5_org/drivers/net/e1000/e1000_hw.h	2007-01-13 03:54:26.000000000 +0900
+++ linux-2.6.20-rc5_fix/drivers/net/e1000/e1000_hw.h	2007-01-18 18:04:42.000000000 +0900
@@ -306,6 +306,7 @@ typedef enum {
 #define E1000_BYTE_SWAP_WORD(_value) ((((_value) & 0x00ff) << 8) | \
                                      (((_value) & 0xff00) >> 8))

+extern void e1000_do_watchdog(struct e1000_hw *hw);
 /* Function prototypes */
 /* Initialization */
 int32_t e1000_reset_hw(struct e1000_hw *hw);
@@ -1465,6 +1466,8 @@ struct e1000_hw {
 	boolean_t		has_manc2h;
 	boolean_t		rx_needs_kicking;
 	boolean_t		has_smbus;
+	atomic_t		swfw_sem_count;	   /* >0 if swfw_sem held on ESB2 */
+	atomic_t		watchdog_deferred; /* watchdog deferred on ESB2 */
 };


diff -urpN linux-2.6.20-rc5_org/drivers/net/e1000/e1000_main.c linux-2.6.20-rc5_fix/drivers/net/e1000/e1000_main.c
--- linux-2.6.20-rc5_org/drivers/net/e1000/e1000_main.c	2007-01-13 03:54:26.000000000 +0900
+++ linux-2.6.20-rc5_fix/drivers/net/e1000/e1000_main.c	2007-01-18 18:05:54.000000000 +0900
@@ -152,6 +152,7 @@ static void e1000_clean_rx_ring(struct e
 static void e1000_set_multi(struct net_device *netdev);
 static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
+void e1000_do_watchdog(struct e1000_hw *hw);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
@@ -1313,6 +1314,9 @@ e1000_sw_init(struct e1000_adapter *adap
 	hw->tbi_compatibility_en = TRUE;
 	hw->adaptive_ifs = TRUE;

+	atomic_set(&hw->swfw_sem_count, 0);
+	atomic_set(&hw->watchdog_deferred, 0);
+
 	/* Copper options */

 	if (hw->media_type == e1000_media_type_copper) {
@@ -2555,6 +2559,29 @@ static void
 e1000_watchdog(unsigned long data)
 {
 	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
+	struct e1000_hw *hw = &adapter->hw;
+
+	if (hw->swfw_sync_present) {
+		/* Check whether the interrupted code is holding the swfw
+		 * semaphore. If it is held, defer processing the watchdog
+		 * until swfw semaphore is released.
+		 */
+		if (atomic_read(&hw->swfw_sem_count))
+			atomic_set(&hw->watchdog_deferred, 1);
+		else
+			e1000_do_watchdog(hw);
+	} else {
+		e1000_do_watchdog(hw);
+	}
+
+	/* Reset the timer */
+	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
+}
+
+void
+e1000_do_watchdog(struct e1000_hw *hw)
+{
+	struct e1000_adapter *adapter = hw->back;
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
@@ -2722,9 +2749,6 @@ e1000_watchdog(unsigned long data)
 	 * reset from the other port. Set the appropriate LAA in RAR[0] */
 	if (adapter->hw.mac_type == e1000_82571 && adapter->hw.laa_is_present)
 		e1000_rar_set(&adapter->hw, adapter->hw.mac_addr, 0);
-
-	/* Reset the timer */
-	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
 }

 enum latency_range {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2007-01-15 16:14             ` Auke Kok
@ 2007-01-16  8:42               ` Kenzo Iwami
  2007-01-18  9:22                 ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2007-01-16  8:42 UTC (permalink / raw)
  To: Auke Kok; +Cc: Jesse Brandeburg, Ronciak, John, Shaw Vrana, netdev

Hi,

Thank you for your comment.

> thanks for staying patient while most of us were out or busy. Apart from acknowledging 
> that you might have fixed a problem with your patch, we're very reluctant to merge such 
> a huge change in our driver that touches much more cases then the one that seems to be 
> giving you problems.
> 
> I've thought up a much more elegant solution that prevents the driver from asserting the 
> swfw semaphore during normal operations by checking the mac LU (link up) register in the 
> watchdog. This allows the watchdog task to bypass all PHY checking in case all link 
> statuses are OK, and thus removes the big problem that you are seeing.
> 
> Attached a version that should apply against most current trees. Please give it a try 
> and let us know if this also fixes the problem for you. I will most likely push this 
> patch to the netdev tree in any case.

I tried your patch. Unfortunately, the system still panicked with the
same symptom.

In your patch, e1000_update_stats() is still called by e1000_watchdog().
And, e1000_update_stats() calls e1000_read_phy_reg().
Therefore, interrupt handler tires to acquire the semaphore.
As a result, the same problem still occurs.

To fix this problem, interrupt handler must not call e1000_read_phy_reg()
while the interrupted code is holding the semaphore.

My patch may seem like a huge change, but in essence the change is
pretty simple.

In my patch, the interrupt handler code will check whether the interrupted
code is holding the swfw semaphore. If it is held, the watchdog function
is deferred until swfw semaphore is released.
The modification is for the interrupted code which is holding the
semaphore, and the interrupt handler, so they are both directly related
to this problem.

I will try to add some comments to my code to make it more readable.
--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2007-01-15  9:12           ` Kenzo Iwami
@ 2007-01-15 16:14             ` Auke Kok
  2007-01-16  8:42               ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2007-01-15 16:14 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Jesse Brandeburg, Ronciak, John, Shaw Vrana, netdev

[-- Attachment #1: Type: text/plain, Size: 1028 bytes --]

Kenzo Iwami wrote:
> With this patch applied, I confirmed that the system doesn't panic.
> I think this patch can fix this problem.
> Does this patch have problems.

Kenzo,

thanks for staying patient while most of us were out or busy. Apart from acknowledging 
that you might have fixed a problem with your patch, we're very reluctant to merge such 
a huge change in our driver that touches much more cases then the one that seems to be 
giving you problems.

I've thought up a much more elegant solution that prevents the driver from asserting the 
swfw semaphore during normal operations by checking the mac LU (link up) register in the 
watchdog. This allows the watchdog task to bypass all PHY checking in case all link 
statuses are OK, and thus removes the big problem that you are seeing.

Attached a version that should apply against most current trees. Please give it a try 
and let us know if this also fixes the problem for you. I will most likely push this 
patch to the netdev tree in any case.

Cheers,

Auke
---

[-- Attachment #2: e1000_git_link_check_for_change_only.patch --]
[-- Type: text/x-patch, Size: 1230 bytes --]


From: Auke Kok <auke-jan.h.kok@intel.com>

e1000: Don't do PHY reads in watchdog unless link status is down

The watchdog runs code that every 2 seconds performs several PHY reads
that are locked with the swfw semaphore, causing the semaphore to be
unavailable for a short time. This is completely unneeded in case the
MAC detects PHY link up (LU).

Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>

---
 drivers/net/e1000/e1000_main.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 34d8e5d..9660925 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2556,6 +2556,10 @@ e1000_watchdog(unsigned long data)
 	uint32_t link, tctl;
 	int32_t ret_val;
 
+	if ((netif_carrier_ok(netdev)) &&
+	    (E1000_READ_REG(&adapter->hw, STATUS) & E1000_STATUS_LU))
+		goto link_up;
+
 	ret_val = e1000_check_for_link(&adapter->hw);
 	if ((ret_val == E1000_ERR_PHY) &&
 	    (adapter->hw.phy_type == e1000_phy_igp_3) &&
@@ -2684,6 +2688,7 @@ e1000_watchdog(unsigned long data)
 		e1000_smartspeed(adapter);
 	}
 
+link_up:
 	e1000_update_stats(adapter);
 
 	adapter->hw.tx_packet_delta = adapter->stats.tpt - adapter->tpt_old;

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-12-19  0:13         ` Kenzo Iwami
@ 2007-01-15  9:12           ` Kenzo Iwami
  2007-01-15 16:14             ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2007-01-15  9:12 UTC (permalink / raw)
  To: Auke Kok; +Cc: Jesse Brandeburg, Ronciak, John, Shaw Vrana, netdev

Hi,

During the holiday season, I posted a patch that fixed this problem without
using spinlocks nor disabling interrupts.
  http://marc.theaimsgroup.com/?l=linux-netdev&m=116649413613845&w=2

With this patch applied, I confirmed that the system doesn't panic.
I think this patch can fix this problem.
Does this patch have problems.

I welcome any comments.

--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-12-12  7:58       ` Kenzo Iwami
@ 2006-12-19  0:13         ` Kenzo Iwami
  2007-01-15  9:12           ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-12-19  0:13 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Auke Kok, Jesse Brandeburg, Shaw Vrana, netdev, Ronciak, John

Hi,

Previously, I posted a patch that fixed this problem without using spinlocks
nor disabling interrupts.
I have rebased this patch for 2.6.20-rc1.

Does this patch have problems?

I welcome any comments.

--
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

Signed-off-by: Kenzo Iwami <k-iwami@cj.jp.nec.com>

diff -urpN linux-2.6.20-rc1-org/drivers/net/e1000/e1000_hw.c
linux-2.6.20-rc1-fix/drivers/net/e1000/e1000_hw.c
--- linux-2.6.20-rc1-org/drivers/net/e1000/e1000_hw.c	2006-12-14
10:14:23.000000000 +0900
+++ linux-2.6.20-rc1-fix/drivers/net/e1000/e1000_hw.c	2006-12-18
13:10:21.000000000 +0900
@@ -3394,8 +3394,19 @@ e1000_swfw_sync_acquire(struct e1000_hw
         return e1000_get_hw_eeprom_semaphore(hw);

     while (timeout) {
-            if (e1000_get_hw_eeprom_semaphore(hw))
+            atomic_inc(&hw->swfw_sem_count);
+            if (e1000_get_hw_eeprom_semaphore(hw)) {
+                if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry1:
+                    e1000_do_watchdog(hw);
+                }
+                atomic_dec(&hw->swfw_sem_count);
+                if (atomic_read(&hw->watchdog_deferred)) {
+                    atomic_inc(&hw->swfw_sem_count);
+                    goto retry1;
+                }
                 return -E1000_ERR_SWFW_SYNC;
+            }

             swfw_sync = E1000_READ_REG(hw, SW_FW_SYNC);
             if (!(swfw_sync & (fwmask | swmask))) {
@@ -3405,6 +3416,15 @@ e1000_swfw_sync_acquire(struct e1000_hw
             /* firmware currently using resource (fwmask) */
             /* or other software thread currently using resource (swmask) */
             e1000_put_hw_eeprom_semaphore(hw);
+            if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry2:
+                e1000_do_watchdog(hw);
+            }
+            atomic_dec(&hw->swfw_sem_count);
+            if (atomic_read(&hw->watchdog_deferred)) {
+                atomic_inc(&hw->swfw_sem_count);
+                goto retry2;
+            }
             mdelay(5);
             timeout--;
     }
@@ -3418,6 +3438,15 @@ e1000_swfw_sync_acquire(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry3:
+        e1000_do_watchdog(hw);
+    }
+    atomic_dec(&hw->swfw_sem_count);
+    if (atomic_read(&hw->watchdog_deferred)) {
+        atomic_inc(&hw->swfw_sem_count);
+        goto retry3;
+    }
     return E1000_SUCCESS;
 }

@@ -3439,6 +3468,7 @@ e1000_swfw_sync_release(struct e1000_hw
         return;
     }

+    atomic_inc(&hw->swfw_sem_count);
     /* if (e1000_get_hw_eeprom_semaphore(hw))
      *    return -E1000_ERR_SWFW_SYNC; */
     while (e1000_get_hw_eeprom_semaphore(hw) != E1000_SUCCESS);
@@ -3449,6 +3479,15 @@ e1000_swfw_sync_release(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry:
+        e1000_do_watchdog(hw);
+    }
+    atomic_dec(&hw->swfw_sem_count);
+    if (atomic_read(&hw->watchdog_deferred)) {
+        atomic_inc(&hw->swfw_sem_count);
+        goto retry;
+    }
 }

 /*****************************************************************************
diff -urpN linux-2.6.20-rc1-org/drivers/net/e1000/e1000_hw.h
linux-2.6.20-rc1-fix/drivers/net/e1000/e1000_hw.h
--- linux-2.6.20-rc1-org/drivers/net/e1000/e1000_hw.h	2006-12-14
10:14:23.000000000 +0900
+++ linux-2.6.20-rc1-fix/drivers/net/e1000/e1000_hw.h	2006-12-18
13:09:15.000000000 +0900
@@ -306,6 +306,7 @@ typedef enum {
 #define E1000_BYTE_SWAP_WORD(_value) ((((_value) & 0x00ff) << 8) | \
                                      (((_value) & 0xff00) >> 8))

+extern void e1000_do_watchdog(struct e1000_hw *hw);
 /* Function prototypes */
 /* Initialization */
 int32_t e1000_reset_hw(struct e1000_hw *hw);
@@ -1460,6 +1461,8 @@ struct e1000_hw {
     boolean_t mng_reg_access_disabled;
     boolean_t leave_av_bit_off;
     boolean_t kmrn_lock_loss_workaround_disabled;
+    atomic_t swfw_sem_count;
+    atomic_t watchdog_deferred;
 };


diff -urpN linux-2.6.20-rc1-org/drivers/net/e1000/e1000_main.c
linux-2.6.20-rc1-fix/drivers/net/e1000/e1000_main.c
--- linux-2.6.20-rc1-org/drivers/net/e1000/e1000_main.c	2006-12-14
10:14:23.000000000 +0900
+++ linux-2.6.20-rc1-fix/drivers/net/e1000/e1000_main.c	2006-12-18
10:03:05.000000000 +0900
@@ -152,6 +152,7 @@ static void e1000_clean_rx_ring(struct e
 static void e1000_set_multi(struct net_device *netdev);
 static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
+void e1000_do_watchdog(struct e1000_hw *hw);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
@@ -1184,6 +1185,9 @@ e1000_sw_init(struct e1000_adapter *adap
 	hw->tbi_compatibility_en = TRUE;
 	hw->adaptive_ifs = TRUE;

+	atomic_set(&hw->swfw_sem_count, 0);
+	atomic_set(&hw->watchdog_deferred, 0);
+
 	/* Copper options */

 	if (hw->media_type == e1000_media_type_copper) {
@@ -2426,6 +2430,25 @@ static void
 e1000_watchdog(unsigned long data)
 {
 	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
+	struct e1000_hw *hw = &adapter->hw;
+
+	if (hw->swfw_sync_present) {
+		if (atomic_read(&hw->swfw_sem_count))
+			atomic_set(&hw->watchdog_deferred, 1);
+		else
+			e1000_do_watchdog(hw);
+	} else {
+		e1000_do_watchdog(hw);
+	}
+
+	/* Reset the timer */
+	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
+}
+
+void
+e1000_do_watchdog(struct e1000_hw *hw)
+{
+	struct e1000_adapter *adapter = hw->back;
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
@@ -2586,9 +2609,6 @@ e1000_watchdog(unsigned long data)
 	 * reset from the other port. Set the appropriate LAA in RAR[0] */
 	if (adapter->hw.mac_type == e1000_82571 && adapter->hw.laa_is_present)
 		e1000_rar_set(&adapter->hw, adapter->hw.mac_addr, 0);
-
-	/* Reset the timer */
-	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
 }

 enum latency_range {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-12-05  0:46     ` Auke Kok
@ 2006-12-12  7:58       ` Kenzo Iwami
  2006-12-19  0:13         ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-12-12  7:58 UTC (permalink / raw)
  To: Auke Kok; +Cc: Jesse Brandeburg, Shaw Vrana, netdev, Ronciak, John

Hi,

> There are several issues that are conflicting and mixing that make it less than 
> intuitive to decide what the better fix is.
> 
> Most of all, we discussed that adding a spinlock is not going to fix the underlying 
> problem of contention, as the code that would need to be spinlocked can sleep. Not a 
> good thing.
> 
> Adding state tracking code in the form of atomics might solve the issue too, but then we 
> need to do this in quite a few locations. And it comes down to the fact that we really 
> want all users of the semaphore to halt in case it is in use.
> 
> Reducing the swfw semaphore time is a usefull exercise, but requires an amazing amount 
> of changes to all of the phy code to make sure we're not locking it too long, and even 
> then I doubt that we will reduce the maximum lock time to acceptable levels.
> 
> The watchdog then, appears to needlessly lock the semaphore every two seconds. this is 
> because even though the link is up and we're already setup, we go through the trouble of 
> doing all the PHY reads, which are protected by the semaphores.
> 
> I'm currently testing a watchdog version which completely bypasses these checks in case 
> the MAC didn't detect a link change, and we already are setup completely. In that case, 
> all we need to do is update stats and reschedule the timer.

I managed to devise a patch that will fix this problem without using
spinlocks nor disabling interrupts.

Basically, if any process is acquiring the semaphore, watchdog processing
will be deferred until when that process releases the semaphore.

Two new members are added to e1000_hw structure.

hw->swfw_sem_count represents the number of processes that is trying to hold
the semaphore.
hw->swfw_sem_count is incremented before acquiring the semaphore, and
decremented after releasing the semaphore. It will be zero if no processes
is trying to acquire the semaphore.
hw->watchdog_deferred is used to check whether there is any watchdogs
pending. It will be one if watchdog processing is deferred, and zero if
no watchdogs are pending.

Before hw->swfw_sem_count is decremented, hw->watchdog_deferred is checked
and exchanged to zero. If the old value is not zero, watchdog function
is called before decrementing hw->swfw_sem_count.

The interrupt handler will not try to acquire the semaphore, while the
interrupted code is holding the semaphore, and the deadlock is avoided.

I ran the reproduction TP I sent previously and confirmed that the system
doesn't panic.

How about this patch? Does this patch have problems?

I welcome any comments.
-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

Signed-off-by: Kenzo Iwami <k-iwami@cj.jp.nec.com>

diff -urpN linux-2.6.19.org/drivers/net/e1000/e1000_hw.c
linux-2.6.19_new/drivers/net/e1000/e1000_hw.c
--- linux-2.6.19.org/drivers/net/e1000/e1000_hw.c	2006-11-30 06:57:37.000000000
+0900
+++ linux-2.6.19_new/drivers/net/e1000/e1000_hw.c	2006-12-12 13:30:13.000000000
+0900
@@ -3389,8 +3389,19 @@ e1000_swfw_sync_acquire(struct e1000_hw
         return e1000_get_hw_eeprom_semaphore(hw);

     while (timeout) {
-            if (e1000_get_hw_eeprom_semaphore(hw))
+            atomic_inc(&hw->swfw_sem_count);
+            if (e1000_get_hw_eeprom_semaphore(hw)) {
+                if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry1:
+                    e1000_do_watchdog(hw);
+                }
+                atomic_dec(&hw->swfw_sem_count);
+                if (atomic_read(&hw->watchdog_deferred)) {
+                    atomic_inc(&hw->swfw_sem_count);
+                    goto retry1;
+                }
                 return -E1000_ERR_SWFW_SYNC;
+            }

             swfw_sync = E1000_READ_REG(hw, SW_FW_SYNC);
             if (!(swfw_sync & (fwmask | swmask))) {
@@ -3400,6 +3411,15 @@ e1000_swfw_sync_acquire(struct e1000_hw
             /* firmware currently using resource (fwmask) */
             /* or other software thread currently using resource (swmask) */
             e1000_put_hw_eeprom_semaphore(hw);
+            if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry2:
+                e1000_do_watchdog(hw);
+            }
+            atomic_dec(&hw->swfw_sem_count);
+            if (atomic_read(&hw->watchdog_deferred)) {
+                atomic_inc(&hw->swfw_sem_count);
+                goto retry2;
+            }
             mdelay(5);
             timeout--;
     }
@@ -3413,6 +3433,15 @@ e1000_swfw_sync_acquire(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry3:
+        e1000_do_watchdog(hw);
+    }
+    atomic_dec(&hw->swfw_sem_count);
+    if (atomic_read(&hw->watchdog_deferred)) {
+        atomic_inc(&hw->swfw_sem_count);
+        goto retry3;
+    }
     return E1000_SUCCESS;
 }

@@ -3434,6 +3463,7 @@ e1000_swfw_sync_release(struct e1000_hw
         return;
     }

+    atomic_inc(&hw->swfw_sem_count);
     /* if (e1000_get_hw_eeprom_semaphore(hw))
      *    return -E1000_ERR_SWFW_SYNC; */
     while (e1000_get_hw_eeprom_semaphore(hw) != E1000_SUCCESS);
@@ -3444,6 +3474,15 @@ e1000_swfw_sync_release(struct e1000_hw
     E1000_WRITE_REG(hw, SW_FW_SYNC, swfw_sync);

     e1000_put_hw_eeprom_semaphore(hw);
+    if (atomic_xchg(&hw->watchdog_deferred, 0)) {
+retry:
+        e1000_do_watchdog(hw);
+    }
+    atomic_dec(&hw->swfw_sem_count);
+    if (atomic_read(&hw->watchdog_deferred)) {
+        atomic_inc(&hw->swfw_sem_count);
+        goto retry;
+    }
 }

 /*****************************************************************************
diff -urpN linux-2.6.19.org/drivers/net/e1000/e1000_hw.h
linux-2.6.19_new/drivers/net/e1000/e1000_hw.h
--- linux-2.6.19.org/drivers/net/e1000/e1000_hw.h	2006-11-30 06:57:37.000000000
+0900
+++ linux-2.6.19_new/drivers/net/e1000/e1000_hw.h	2006-12-12 13:30:14.000000000
+0900
@@ -304,6 +304,7 @@ typedef enum {
 #define E1000_BYTE_SWAP_WORD(_value) ((((_value) & 0x00ff) << 8) | \
                                      (((_value) & 0xff00) >> 8))

+extern void e1000_do_watchdog(struct e1000_hw *hw);
 /* Function prototypes */
 /* Initialization */
 int32_t e1000_reset_hw(struct e1000_hw *hw);
@@ -1454,6 +1455,8 @@ struct e1000_hw {
     boolean_t mng_reg_access_disabled;
     boolean_t leave_av_bit_off;
     boolean_t kmrn_lock_loss_workaround_disabled;
+    atomic_t swfw_sem_count;
+    atomic_t watchdog_deferred;
 };


diff -urpN linux-2.6.19.org/drivers/net/e1000/e1000_main.c
linux-2.6.19_new/drivers/net/e1000/e1000_main.c
--- linux-2.6.19.org/drivers/net/e1000/e1000_main.c	2006-11-30
06:57:37.000000000 +0900
+++ linux-2.6.19_new/drivers/net/e1000/e1000_main.c	2006-12-12
15:00:16.000000000 +0900
@@ -148,6 +148,7 @@ static void e1000_clean_rx_ring(struct e
 static void e1000_set_multi(struct net_device *netdev);
 static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
+void e1000_do_watchdog(struct e1000_hw *hw);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
@@ -1178,6 +1179,8 @@ e1000_sw_init(struct e1000_adapter *adap
 	hw->tbi_compatibility_en = TRUE;
 	hw->adaptive_ifs = TRUE;

+	atomic_set(&hw->swfw_sem_count, 0);
+	atomic_set(&hw->watchdog_deferred, 0);
 	/* Copper options */

 	if (hw->media_type == e1000_media_type_copper) {
@@ -2401,10 +2404,27 @@ e1000_82547_tx_fifo_stall(unsigned long
  * e1000_watchdog - Timer Call-back
  * @data: pointer to adapter cast into an unsigned long
  **/
-static void
-e1000_watchdog(unsigned long data)
+static void e1000_watchdog(unsigned long data)
 {
 	struct e1000_adapter *adapter = (struct e1000_adapter *) data;
+	struct e1000_hw *hw = &adapter->hw;
+
+	if (hw->swfw_sync_present) {
+		if (atomic_read(&hw->swfw_sem_count))
+			atomic_set(&hw->watchdog_deferred, 1);
+		else
+			e1000_do_watchdog(hw);
+	} else {
+		e1000_do_watchdog(hw);
+	}
+
+	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
+}
+
+void
+e1000_do_watchdog(struct e1000_hw *hw)
+{
+	struct e1000_adapter *adapter = hw->back;
 	struct net_device *netdev = adapter->netdev;
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	uint32_t link, tctl;
@@ -2572,9 +2592,6 @@ e1000_watchdog(unsigned long data)
 	 * reset from the other port. Set the appropriate LAA in RAR[0] */
 	if (adapter->hw.mac_type == e1000_82571 && adapter->hw.laa_is_present)
 		e1000_rar_set(&adapter->hw, adapter->hw.mac_addr, 0);
-
-	/* Reset the timer */
-	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
 }

 #define E1000_TX_FLAGS_CSUM		0x00000001


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-12-04  9:14   ` Kenzo Iwami
@ 2006-12-05  0:46     ` Auke Kok
  2006-12-12  7:58       ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Auke Kok @ 2006-12-05  0:46 UTC (permalink / raw)
  To: Kenzo Iwami; +Cc: Brandeburg, Jesse, Shaw Vrana, netdev, Ronciak, John

Kenzo Iwami wrote:
> Hi,
> 
>>> Doesn't this just mean that we need a spinlock or some other kind of
>>> semaphore around acquiring, using, and releasing this resource?  We keep
>>> going around and around about this but I'm pretty sure spinlocks are
>>> meant to be able to solve exactly this issue.
>>>
>>> The problem is going to get considerably more nasty if we need to hold a
>>> spinlock with interrupts disabled for a significant amount of time, at
>>> which point a semaphore of some kind with a spinlock around it would
>>> seem to be more useful.
>> Even if spin_lock() was used to protect this resource, it is still possible
>> for an interrupt to kick in and call e1000_watchdog. In this case,
>> e1000_get_software_semaphore() will be called from within the interrupt
>> handler and the problem will still occur.
>>
>> In order to solve this problem, interrupt should be disabled (for example,
>> spin_lock_irqsave).
>> The interrupt handler can't run while the process is holding this resource,
>> and this problem doesn't occur.
>>
>>> I'll work with Auke to see if we can come up with another try.
>> Do you have any updates about your test code?
> 
> Does the fix I previously proposed have problems?
> If it does, I'd like to help find investigate another fix to solve
> this problem.

There are several issues that are conflicting and mixing that make it less than 
intuitive to decide what the better fix is.

Most of all, we discussed that adding a spinlock is not going to fix the underlying 
problem of contention, as the code that would need to be spinlocked can sleep. Not a 
good thing.

Adding state tracking code in the form of atomics might solve the issue too, but then we 
need to do this in quite a few locations. And it comes down to the fact that we really 
want all users of the semaphore to halt in case it is in use.

Reducing the swfw semaphore time is a usefull exercise, but requires an amazing amount 
of changes to all of the phy code to make sure we're not locking it too long, and even 
then I doubt that we will reduce the maximum lock time to acceptable levels.

The watchdog then, appears to needlessly lock the semaphore every two seconds. this is 
because even though the link is up and we're already setup, we go through the trouble of 
doing all the PHY reads, which are protected by the semaphores.

I'm currently testing a watchdog version which completely bypasses these checks in case 
the MAC didn't detect a link change, and we already are setup completely. In that case, 
all we need to do is update stats and reschedule the timer.

I'll keep you posted on progress.

Cheers,

Auke

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-11-21 10:16 ` Kenzo Iwami
@ 2006-12-04  9:14   ` Kenzo Iwami
  2006-12-05  0:46     ` Auke Kok
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-12-04  9:14 UTC (permalink / raw)
  To: Kenzo Iwami
  Cc: Brandeburg, Jesse, Kok, Auke-jan H, Shaw Vrana, netdev, Ronciak, John

Hi,

>> Doesn't this just mean that we need a spinlock or some other kind of
>> semaphore around acquiring, using, and releasing this resource?  We keep
>> going around and around about this but I'm pretty sure spinlocks are
>> meant to be able to solve exactly this issue.
>>
>> The problem is going to get considerably more nasty if we need to hold a
>> spinlock with interrupts disabled for a significant amount of time, at
>> which point a semaphore of some kind with a spinlock around it would
>> seem to be more useful.
> 
> Even if spin_lock() was used to protect this resource, it is still possible
> for an interrupt to kick in and call e1000_watchdog. In this case,
> e1000_get_software_semaphore() will be called from within the interrupt
> handler and the problem will still occur.
> 
> In order to solve this problem, interrupt should be disabled (for example,
> spin_lock_irqsave).
> The interrupt handler can't run while the process is holding this resource,
> and this problem doesn't occur.
> 
>> I'll work with Auke to see if we can come up with another try.
> 
> Do you have any updates about your test code?

Does the fix I previously proposed have problems?
If it does, I'd like to help find investigate another fix to solve
this problem.

-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: watchdog timeout panic in e1000 driver
  2006-11-16 17:20 Brandeburg, Jesse
@ 2006-11-21 10:16 ` Kenzo Iwami
  2006-12-04  9:14   ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Kenzo Iwami @ 2006-11-21 10:16 UTC (permalink / raw)
  To: Brandeburg, Jesse; +Cc: Kok, Auke-jan H, Shaw Vrana, netdev, Ronciak, John

Hi,

Thank you for your comment.

>>   ethtool processing holding semaphore
>>     INTERRUPT
>>       e1000_watchdog waits for semaphore to be released
>>
>> The semaphore e1000_watchdog is waiting for can only be released when
>> ethtool resumes from interrupt after e1000_watchdog finishes
>> (basically a deadlock)
>>
>> In order to solve this problem, interrupts has to be disabled on the
>> interrupted side (during ethtool processing) and not during
>> e1000_watchdog within the interrupt handler.
>>
>> e1000_get_hw_eeprom_semaphore is called from both the interrupt level
>> and the normal level, and needs to be protected by irq code. The
>> reason the patch disables interrupts when freeing the semaphore is
>> because e1000_swfw_sync_release also calls
>> e1000_get_hw_eeprom_semaphore. 
> 
> Doesn't this just mean that we need a spinlock or some other kind of
> semaphore around acquiring, using, and releasing this resource?  We keep
> going around and around about this but I'm pretty sure spinlocks are
> meant to be able to solve exactly this issue.
> 
> The problem is going to get considerably more nasty if we need to hold a
> spinlock with interrupts disabled for a significant amount of time, at
> which point a semaphore of some kind with a spinlock around it would
> seem to be more useful.

Even if spin_lock() was used to protect this resource, it is still possible
for an interrupt to kick in and call e1000_watchdog. In this case,
e1000_get_software_semaphore() will be called from within the interrupt
handler and the problem will still occur.

In order to solve this problem, interrupt should be disabled (for example,
spin_lock_irqsave).
The interrupt handler can't run while the process is holding this resource,
and this problem doesn't occur.

> I'll work with Auke to see if we can come up with another try.

Do you have any updates about your test code?

-- 
  Kenzo Iwami (k-iwami@cj.jp.nec.com)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: watchdog timeout panic in e1000 driver
@ 2006-11-16 17:20 Brandeburg, Jesse
  2006-11-21 10:16 ` Kenzo Iwami
  0 siblings, 1 reply; 29+ messages in thread
From: Brandeburg, Jesse @ 2006-11-16 17:20 UTC (permalink / raw)
  To: Kenzo Iwami, Kok, Auke-jan H; +Cc: Shaw Vrana, netdev, Ronciak, John

Kenzo Iwami wrote:
>   ethtool processing holding semaphore
>     INTERRUPT
>       e1000_watchdog waits for semaphore to be released
> 
> The semaphore e1000_watchdog is waiting for can only be released when
> ethtool resumes from interrupt after e1000_watchdog finishes
> (basically a deadlock)
> 
> In order to solve this problem, interrupts has to be disabled on the
> interrupted side (during ethtool processing) and not during
> e1000_watchdog within the interrupt handler.
> 
> e1000_get_hw_eeprom_semaphore is called from both the interrupt level
> and the normal level, and needs to be protected by irq code. The
> reason the patch disables interrupts when freeing the semaphore is
> because e1000_swfw_sync_release also calls
> e1000_get_hw_eeprom_semaphore. 

Doesn't this just mean that we need a spinlock or some other kind of
semaphore around acquiring, using, and releasing this resource?  We keep
going around and around about this but I'm pretty sure spinlocks are
meant to be able to solve exactly this issue.

The problem is going to get considerably more nasty if we need to hold a
spinlock with interrupts disabled for a significant amount of time, at
which point a semaphore of some kind with a spinlock around it would
seem to be more useful.

I'll work with Auke to see if we can come up with another try.

Jesse

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2007-02-21  5:18 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-10-19 10:19 watchdog timeout panic in e1000 driver Kenzo Iwami
2006-10-19 15:39 ` Auke Kok
     [not found]   ` <4538BFF2.2040207@cj.jp.nec.com>
2006-10-20 15:51     ` Auke Kok
2006-10-24  9:01       ` Kenzo Iwami
2006-10-24 16:15         ` Auke Kok
2006-10-25 13:41           ` Kenzo Iwami
2006-10-25 15:09             ` Auke Kok
2006-10-26 10:35               ` Kenzo Iwami
2006-10-26 14:34                 ` Auke Kok
2006-10-30 11:36                   ` Kenzo Iwami
2006-10-30 17:30                     ` Auke Kok
2006-10-31  3:22                       ` Shaw Vrana
2006-11-01 13:21                         ` Kenzo Iwami
2006-11-15 10:33                           ` Kenzo Iwami
2006-11-15 16:11                             ` Auke Kok
2006-11-16  9:23                               ` Kenzo Iwami
2007-02-20  9:26 ` Kenzo Iwami
2007-02-20 16:10   ` Auke Kok
2007-02-21  5:17     ` Kenzo Iwami
2006-11-16 17:20 Brandeburg, Jesse
2006-11-21 10:16 ` Kenzo Iwami
2006-12-04  9:14   ` Kenzo Iwami
2006-12-05  0:46     ` Auke Kok
2006-12-12  7:58       ` Kenzo Iwami
2006-12-19  0:13         ` Kenzo Iwami
2007-01-15  9:12           ` Kenzo Iwami
2007-01-15 16:14             ` Auke Kok
2007-01-16  8:42               ` Kenzo Iwami
2007-01-18  9:22                 ` Kenzo Iwami

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.