From mboxrd@z Thu Jan 1 00:00:00 1970 From: Kenzo Iwami Subject: Re: watchdog timeout panic in e1000 driver Date: Tue, 16 Jan 2007 17:42:19 +0900 Message-ID: <45AC8FEB.5090102@cj.jp.nec.com> References: <36D9DB17C6DE9E40B059440DB8D95F52013ABFE9@orsmsx418.amr.corp.intel.com> <4562D207.60301@cj.jp.nec.com> <4573E6FD.3030905@cj.jp.nec.com> <4574C14E.2060108@intel.com> <457E610D.4060908@cj.jp.nec.com> <45872EAC.7050501@cj.jp.nec.com> <45AB4561.8040400@cj.jp.nec.com> <45ABA882.20806@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-2022-JP Content-Transfer-Encoding: 7bit Cc: Jesse Brandeburg , "Ronciak, John" , Shaw Vrana , netdev@vger.kernel.org Return-path: Received: from TYO202.gate.nec.co.jp ([202.32.8.206]:59259 "EHLO tyo202.gate.nec.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932451AbXAPImz (ORCPT ); Tue, 16 Jan 2007 03:42:55 -0500 To: Auke Kok In-Reply-To: <45ABA882.20806@intel.com> Sender: netdev-owner@vger.kernel.org List-Id: netdev.vger.kernel.org Hi, Thank you for your comment. > thanks for staying patient while most of us were out or busy. Apart from acknowledging > that you might have fixed a problem with your patch, we're very reluctant to merge such > a huge change in our driver that touches much more cases then the one that seems to be > giving you problems. > > I've thought up a much more elegant solution that prevents the driver from asserting the > swfw semaphore during normal operations by checking the mac LU (link up) register in the > watchdog. This allows the watchdog task to bypass all PHY checking in case all link > statuses are OK, and thus removes the big problem that you are seeing. > > Attached a version that should apply against most current trees. Please give it a try > and let us know if this also fixes the problem for you. I will most likely push this > patch to the netdev tree in any case. I tried your patch. Unfortunately, the system still panicked with the same symptom. In your patch, e1000_update_stats() is still called by e1000_watchdog(). And, e1000_update_stats() calls e1000_read_phy_reg(). Therefore, interrupt handler tires to acquire the semaphore. As a result, the same problem still occurs. To fix this problem, interrupt handler must not call e1000_read_phy_reg() while the interrupted code is holding the semaphore. My patch may seem like a huge change, but in essence the change is pretty simple. In my patch, the interrupt handler code will check whether the interrupted code is holding the swfw semaphore. If it is held, the watchdog function is deferred until swfw semaphore is released. The modification is for the interrupted code which is holding the semaphore, and the interrupt handler, so they are both directly related to this problem. I will try to add some comments to my code to make it more readable. -- Kenzo Iwami (k-iwami@cj.jp.nec.com)