From mboxrd@z Thu Jan  1 00:00:00 1970
From: Kenzo Iwami <k-iwami@cj.jp.nec.com>
Subject: Re: watchdog timeout panic in e1000 driver
Date: Wed, 01 Nov 2006 22:21:29 +0900
Message-ID: <45489F59.6030102@cj.jp.nec.com>
References: <4538BFF2.2040207@cj.jp.nec.com> <4538F080.5020003@intel.com> <453DD678.4010606@cj.jp.nec.com> <453E3C0B.5030600@intel.com> <453F6983.6020307@cj.jp.nec.com> <453F7E1F.4020406@intel.com> <45408F7B.3050209@cj.jp.nec.com> <4540C765.4000800@intel.com> <4545E3A4.9090004@cj.jp.nec.com> <454636B0.1010004@intel.com> <20061031032218.GA7804@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit
Cc: Auke Kok <auke-jan.h.kok@intel.com>, netdev@vger.kernel.org,
	Jesse Brandeburg <jesse.brandeburg@intel.com>,
	"Ronciak, John" <john.ronciak@intel.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from TYO202.gate.nec.co.jp ([210.143.35.52]:39392 "EHLO
	tyo202.gate.nec.co.jp") by vger.kernel.org with ESMTP
	id S1752166AbWKANVp (ORCPT <rfc822;netdev@vger.kernel.org>);
	Wed, 1 Nov 2006 08:21:45 -0500
To: Shaw Vrana <shaw@vranix.com>
In-Reply-To: <20061031032218.GA7804@gmail.com>
Sender: netdev-owner@vger.kernel.org
List-Id: netdev.vger.kernel.org

Hi,

>>> Even if the total lock time can be reduced, it's possible that interrupt
>>> handler is executed while the interrupted code is still holding the 
>>> semaphore.
>>> I think your method only decrease the frequency of this problem.
>>> Why does reducing the lock time solve this problem?
>> there are several problems here that need addressing. It's not acceptable 
>> for our driver to wait up to 15 seconds, and we can (presumably) reduce it 
>> to milliseconds, so that would help a lot. We should in no case at all hold 
>> it for any period longer than (give or take) half a second, so working 
>> towards that is a very good step in the right direction.
>>
>> Adding the timer task back may also help, as we are no longer trying to 
>> aqcuire the sw_fw_semaphore in interrupt context, but we removed it for a 
>> reason, and I need to dig up what reason this exactly was before we can 
>> revert it. Jesse might know, so I'll talk to him. But this will not fix the 
>> fact that the semaphore is held for a long time :)
> 
> Timer tasks that reschedule themselves are a pain.  The watchdog timer task
> had a couple of race conditions that were thought to be better fixed by
> removing it all together.  Please, let's not go down that road again!

I understand that the watchdog_task could cause a race when the timer task
and e1000_down runs concurrently, resulting in memory double free.

I think this problem occurs because interrupt handler is executed in same
CPU as process that acquires semaphore.
How about disabling interrupt while the process is holding the semaphore?
I think this is possible, if the total lock time has been reduced.

-- 
  Kenzi Iwami (k-iwami@cj.jp.nec.com)