From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: from mail2.candelatech.com ([208.74.158.173]:47396 "EHLO
	mail2.candelatech.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932524AbbERTlS (ORCPT );
	Mon, 18 May 2015 15:41:18 -0400
Message-ID: <555A405D.1060203@candelatech.com>
Date: Mon, 18 May 2015 12:41:17 -0700
From: Ben Greear
MIME-Version: 1.0
To: Linus Torvalds
CC: Rui Xiang, Tejun Heo, Rusty Russell, "David S. Miller", Li Zefan,
	stable, Eric Dumazet
Subject: Re: [PATCH 2/2] Fix lockup related to stop_machine being stuck in __do_softirq.
References: <1431584993-2856-1-git-send-email-rui.xiang@huawei.com>
	<1431584993-2856-3-git-send-email-rui.xiang@huawei.com>
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: stable-owner@vger.kernel.org
List-ID:

On 05/18/2015 12:19 PM, Linus Torvalds wrote:
> So this one kind of fell through the cracks, partly because I don't
> exactly love the patch.
>
> What is it that keeps re-arming the softirq pending bit all the time?
> You mention the ath9k driver..

It has been a while...I don't remember all the details, but at least some
of the problem was that most CPU cores were shut down, so the system could
make very little progress. Maybe that includes ath9k logic that would
clear the soft-irq-set logic?

I also often run wifi systems at full capacity using sw-rx-crypt, so maybe
the system was just very busy for this test case.

As for your other suggestions, I don't have any opinions. I did not know
this code well and did not want to do more than just fix the problem for
worry that I would introduce something worse.

Anyway, it seems this patch is in the official kernel by version 3.14
(and maybe sooner...that is just the one that I checked first), so I'm not
sure why this email was sent by Rui Xiang.

Thanks,
Ben

>
> Also, do we really need the jiffies-based one at all? Maybe we should
> just get rid of that entirely, if it's not sufficiently reliable
> anyway. It's not like we should *ever* keep doing softirq's forever,
> and quite frankly, when you introduce the limit of doing the loop at
> most ten times, I doubt that the "2 milliseconds" limit is even
> relevant any more. It would be a strange situation where ten times
> through the softirq handling loop would take more than 2ms.
>
> So I'd rather take a patch that replaces the 2ms timeout with the
> 10-iteration timeout. And I think it might be a good idea to have a
> debug thing that says what the softirq that keeps firing was. If it's
> ath9k, I guess it's NET_TX/RX_SOFTIRQ, but maybe we could have
> something that tells exactly what it is that re-triggers it over and
> over again..
>
> Linus
>
> On Wed, May 13, 2015 at 11:29 PM, Rui Xiang wrote:
>> From: Ben Greear
>>
>> commit 34376a50fb1fa095b9d0636fa41ed2e73125f214 upstream.
>>
>> The stop machine logic can lock up if all but one of the migration
>> threads make it through the disable-irq step and the one remaining
>> thread gets stuck in __do_softirq. The reason __do_softirq can hang is
>> that it has a bail-out based on jiffies timeout, but in the lockup case,
>> jiffies itself is not incremented.
>>
>> To work around this, re-add the max_restart counter in __do_irq and stop
>> processing irqs after 10 restarts.
>>
>> Thanks to Tejun Heo and Rusty Russell and others for helping me track
>> this down.
>>
>> This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
>> latencies").
>

-- 
Ben Greear
Candela Technologies Inc  http://www.candelatech.com
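
For anyone reading this thread later, here is a rough userspace model of the two bail-out conditions discussed above. It is only a sketch in Python, not kernel code: the names run_softirq_loop, MAX_RESTART, TIME_BUDGET, hard_stop, frozen_clock and make_ticking_clock are all made up for illustration, and the real logic lives in __do_softirq. The point it tries to show is that a time-based limit can never fire if the clock (standing in for jiffies) does not advance, while an iteration cap still ends the loop.

```python
MAX_RESTART = 10   # iteration cap, the "10 restarts" from the patch description
TIME_BUDGET = 2    # time cap in clock ticks, a stand-in for the 2 ms limit


def run_softirq_loop(pending_after_pass, now, use_restart_cap=True, hard_stop=1000):
    """Toy model of a softirq-style restart loop (hypothetical, not kernel code).

    pending_after_pass(i) -> bool: is work pending again after pass number i?
    now() -> int: current clock value (stand-in for jiffies).
    hard_stop: safety bound for this model only, so a "hung" loop still returns.
    Returns (passes_run, reason_for_stopping).
    """
    end = now() + TIME_BUDGET
    restarts_left = MAX_RESTART
    passes = 0
    while True:
        passes += 1                      # one pass over the pending handlers
        if not pending_after_pass(passes):
            return passes, "idle"        # nothing re-armed, normal exit
        if now() >= end:
            return passes, "timeout"     # time-based bail-out
        if use_restart_cap:
            restarts_left -= 1
            if restarts_left == 0:
                return passes, "max_restart"   # iteration-based bail-out
        if passes >= hard_stop:
            return passes, "stuck"       # model-only guard against spinning forever


def frozen_clock():
    """Clock that never advances, like jiffies in the lockup case."""
    return 0


def make_ticking_clock():
    """Clock that advances by one tick on every read, starting at 0."""
    t = -1

    def now():
        nonlocal t
        t += 1
        return t

    return now


def always_pending(_pass_number):
    """Handler that re-arms the pending bit after every pass."""
    return True
```

Calling run_softirq_loop(always_pending, frozen_clock) ends via the restart cap, while the same call with use_restart_cap=False only returns because of the model's hard_stop guard, which is the hang described in the commit message.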