From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <stable-owner@vger.kernel.org>
Received: from mail2.candelatech.com ([208.74.158.173]:47396 "EHLO
	mail2.candelatech.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932524AbbERTlS (ORCPT
	<rfc822;stable@vger.kernel.org>); Mon, 18 May 2015 15:41:18 -0400
Message-ID: <555A405D.1060203@candelatech.com>
Date: Mon, 18 May 2015 12:41:17 -0700
From: Ben Greear <greearb@candelatech.com>
MIME-Version: 1.0
To: Linus Torvalds <torvalds@linux-foundation.org>
CC: Rui Xiang <rui.xiang@huawei.com>, Tejun Heo <tj@kernel.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	"David S. Miller" <davem@davemloft.net>,
	Li Zefan <lizefan@huawei.com>, stable <stable@vger.kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [PATCH 2/2] Fix lockup related to stop_machine being stuck in
 __do_softirq.
References: <1431584993-2856-1-git-send-email-rui.xiang@huawei.com> <1431584993-2856-3-git-send-email-rui.xiang@huawei.com> <CA+55aFxM9bOyOBN2diEYL3jS8a7ZxrSpO3_c1umr2KW4+NXkqQ@mail.gmail.com>
In-Reply-To: <CA+55aFxM9bOyOBN2diEYL3jS8a7ZxrSpO3_c1umr2KW4+NXkqQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: stable-owner@vger.kernel.org
List-ID: <stable.vger.kernel.org>

On 05/18/2015 12:19 PM, Linus Torvalds wrote:
> So this one kind of fell through the cracks, partly because I don't
> exactly love the patch.
> 
> What is it that keeps re-arming the softirq pending bit all the time?
> You mention the ath9k driver..

It has been a while and I don't remember all the details, but at least part
of the problem was that most CPU cores were shut down, so the system could
make very little progress.  Maybe that stalled the ath9k logic that would
otherwise clear the pending softirq?  I also often run wifi systems at full
capacity using sw-rx-crypt, so maybe the system was just very busy in this
test case.

As for your other suggestions, I don't have any opinions.  I did not know this
code well and did not want to do more than just fix the problem, for fear of
introducing something worse.

Anyway, it seems this patch has been in the official kernel since at least
version 3.14 (maybe earlier; that is just the one I checked first), so I'm
not sure why this email was sent by Rui Xiang.

Thanks,
Ben

> 
> Also, do we really need the jiffies-based one at all? Maybe we should
> just get rid of that entirely, if it's not sufficiently reliable
> anyway. It's not like we should *ever* keep doing softirq's forever,
> and quite frankly, when you introduce the limit of doing the loop at
> most ten times, I doubt that the "2 milliseconds" limit is even
> relevant any more. It would be a strange situation where ten times
> through the softirq handling loop would take more than 2ms.
> 
> So I'd rather take a patch that replaces the 2ms timeout with the
> 10-iteration timeout. And I think it might be a good idea to have a
> debug thing that says what the softirq that keeps firing was. If it's
> ath9k, I guess it's NET_TX/RX_SOFTIRQ, but maybe we could have
> something that tells exactly what it is that re-triggers it over and
> over again..
> 
>                Linus
> 
> On Wed, May 13, 2015 at 11:29 PM, Rui Xiang <rui.xiang@huawei.com> wrote:
>> From: Ben Greear <greearb@candelatech.com>
>>
>> commit 34376a50fb1fa095b9d0636fa41ed2e73125f214 upstream.
>>
>> The stop machine logic can lock up if all but one of the migration
>> threads make it through the disable-irq step and the one remaining
>> thread gets stuck in __do_softirq.  The reason __do_softirq can hang is
>> that it has a bail-out based on jiffies timeout, but in the lockup case,
>> jiffies itself is not incremented.
>>
>> To work around this, re-add the max_restart counter in __do_softirq and
>> stop processing irqs after 10 restarts.
>>
>> Thanks to Tejun Heo and Rusty Russell and others for helping me track
>> this down.
>>
>> This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
>> latencies").
> 
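
For anyone following along, the bail-out described in the quoted commit
message can be sketched roughly as below.  This is a userspace simulation,
not the kernel code: fake_jiffies, ticks_advance, pending, run_handlers() and
do_softirq_sketch() are all made-up names for illustration, and the time
budget is counted in simulated ticks rather than real milliseconds.  The point
is only that a time-based limit never trips when the clock is frozen, while
the restart counter still does.

```c
#include <stdbool.h>

#define MAX_SOFTIRQ_RESTART 10  /* iteration cap from the quoted commit message */
#define MAX_SOFTIRQ_TIME     2  /* time budget in simulated ticks (assumption) */

/* Hypothetical stand-ins for kernel state; not real kernel symbols. */
static unsigned long fake_jiffies;  /* simulated tick counter */
static bool ticks_advance;          /* false models the lockup: clock frozen */
static unsigned int pending;        /* simulated softirq pending flag */

/* Simulated handler that always re-raises itself, like a very busy driver. */
static void run_handlers(void)
{
	pending = 1;
	if (ticks_advance)
		fake_jiffies++;
}

/* Runs the handler loop and returns how many passes were made. */
static int do_softirq_sketch(void)
{
	unsigned long end = fake_jiffies + MAX_SOFTIRQ_TIME;
	int max_restart = MAX_SOFTIRQ_RESTART;
	int passes = 0;

	do {
		pending = 0;
		run_handlers();
		passes++;
		/*
		 * Restart only while work is pending, the time budget has
		 * not run out, and restarts remain.  With a frozen clock the
		 * middle test is always true, so only the counter ends the loop.
		 */
	} while (pending && fake_jiffies < end && --max_restart);

	return passes;
}
```

With ticks_advance set to false, do_softirq_sketch() stops after 10 passes
because of the counter alone; with it set to true, the simulated time budget
ends the loop first.  Without the max_restart test, the frozen-clock case
would never return.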


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com