From: Ben Greear <greearb@candelatech.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rui Xiang <rui.xiang@huawei.com>, Tejun Heo <tj@kernel.org>,
	Rusty Russell <rusty@rustcorp.com.au>,
	"David S. Miller" <davem@davemloft.net>,
	Li Zefan <lizefan@huawei.com>, stable <stable@vger.kernel.org>,
	Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [PATCH 2/2] Fix lockup related to stop_machine being stuck in __do_softirq.
Date: Mon, 18 May 2015 12:41:17 -0700
Message-ID: <555A405D.1060203@candelatech.com>
In-Reply-To: <CA+55aFxM9bOyOBN2diEYL3jS8a7ZxrSpO3_c1umr2KW4+NXkqQ@mail.gmail.com>

On 05/18/2015 12:19 PM, Linus Torvalds wrote:
> So this one kind of fell through the cracks, partly because I don't
> exactly love the patch.
> 
> What is it that keeps re-arming the softirq pending bit all the time?
> You mention the ath9k driver..

It has been a while... I don't remember all the details, but at least some
of the problem was that most CPU cores were shut down, so the system could
make very little progress.  Maybe that includes the ath9k logic that would
clear the softirq-pending state?  I also often run wifi systems at full
capacity using sw-rx-crypt, so maybe the system was just very busy for this
test case.

As for your other suggestions, I don't have any opinions.  I did not know this
code well and did not want to do more than just fix the problem, for fear
that I would introduce something worse.

Anyway, it seems this patch is in the official kernel as of version 3.14
(and maybe earlier... that is just the one that I checked first), so I'm
not sure why this email was sent by Rui Xiang.

Thanks,
Ben

> 
> Also, do we really need the jiffies-based one at all? Maybe we should
> just get rid of that entirely, if it's not sufficiently reliable
> anyway. It's not like we should *ever* keep doing softirq's forever,
> and quite frankly, when you introduce the limit of doing the loop at
> most ten times, I doubt that the "2 milliseconds" limit is even
> relevant any more. It would be a strange situation where ten times
> through the softirq handling loop would take more than 2ms.
> 
> So I'd rather take a patch that replaces the 2ms timeout with the
> 10-iteration timeout. And I think it might be a good idea to have a
> debug thing that says what the softirq that keeps firing was. If it's
> ath9k, I guess it's NET_TX/RX_SOFTIRQ, but maybe we could have
> something that tells exactly what it is that re-triggers it over and
> over again..
> 
>                Linus
> 
> On Wed, May 13, 2015 at 11:29 PM, Rui Xiang <rui.xiang@huawei.com> wrote:
>> From: Ben Greear <greearb@candelatech.com>
>>
>> commit 34376a50fb1fa095b9d0636fa41ed2e73125f214 upstream.
>>
>> The stop machine logic can lock up if all but one of the migration
>> threads make it through the disable-irq step and the one remaining
>> thread gets stuck in __do_softirq.  The reason __do_softirq can hang is
>> that it has a bail-out based on jiffies timeout, but in the lockup case,
>> jiffies itself is not incremented.
>>
>> To work around this, re-add the max_restart counter in __do_softirq and
>> stop processing softirqs after 10 restarts.
>>
>> Thanks to Tejun Heo and Rusty Russell and others for helping me track
>> this down.
>>
>> This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
>> latencies").
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

