From: Alex Elder <elder@linaro.org>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: David Miller <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
elder@kernel.org, evgreen@chromium.org,
bjorn.andersson@linaro.org, cpratapa@codeaurora.org,
Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>,
Network Development <netdev@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
Date: Mon, 1 Feb 2021 08:35:03 -0600
Message-ID: <a1b12c17-5d65-ce29-3d4f-e09de4fdcf3f@linaro.org>
In-Reply-To: <CAF=yD-+ABnhRmcHq=1T7PVz8VUVjqC073bjTa89GUt1rA3KVUw@mail.gmail.com>
On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:
>>
>> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
>>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>
>>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>>>
>>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>>>>> which quiesces channel activity, disables NAPI, and (on other than
>>>>>> SDM845) stops the channel. Similarly, the start and resume paths
>>>>>> share __gsi_channel_start(), which starts the channel and re-enables
>>>>>> NAPI again.
>>>>>>
>>>>>> Disabling NAPI should be done when stopping a channel, but this
>>>>>> should *not* be done when suspending. It's not necessary in the
>>>>>> suspend path anyway, because the stopped channel (or suspended
>>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>>>>> and gsi_channel_trans_quiesce() won't return until there are no
>>>>>> more transactions to process in the NAPI polling loop.
>>>>>
>>>>> But why is it incorrect to do so?
>>>>
>>>> Maybe it's not; I also thought it was fine before, but...
. . .
>> The "hang" occurs on an RX endpoint, and in particular it
>> occurs on an endpoint that we *know* will be receiving a
>> packet as part of the suspend process (when clearing the
>> hardware pipeline). I can go into that further but won't
>> unless asked.
>>
>>>> A stopped channel won't interrupt,
>>>> so we don't bother disabling the completion interrupt,
>>>> with no interrupts, NAPI won't be scheduled, so there's
>>>> no need to disable NAPI either.
>>>
>>> That sounds plausible. But it doesn't explain why napi_disable "should
>>> *not* be done when suspending" as the commit states.
>>>
>>> Arguably, leaving that won't have much effect either way, and is in
>>> line with other drivers.
>>
>> Understood and agreed. In fact, if the hang occurs in
>> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
>> it would occur in napi_synchronize() as well.
>
> Agreed.
>
> So you have an environment to test a patch in, it might be worthwhile
> to test essentially the same logic reordering as in this patch set,
> but while still disabling napi.
What is the purpose of this test? Just to guarantee
that the NAPI hang goes away? Because you agree that
the napi_synchronize() call would *also* hang if that
problem exists, right?
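To make that point concrete, here is a small userspace sketch in Python. It is a toy model, not kernel code: the ToyNapi class, the max_spins bound and the "stuck" setup are all made up for illustration, and the only thing it borrows from the kernel is the idea that both napi_disable() and napi_synchronize() wait for NAPI_STATE_SCHED to clear.

```python
# Toy model only: a userspace sketch of how napi_disable() and
# napi_synchronize() both wait on the SCHED bit. Not kernel code.

class ToyNapi:
    def __init__(self):
        self.sched = False  # stands in for NAPI_STATE_SCHED

    def wait_sched_clear(self, max_spins):
        """Spin up to max_spins times; True if SCHED was seen clear."""
        for _ in range(max_spins):
            if not self.sched:
                return True
        return False

    def napi_disable(self, max_spins=1000):
        # The real function sleeps until it can take SCHED for itself.
        if not self.wait_sched_clear(max_spins):
            return False   # the kernel would keep waiting here
        self.sched = True  # held by disable until napi_enable()
        return True

    def napi_synchronize(self, max_spins=1000):
        # The real function sleeps while SCHED is set, then returns.
        return self.wait_sched_clear(max_spins)


stuck = ToyNapi()
stuck.sched = True  # hypothetical: a poll that never completes
disable_ok = stuck.napi_disable()
sync_ok = stuck.napi_synchronize()
print(disable_ok, sync_ok)
```

In this model both calls give up for the same reason, so swapping one for the other would not be expected to change whether the hang shows up.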
Anyway, what you're suggesting is to simply test with
this last patch removed. I can do that but I really
don't expect it to change anything. I will start that
test later today when I'm turning my attention to
something else for a while.
> The disappearing race may be due to another change rather than
> napi_disable vs napi_synchronize. A smaller, more targeted patch could
> also be a net (instead of net-next) candidate.
I am certain it is.
I can tell you that we have seen a hang (after I think 2500+
suspend/resume cycles) with the IPA code that is currently
upstream.
But with this latest series of 9, there is no hang after
10,000+ cycles. That gives me a bisect window, but I really
don't want to go through a full bisect of even those 9,
because it's 4 tests, each of which takes days to complete.
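The rough arithmetic behind those numbers can be sketched as follows. Both helper functions are hypothetical, and the hang rate of one per 2500 cycles is an assumption drawn from a single observation, so treat the result as an order-of-magnitude estimate only.

```python
import math


def worst_case_bisect_tests(candidates):
    """Tests needed to isolate one patch among `candidates` by halving."""
    return math.ceil(math.log2(candidates))


def chance_of_clean_run(cycles, hang_rate):
    """Probability of seeing no hang in `cycles` independent cycles."""
    return (1.0 - hang_rate) ** cycles


tests = worst_case_bisect_tests(9)
# Assumed rate: one hang per 2500 cycles, taken from a single observation.
p_clean = chance_of_clean_run(10_000, 1 / 2500)
print(tests, round(p_clean, 3))
```

Under those assumptions a 9-patch window needs 4 tests in the worst case, and an unfixed bug would survive 10,000 cycles only about 2% of the time. That is suggestive, but not proof that the hang is gone.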
Looking at the 9 patches, I think this one is the most
likely culprit:
    net: ipa: disable IEOB interrupt after channel stop
I think the race involves the I/O completion handler
interacting with NAPI in an unwanted way, but I have
not come up with the exact sequence that would lead
to getting stuck in napi_disable().
Here are some possible events that could occur on an
RX channel in *some* order, prior to that patch. In
the order shown, there is at least one problem: a
receive is not processed immediately.
. . . (suspend initiated)
    replenish_stop()
    quiesce()
    IRQ fires (receive ready)
    napi_disable()
    napi_schedule() (ignored)
    irq_disable()
    IRQ condition; pending
    channel_stop()
. . . (resume triggered)
    channel_start()
    irq_enable()
    pending IRQ fires
    napi_schedule() (ignored)
    napi_enable()
. . . (suspend initiated)
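The sequence above can be replayed against a toy model to show why both napi_schedule() calls are ignored. This is a userspace Python sketch under my own simplifying assumptions: ToyNapi is hypothetical, it tracks only a single flag standing in for NAPI_STATE_SCHED, and it leaves out the IRQ and channel steps entirely.

```python
# Toy replay of the event order listed above. Not kernel code.

class ToyNapi:
    def __init__(self):
        self.sched = False  # stands in for NAPI_STATE_SCHED

    def napi_disable(self):
        # Simplification: assume no poll is running, so no waiting.
        assert not self.sched
        self.sched = True   # disable holds the bit until enable

    def napi_enable(self):
        self.sched = False

    def napi_schedule(self):
        """Return True if a poll was scheduled, False if ignored."""
        if self.sched:
            return False
        self.sched = True
        return True


napi = ToyNapi()

# Suspend: the IRQ fired, but napi_disable() ran before the handler
# reached napi_schedule().
napi.napi_disable()
first = napi.napi_schedule()

# Resume: the pending IRQ fires before napi_enable().
second = napi.napi_schedule()
napi.napi_enable()

print(first, second, napi.sched)
```

In the model neither call schedules a poll, and after napi_enable() nothing is left scheduled, so the received packet waits until some later interrupt arrives. This shows the delayed receive only. It does not reproduce a hang in napi_disable().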
>> At this point
>> it's more about the whole set of rework here, and keeping
>> NAPI enabled during suspend seems a little cleaner.
>
> I'm not sure. I haven't looked if there is a common behavior across
> devices. That might be informative. igb, for one, releases all
> resources.
I tried to do a survey of that too and did not see a
consistent pattern. I didn't look *that* hard because
doing so would be more involved than I wanted to get.
So in summary:
  - I'm putting together version 2 of this series now.
  - Testing this past week seems to show that this series
    makes the hang in napi_disable() (or napi_synchronize())
    go away.
  - I think the most likely patch in this series that
    fixes the problem is the IRQ ordering one I mention
    above, but right now I can't cite a specific sequence
    of events that would prove it.
  - I will begin some long testing later today without
    this last patch applied.
    --> But I think testing without the IRQ ordering
        patch would be more promising, and I'd like
        to hear your opinion on that.
Thanks again for your input and help on this.
-Alex
. . .
Thread overview: 21+ messages
2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
2021-01-29 20:20 ` [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting Alex Elder
2021-01-29 20:20 ` [PATCH net-next 2/9] net: ipa: introduce gsi_channel_stop_retry() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 3/9] net: ipa: introduce __gsi_channel_start() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 4/9] net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 5/9] net: ipa: disable IEOB interrupt after channel stop Alex Elder
2021-01-29 20:20 ` [PATCH net-next 6/9] net: ipa: move completion interrupt enable/disable Alex Elder
2021-01-29 20:20 ` [PATCH net-next 7/9] net: ipa: don't disable IEOB interrupt during suspend Alex Elder
2021-01-29 20:20 ` [PATCH net-next 8/9] net: ipa: expand last transaction check Alex Elder
2021-01-29 20:20 ` [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend Alex Elder
2021-01-30 15:25 ` Willem de Bruijn
2021-01-30 19:22 ` Jakub Kicinski
2021-01-31 4:30 ` Alex Elder
2021-01-31 4:29 ` Alex Elder
2021-01-31 14:52 ` Willem de Bruijn
2021-01-31 15:32 ` Alex Elder
2021-02-01 1:36 ` Willem de Bruijn
2021-02-01 14:35 ` Alex Elder [this message]
2021-02-01 14:47 ` Willem de Bruijn
2021-02-01 15:48 ` Alex Elder
2021-02-01 18:38 ` Alex Elder