From: Alex Elder <elder@linaro.org>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	elder@kernel.org, evgreen@chromium.org,
	bjorn.andersson@linaro.org, cpratapa@codeaurora.org,
	Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>,
	Network Development <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
Date: Mon, 1 Feb 2021 08:35:03 -0600
Message-ID: <a1b12c17-5d65-ce29-3d4f-e09de4fdcf3f@linaro.org>
In-Reply-To: <CAF=yD-+ABnhRmcHq=1T7PVz8VUVjqC073bjTa89GUt1rA3KVUw@mail.gmail.com>

On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> On Sun, Jan 31, 2021 at 10:32 AM Alex Elder <elder@linaro.org> wrote:
>>
>> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
>>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>
>>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
>>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
>>>>>>
>>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
>>>>>> which quiesces channel activity, disables NAPI, and (on other than
>>>>>> SDM845) stops the channel.  Similarly, the start and resume paths
>>>>>> share __gsi_channel_start(), which starts the channel and re-enables
>>>>>> NAPI again.
>>>>>>
>>>>>> Disabling NAPI should be done when stopping a channel, but this
>>>>>> should *not* be done when suspending.  It's not necessary in the
>>>>>> suspend path anyway, because the stopped channel (or suspended
>>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
>>>>>> and gsi_channel_trans_quiesce() won't return until there are no
>>>>>> more transactions to process in the NAPI polling loop.
>>>>>
>>>>> But why is it incorrect to do so?
>>>>
>>>> Maybe it's not; I also thought it was fine before, but...

. . .

>> The "hang" occurs on an RX endpoint, and in particular it
>> occurs on an endpoint that we *know* will be receiving a
>> packet as part of the suspend process (when clearing the
>> hardware pipeline).  I can go into that further but won't
>> unless asked.
>>
>>>> A stopped channel won't interrupt,
>>>> so we don't bother disabling the completion interrupt.
>>>> With no interrupts, NAPI won't be scheduled, so there's
>>>> no need to disable NAPI either.
>>>
>>> That sounds plausible. But it doesn't explain why napi_disable "should
>>> *not* be done when suspending" as the commit states.
>>>
>>> Arguably, leaving that won't have much effect either way, and is in
>>> line with other drivers.
>>
>> Understood and agreed.  In fact, if the hang occurs in
>> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
>> it would occur in napi_synchronize() as well.
> 
> Agreed.
> 
> Since you have an environment to test a patch in, it might be worthwhile
> to test essentially the same logic reordering as in this patch set,
> but while still disabling napi.

What is the purpose of this test?  Just to guarantee
that the NAPI hang goes away?  Because you agree that
the napi_synchronize() call would *also* hang if that
problem exists, right?

Anyway, what you're suggesting is to simply test with
this last patch removed.  I can do that but I really
don't expect it to change anything.  I will start that
test later today when I'm turning my attention to
something else for a while.

> The disappearing race may be due to another change rather than
> napi_disable vs napi_synchronize. A smaller, more targeted patch could
> also be a net (instead of net-next) candidate.

I am certain it is.

I can tell you that we have seen a hang (after I think 2500+
suspend/resume cycles) with the IPA code that is currently
upstream.

But with this latest series of 9, there is no hang after
10,000+ cycles.  That gives me a bisect window, but I really
don't want to go through a full bisect of even those 9,
because it's 4 tests, each of which takes days to complete.

Looking at the 9 patches, I think this one is the most
likely culprit:
   net: ipa: disable IEOB interrupt after channel stop

I think the race involves the I/O completion handler
interacting with NAPI in an unwanted way, but I have
not come up with the exact sequence that would lead
to getting stuck in napi_disable().

Here are some possible events that could occur on an
RX channel in *some* order, prior to that patch.  And
in the order I show there's at least a problem of a
receive not being processed immediately.

		. . . (suspend initiated)

	replenish_stop()
	quiesce()
			IRQ fires (receive ready)
	napi_disable()
			napi_schedule() (ignored)
	irq_disable()
			IRQ condition; pending
	channel_stop()

		. . . (resume triggered)

	channel_start()
	irq_enable()
			pending IRQ fires
			napi_schedule() (ignored)
	napi_enable()

		. . . (suspend initiated)
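
To make the two "(ignored)" steps in the sequence above concrete, here
is a toy user-space model in C.  It is only a sketch under stated
assumptions: struct toy_napi and the toy_*() functions are hypothetical
stand-ins, not the kernel's NAPI implementation or the IPA driver code,
and the model assumes a schedule request made while the instance is
disabled is simply dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a NAPI instance; NOT the kernel's struct napi_struct. */
struct toy_napi {
	bool enabled;		/* false between disable and enable */
	bool scheduled;		/* a poll has been requested */
	unsigned int dropped;	/* schedule requests ignored while disabled */
};

static void toy_napi_enable(struct toy_napi *napi)
{
	napi->enabled = true;
}

static void toy_napi_disable(struct toy_napi *napi)
{
	napi->enabled = false;
}

/* Returns true if a poll was scheduled, false if the request was ignored. */
static bool toy_napi_schedule(struct toy_napi *napi)
{
	if (!napi->enabled) {
		napi->dropped++;
		return false;
	}
	napi->scheduled = true;
	return true;
}

/*
 * Replay one suspend/resume cycle in the order shown above (the order
 * prior to the IEOB patch).  Returns the number of ignored requests.
 */
static unsigned int toy_replay_suspend_resume(struct toy_napi *napi)
{
	toy_napi_enable(napi);		/* channel running before suspend */

	/* suspend: replenish_stop(), quiesce(), then the IRQ fires */
	toy_napi_disable(napi);
	toy_napi_schedule(napi);	/* from the IRQ handler: ignored */
	/* irq_disable(); IRQ condition stays pending; channel_stop() */

	/* resume: channel_start(), irq_enable(), pending IRQ fires */
	toy_napi_schedule(napi);	/* still disabled: ignored */
	toy_napi_enable(napi);

	/* Nothing was scheduled, so the received packet is not polled. */
	assert(!napi->scheduled);

	return napi->dropped;
}
```

Calling toy_replay_suspend_resume() on a zero-initialized struct toy_napi
reports two dropped requests and leaves no poll scheduled, which matches
the "receive not being processed immediately" problem.  The model says
nothing about how the real code ends up stuck in napi_disable().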

>> At this point
>> it's more about the whole set of rework here, and keeping
>> NAPI enabled during suspend seems a little cleaner.
> 
> I'm not sure. I haven't looked if there is a common behavior across
> devices. That might be informative. igb, for one, releases all
> resources.

I tried to do a survey of that too and did not see a
consistent pattern.  I didn't look *that* hard because
doing so would be more involved than I wanted to get.

So in summary:
- I'm putting together version 2 of this series now
- Testing this past week seems to show that this series
   makes the hang in napi_disable() (or napi_synchronize())
   go away.
- I think the most likely patch in this series that
   fixes the problem is the IRQ ordering one I mention
   above, but right now I can't cite a specific sequence
   of events that would prove it.
- I will begin some long testing later today without
   this last patch applied
     --> But I think testing without the IRQ ordering
         patch would be more promising, and I'd like
         to hear your opinion on that

Thanks again for your input and help on this.

					-Alex

. . .
