linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Alex Elder <elder@linaro.org>
Cc: David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	elder@kernel.org, evgreen@chromium.org,
	bjorn.andersson@linaro.org, cpratapa@codeaurora.org,
	Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>,
	Network Development <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend
Date: Sun, 31 Jan 2021 09:52:41 -0500	[thread overview]
Message-ID: <CAF=yD-+4xNjgkWQw8tMz0uvK45ysL6vnx86ZgEud+kCW9zw9_A@mail.gmail.com> (raw)
In-Reply-To: <e27f5c10-7b77-1f12-fe36-e9261f01bca1@linaro.org>

On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <elder@linaro.org> wrote:
>
> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> > On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <elder@linaro.org> wrote:
> >>
> >> The channel stop and suspend paths both call __gsi_channel_stop(),
> >> which quiesces channel activity, disables NAPI, and (on other than
> >> SDM845) stops the channel.  Similarly, the start and resume paths
> >> share __gsi_channel_start(), which starts the channel and re-enables
> >> NAPI again.
> >>
> >> Disabling NAPI should be done when stopping a channel, but this
> >> should *not* be done when suspending.  It's not necessary in the
> >> suspend path anyway, because the stopped channel (or suspended
> >> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >> and gsi_channel_trans_quiesce() won't return until there are no
> >> more transactions to process in the NAPI polling loop.
> >
> > But why is it incorrect to do so?
>
> Maybe it's not; I also thought it was fine before, but...
>
> Someone at Qualcomm asked me why I thought NAPI needed
> to be disabled on suspend.  My response was basically
> that it was a lightweight operation, and it shouldn't
> really be a problem to do so.
>
> Then, when I posted two patches last month, Jakub's
> response told me he didn't understand why I was doing
> what I was doing, and I stepped back to reconsider
> the details of what was happening at suspend time.
>
> https://lore.kernel.org/netdev/20210107183803.47308e23@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>
> Four things were happening to suspend a channel:
> quiesce activity; disable interrupt; disable NAPI;
> and stop the channel.  It occurred to me that a
> stopped channel would not generate interrupts, so if
> the channel was stopped earlier there would be no need
> to disable the interrupt.  Similarly there would be
> (essentially) no need to disable NAPI once a channel
> was stopped.
>
> Underlying all of this is that I started chasing a
> hang that was occurring on suspend over a month ago.
> It was hard to reproduce (hundreds or thousands of
> suspend/resume cycles without hitting it), and one
> of the few times I actually hit the problem it was
> stuck in napi_disable(), apparently waiting for
> NAPI_STATE_SCHED to get cleared by napi_complete().

This is important information.

What exactly do you mean by hang?

>
> My best guess about how this could occur was if there
> were a race of some kind between the interrupt handler
> (scheduling NAPI) and the poll function (completing
> it).  I found a number of problems while looking
> at this, and in the past few weeks I've posted some
> fixes to improve things.  Still, even with some of
> these fixes in place we have seen a hang (but now
> even more rarely).
>
> So this grand rework of suspending/stopping channels
> is an attempt to resolve this hang on suspend.

Do you have any data that this patchset resolves the issue, or is it
too hard to reproduce to say anything?

> The channel is now stopped early, and once stopped,
> everything that completed prior to the channel being
> stopped is polled before considering the suspend
> function done.

Does the call to gsi_channel_trans_quiesce before
gsi_channel_stop_retry leave a race where new transactions may occur
until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly
knowing the details of this device.

> A stopped channel won't interrupt,
> so we don't bother disabling the completion interrupt,
> with no interrupts, NAPI won't be scheduled, so there's
> no need to disable NAPI either.

That sounds plausible. But it doesn't explain why napi_disable "should
*not* be done when suspending" as the commit states.

Arguably, leaving that won't have much effect either way, and is in
line with other drivers.

Your previous patchset mentions "When stopping a channel, the IPA
driver currently disables NAPI before disabling the interrupt." That
would no longer be the case.

> The net result is simpler, and seems logical, and
> should preclude any possible race between the interrupt
> handler and poll function.  I'm trying to solve the
> hang problem analytically, because it takes *so* long
> to reproduce.
>
> I'm open to other suggestions.
>
>                                         -Alex
>
> >  From a quick look, virtio-net disables on both remove and freeze, for instance.
> >
> >> Instead, enable NAPI in gsi_channel_start(), when the completion
> >> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
> >> when finally disabling the interrupt.
> >>
> >> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
> >> NAPI polling is done before moving on.
> >>
> >> Signed-off-by: Alex Elder <elder@linaro.org>
> >> ---
> > =
> >> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
> >>          struct gsi_channel *channel = &gsi->channel[channel_id];
> >>          int ret;
> >>
> >> -       /* Enable the completion interrupt */
> >> +       /* Enable NAPI and the completion interrupt */
> >> +       napi_enable(&channel->napi);
> >>          gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
> >>
> >>          ret = __gsi_channel_start(channel, true);
> >> -       if (ret)
> >> -               gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> >> +       if (!ret)
> >> +               return 0;
> >> +
> >> +       gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> >> +       napi_disable(&channel->napi);
> >>
> >>          return ret;
> >>   }
> >
> > subjective, but easier to parse when the normal control flow is linear
> > and the error path takes a branch (or goto, if reused).
> >
>

  reply	other threads:[~2021-01-31 15:00 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-29 20:20 [PATCH net-next 0/9] net: ipa: don't disable NAPI in suspend Alex Elder
2021-01-29 20:20 ` [PATCH net-next 1/9] net: ipa: don't thaw channel if error starting Alex Elder
2021-01-29 20:20 ` [PATCH net-next 2/9] net: ipa: introduce gsi_channel_stop_retry() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 3/9] net: ipa: introduce __gsi_channel_start() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 4/9] net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw() Alex Elder
2021-01-29 20:20 ` [PATCH net-next 5/9] net: ipa: disable IEOB interrupt after channel stop Alex Elder
2021-01-29 20:20 ` [PATCH net-next 6/9] net: ipa: move completion interrupt enable/disable Alex Elder
2021-01-29 20:20 ` [PATCH net-next 7/9] net: ipa: don't disable IEOB interrupt during suspend Alex Elder
2021-01-29 20:20 ` [PATCH net-next 8/9] net: ipa: expand last transaction check Alex Elder
2021-01-29 20:20 ` [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend Alex Elder
2021-01-30 15:25   ` Willem de Bruijn
2021-01-30 19:22     ` Jakub Kicinski
2021-01-31  4:30       ` Alex Elder
2021-01-31  4:29     ` Alex Elder
2021-01-31 14:52       ` Willem de Bruijn [this message]
2021-01-31 15:32         ` Alex Elder
2021-02-01  1:36           ` Willem de Bruijn
2021-02-01 14:35             ` Alex Elder
2021-02-01 14:47               ` Willem de Bruijn
2021-02-01 15:48                 ` Alex Elder
2021-02-01 18:38                 ` Alex Elder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAF=yD-+4xNjgkWQw8tMz0uvK45ysL6vnx86ZgEud+kCW9zw9_A@mail.gmail.com' \
    --to=willemdebruijn.kernel@gmail.com \
    --cc=bjorn.andersson@linaro.org \
    --cc=cpratapa@codeaurora.org \
    --cc=davem@davemloft.net \
    --cc=elder@kernel.org \
    --cc=elder@linaro.org \
    --cc=evgreen@chromium.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=subashab@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).