Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr

From: Doug Anderson <dianders@chromium.org>
To: Stephen Boyd <swboyd@chromium.org>
Cc: Roja Rani Yarubandi <rojay@codeaurora.org>,
	Mark Brown <broonie@kernel.org>, Andy Gross <agross@kernel.org>,
	Bjorn Andersson <bjorn.andersson@linaro.org>,
	linux-arm-msm <linux-arm-msm@vger.kernel.org>,
	linux-spi <linux-spi@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Akash Asthana <akashast@codeaurora.org>,
	msavaliy@qti.qualcomm.com
Subject: Re: [PATCH] spi: spi-geni-qcom: Fix NULL pointer access in geni_spi_isr
Date: Mon, 14 Dec 2020 16:31:31 -0800	[thread overview]
Message-ID: <CAD=FV=UJStfgAUvKc9iv7Y27OAAxWVdEsCf_HVmpc0eN9y4Gfg@mail.gmail.com> (raw)
In-Reply-To: <160773672053.1580929.15441111796129112926@swboyd.mtv.corp.google.com>

Hi,

On Fri, Dec 11, 2020 at 5:32 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Doug Anderson (2020-12-10 17:51:53)
> > Hi,
> >
> > On Thu, Dec 10, 2020 at 5:39 PM Stephen Boyd <swboyd@chromium.org> wrote:
> > >
> > > Quoting Doug Anderson (2020-12-10 17:30:17)
> > > > On Thu, Dec 10, 2020 at 5:21 PM Stephen Boyd <swboyd@chromium.org> wrote:
> > > > >
> > > > > Yeah and so if it comes way later because it timed out then what's the
> > > > > point of calling synchronize_irq() again? To make the completion
> > > > > variable set when it won't be tested again until it is reinitialized?
> > > >
> > > > Presumably the idea is to try to recover to a somewhat usable state
> > > > again?  We're not rebooting the machine so, even though this transfer
> > > > failed, we will undoubtedly do another transfer later.  If that
> > > > "abort" interrupt comes way later while we're setting up the next
> > > > transfer we'll really confuse ourselves.
> > >
> > > The interrupt handler just sets a completion variable. What does that
> > > confuse?
> >
> > The interrupt handler sees a "DONE" interrupt.  If we've made it far
> > enough into setting up the next transfer that "cur_xfer" has been set
> > then it might do more, no?
>
> I thought it saw a cancel/abort EN bit?
>
>         if (m_irq & M_CMD_CANCEL_EN)
>                 complete(&mas->cancel_done);
>         if (m_irq & M_CMD_ABORT_EN)
>                 complete(&mas->abort_done)
>
> and only a DONE bit if a transfer happened.

Ah, true.  The crazy thing is that since we do abort / cancel with
commands we get them together with "done".  That "done" could
potentially confuse the next transfer...  In theory we could ignore
DONE if we see ABORT / CANCEL, but I've now spent a bunch of time on
this and I think the best thing is to just make sure we won't start
the next transfer if any IRQs are pending.  I'll post patches...

> > > > I guess you could go the route of adding a synchronize_irq() at the
> > > > start of the next transfer, but I'd rather add the overhead in the
> > > > exceptional case (the timeout) than the normal case.  In the normal
> > > > case we don't need to worry about random IRQs from the past transfer
> > > > suddenly showing up.
> > > >
> > >
> > > How does adding synchronize_irq() at the end guarantee that the abort is
> > > cleared out of the hardware though? It seems to assume that the abort is
> > > pending at the GIC when it could still be running through the hardware
> > > and not executed yet. It seems like a synchronize_irq() for that is
> > > wishful thinking that the irq is merely pending even though it timed
> > > out and possibly never ran. Maybe it's stuck in a write buffer in the
> > > CPU?
> >
> > I guess I'm asserting that if a full second passed (because we timed
> > out) and after that full second no interrupts are pending then the
> > interrupt will never come.  That seems a reasonable assumption to me.
> > It seems hard to believe it'd be stuck in a write buffer for a full
> > second?
> >
>
> Ok, so if we don't expect an irq to come in why are we calling
> synchronize_irq()? I'm lost.

It turns out that synchronize_irq() doesn't do what I thought it did,
actually.  :(  Despite __synchronize_hardirq() talking about waiting
for "pending" interrupts, it actually passes in "IRQCHIP_STATE_ACTIVE"
and not "IRQCHIP_STATE_PENDING".  So much for that.

...but, if it did, I guess my point (which no longer matters) was:

a) If you wait a second but don't wait for pending interrupts to be
done, interrupts might still come later if the CPU servicing
interrupts was blocked.

b) If you don't wait a second but wait for pending interrupts to be
done, interrupts might still come later because maybe the transaction
wasn't finished first.

c) If you wait a second (enough for the transaction to finish) and
then wait for pending interrupts (to handle ISR being blocked) then
you're good.

---

So I got tired of all this conjecture and decided to write some code.
I reproduced the problem with some test code that let me call
local_irq_disable() for a set amount of time based on sysfs.

In terminal 1:
  while true; do ectool version > /dev/null; done

In terminal 2, disable interrupts on cpu0 for 2000 ms:
  taskset -c 0 echo 2000 > /sys/module/spi_geni_qcom/parameters/doug_test

Of course, I got the timeout and the NULL dereference.

Then I could poke at all the corner cases.  Posting patches for what I
think is the best solution...

-Doug