linux-spi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt
@ 2020-03-16 22:20 Douglas Anderson
  2020-03-17 12:10 ` Mark Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Douglas Anderson @ 2020-03-16 22:20 UTC (permalink / raw)
  To: Mark Brown
  Cc: Alok Chauhan, Dilip Kota, skakit, Girish Mahadevan,
	Douglas Anderson, Andy Gross, Bjorn Andersson, Stephen Boyd,
	linux-arm-msm, linux-kernel, linux-spi

Every once in a while (like once in a few months on a device) people
have seen warnings on devices using spi-geni-qcom like:

  irq ...: nobody cared (try booting with the "irqpoll" option)

...where the interrupt number listed matches up with "spi_geni" in
/proc/interrupts.

You can get "nobody cared" if the interrupt handler returns IRQ_NONE.
Once you get into this state the driver will just stop working.

Auditing the code makes me believe that it's probably not so good
checking "cur_mcmd" in the interrupt handler not under spinlock.
Let's move it to be under spinlock.  While we're at it, let's not
return IRQ_NONE.  IRQ_NONE is supposed to check the _hardware_ status
registers and only be returned if your hardware says that there's no
interrupt present.  It's not supposed to check software state.  In any
case, the whole point of IRQ_NONE is if two separate devices are
trying to share an interrupt line, which just isn't true for anyone
using geni.

Looking more closely at the code, it looks as if there are some other
places that could be under spinlock, so let's add.  It looks as if the
original code was assuming that by the time that the interrupt handler
got called that the write to "cur_mcmd" (to make it not CMD_NONE
anymore) would make it to the processor handling the interrupt.
Perhaps with weakly ordered memory this sometimes (though very rarely)
happened.

Patch is marked "speculative" because I have no way to reproduce this
problem, so I'm just hoping this fixes it.  Weakly ordered memory
makes my brain hurt.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/spi/spi-geni-qcom.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index c3972424af71..51290a3fd203 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -151,16 +151,19 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	struct geni_se *se = &mas->se;
 	unsigned long time_left;
 
-	reinit_completion(&mas->xfer_done);
 	pm_runtime_get_sync(mas->dev);
 	if (!(slv->mode & SPI_CS_HIGH))
 		set_flag = !set_flag;
 
+	spin_lock_irq(&mas->lock);
+	reinit_completion(&mas->xfer_done);
+
 	mas->cur_mcmd = CMD_CS;
 	if (set_flag)
 		geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
 	else
 		geni_se_setup_m_cmd(se, SPI_CS_DEASSERT, 0);
+	spin_unlock_irq(&mas->lock);
 
 	time_left = wait_for_completion_timeout(&mas->xfer_done, HZ);
 	if (!time_left)
@@ -307,6 +310,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
 	u32 spi_tx_cfg, len;
 	struct geni_se *se = &mas->se;
 
+	spin_lock_irq(&mas->lock);
+
 	spi_tx_cfg = readl(se->base + SE_SPI_TRANS_CFG);
 	if (xfer->bits_per_word != mas->cur_bits_per_word) {
 		spi_setup_word_len(mas, mode, xfer->bits_per_word);
@@ -376,6 +381,8 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
 	 */
 	if (m_cmd & SPI_TX_ONLY)
 		writel(mas->tx_wm, se->base + SE_GENI_TX_WATERMARK_REG);
+
+	spin_unlock_irq(&mas->lock);
 }
 
 static int spi_geni_transfer_one(struct spi_master *spi,
@@ -479,12 +486,17 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
 	u32 m_irq;
 	unsigned long flags;
 
-	if (mas->cur_mcmd == CMD_NONE)
-		return IRQ_NONE;
-
 	spin_lock_irqsave(&mas->lock, flags);
 	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
 
+	/*
+	 * We don't expect to hit this, but if we do we should try our best
+	 * to clear the interrupts and return so we don't just get called
+	 * again.
+	 */
+	if (mas->cur_mcmd == CMD_NONE)
+		goto exit;
+
 	if ((m_irq & M_RX_FIFO_WATERMARK_EN) || (m_irq & M_RX_FIFO_LAST_EN))
 		geni_spi_handle_rx(mas);
 
@@ -523,6 +535,7 @@ static irqreturn_t geni_spi_isr(int irq, void *data)
 		complete(&mas->xfer_done);
 	}
 
+exit:
 	writel(m_irq, se->base + SE_GENI_M_IRQ_CLEAR);
 	spin_unlock_irqrestore(&mas->lock, flags);
 	return IRQ_HANDLED;
-- 
2.25.1.481.gfbce0eb801-goog

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt
  2020-03-16 22:20 [PATCH] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt Douglas Anderson
@ 2020-03-17 12:10 ` Mark Brown
       [not found]   ` <20200317121018.GB3971-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Mark Brown @ 2020-03-17 12:10 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Alok Chauhan, Dilip Kota, skakit, Girish Mahadevan, Andy Gross,
	Bjorn Andersson, Stephen Boyd, linux-arm-msm, linux-kernel,
	linux-spi

[-- Attachment #1: Type: text/plain, Size: 479 bytes --]

On Mon, Mar 16, 2020 at 03:20:01PM -0700, Douglas Anderson wrote:

> +	/*
> +	 * We don't expect to hit this, but if we do we should try our best
> +	 * to clear the interrupts and return so we don't just get called
> +	 * again.
> +	 */
> +	if (mas->cur_mcmd == CMD_NONE)
> +		goto exit;
> +

Does this mean that there was an actual concrete message of type
CMD_NONE or does it mean that there was no message waiting?  If there
was no message then isn't the interrupt spurious?

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt
       [not found]   ` <20200317121018.GB3971-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
@ 2020-03-17 15:12     ` Doug Anderson
  2020-03-17 16:44       ` Mark Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Doug Anderson @ 2020-03-17 15:12 UTC (permalink / raw)
  To: Mark Brown
  Cc: Alok Chauhan, Dilip Kota, skakit-sgV2jX0FEOL9JmXXK+q4OQ,
	Girish Mahadevan, Andy Gross, Bjorn Andersson, Stephen Boyd,
	linux-arm-msm, LKML, linux-spi

Hi,

On Tue, Mar 17, 2020 at 5:10 AM Mark Brown <broonie-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>
> On Mon, Mar 16, 2020 at 03:20:01PM -0700, Douglas Anderson wrote:
>
> > +     /*
> > +      * We don't expect to hit this, but if we do we should try our best
> > +      * to clear the interrupts and return so we don't just get called
> > +      * again.
> > +      */
> > +     if (mas->cur_mcmd == CMD_NONE)
> > +             goto exit;
> > +
>
> Does this mean that there was an actual concrete message of type
> CMD_NONE or does it mean that there was no message waiting?  If there
> was no message then isn't the interrupt spurious?

There is no message of type "CMD_NONE".  The "cur_mcmd" field is
basically where in the software state machine we're at:

* CMD_NONE - Software thinks that the controller should be idle.
* CMD_XFER - Software has started a transfer.
* CMD_CS - Software has started a chip select change.
* CMD_CANCEL - Software sent a "cancel".

...so certainly if we see "cur_mcmd == CMD_NONE" in the interrupt
handler we're in an unexpected situation.  We don't expect interrupts
while idle.  I wouldn't necessarily say it was a spurious interrupt,
though.  To say that I'd rather look at the result of this line in the
IRQ handler:

    m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

...if that line returns 0 then I would be willing to say it is a
spurious interrupt.


So there is really more than one issue at hand, I guess.

A) Why did we get an interrupt when we had "cur_mcmd == CMD_NONE"?
IMO this is due to weakly ordered memory and not enough locking.

B) If we do see an interrupt when "cur_mcmd == CMD_NONE" (even after
we fix the locking), what should we do?  IMO we should still try to
Ack it.  I can add a "pr_warn()" if it's helpful?

C) Do we care to try to detect spurious interrupts (by checking
SE_GENI_M_IRQ_STATUS) and return IRQ_NONE?  Right now a spurious
interrupt will be harmless because all of the logic in geni_spi_isr()
doesn't do anything if SE_GENI_M_IRQ_STATUS has no bits set.  ...but
it will still return IRQ_HANDLED.  I can't imagine anyone ever putting
this device on a shared interrupt, but if it's important I can detect
this and return IRQ_NONE in this case in a v2 of this patch.

-Doug

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt
  2020-03-17 15:12     ` Doug Anderson
@ 2020-03-17 16:44       ` Mark Brown
  0 siblings, 0 replies; 4+ messages in thread
From: Mark Brown @ 2020-03-17 16:44 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Alok Chauhan, Dilip Kota, skakit, Girish Mahadevan, Andy Gross,
	Bjorn Andersson, Stephen Boyd, linux-arm-msm, LKML, linux-spi

[-- Attachment #1: Type: text/plain, Size: 1969 bytes --]

On Tue, Mar 17, 2020 at 08:12:30AM -0700, Doug Anderson wrote:
> On Tue, Mar 17, 2020 at 5:10 AM Mark Brown <broonie@kernel.org> wrote:
> > On Mon, Mar 16, 2020 at 03:20:01PM -0700, Douglas Anderson wrote:

> >
> > Does this mean that there was an actual concrete message of type
> > CMD_NONE or does it mean that there was no message waiting?  If there
> > was no message then isn't the interrupt spurious?

> There is no message of type "CMD_NONE".  The "cur_mcmd" field is
> basically where in the software state machine we're at:

...

> ...so certainly if we see "cur_mcmd == CMD_NONE" in the interrupt
> handler we're in an unexpected situation.  We don't expect interrupts
> while idle.  I wouldn't necessarily say it was a spurious interrupt,
> though.  To say that I'd rather look at the result of this line in the
> IRQ handler:

>     m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);

> ...if that line returns 0 then I would be willing to say it is a
> spurious interrupt.

Right, that should return IRQ_NONE if there's nothing actually asserted.
I think you're right about the state machine though.

> C) Do we care to try to detect spurious interrupts (by checking
> SE_GENI_M_IRQ_STATUS) and return IRQ_NONE?  Right now a spurious
> interrupt will be harmless because all of the logic in geni_spi_isr()
> doesn't do anything if SE_GENI_M_IRQ_STATUS has no bits set.  ...but
> it will still return IRQ_HANDLED.  I can't imagine anyone ever putting
> this device on a shared interrupt, but if it's important I can detect
> this and return IRQ_NONE in this case in a v2 of this patch.

It's a good idea to return IRQ_NONE not just for the shared interrupt
case but also because it lets the error handling code in the genirq core
do it's thing and detect bugs - as seems to have happened here.  This
one was fairly benign but you can also see things like interrupts that
are constantly asserted by the hardware where we end up spinning in
interrupt handlers.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-03-17 16:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-16 22:20 [PATCH] spi: spi-geni-qcom: Speculative fix of "nobody cared" about interrupt Douglas Anderson
2020-03-17 12:10 ` Mark Brown
     [not found]   ` <20200317121018.GB3971-GFdadSzt00ze9xe1eoZjHA@public.gmane.org>
2020-03-17 15:12     ` Doug Anderson
2020-03-17 16:44       ` Mark Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).