linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
@ 2020-12-16 22:41 Douglas Anderson
  2020-12-16 22:41 ` [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending Douglas Anderson
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Douglas Anderson @ 2020-12-16 22:41 UTC (permalink / raw)
  To: Mark Brown
  Cc: msavaliy, akashast, Stephen Boyd, Roja Rani Yarubandi,
	Douglas Anderson, Alok Chauhan, Andy Gross, Bjorn Andersson,
	Dilip Kota, Girish Mahadevan, linux-arm-msm, linux-kernel,
	linux-spi

In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
variable") we changed handle_fifo_timeout() so that we set
"mas->cur_xfer" to NULL to make absolutely sure that we don't mess
with the buffers from the previous transfer in the timeout case.

Unfortunately, this caused the IRQ handler to dereference NULL in some
cases.  One case:

  CPU0                           CPU1
  ----                           ----
                                 setup_fifo_xfer()
                                  geni_se_setup_m_cmd()
                                 <hardware starts transfer>
                                 <transfer completes in hardware>
                                 <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq>
                                 ...
                                 handle_fifo_timeout()
                                  spin_lock_irq(mas->lock)
                                  mas->cur_xfer = NULL
                                  geni_se_cancel_m_cmd()
                                  spin_unlock_irq(mas->lock)

  geni_spi_isr()
   spin_lock(mas->lock)
   if (m_irq & M_RX_FIFO_WATERMARK_EN)
    geni_spi_handle_rx()
     mas->cur_xfer NULL dereference!

tl;dr: Seriously delayed interrupts for RX/TX can lead to timeout
handling setting mas->cur_xfer to NULL.

Let's check for the NULL transfer in the TX and RX cases and reset the
watermark or clear out the fifo respectively to put the hardware back
into a sane state.

NOTE: things still could get confused if we get timeouts all the way
through handle_fifo_timeout() and then start a new transfer because
interrupts from the old transfer / cancel / abort could still be
pending.  A future patch will help this corner case.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- (ptr == NULL) => (!ptr).
- Addressed loop nits in geni_spi_handle_rx().
- Commit message rewording from Stephen.

 drivers/spi/spi-geni-qcom.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index 25810a7eef10..bf55abbd39f1 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -354,6 +354,12 @@ static bool geni_spi_handle_tx(struct spi_geni_master *mas)
 	unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas);
 	unsigned int i = 0;
 
+	/* Stop the watermark IRQ if nothing to send */
+	if (!mas->cur_xfer) {
+		writel(0, se->base + SE_GENI_TX_WATERMARK_REG);
+		return false;
+	}
+
 	max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word;
 	if (mas->tx_rem_bytes < max_bytes)
 		max_bytes = mas->tx_rem_bytes;
@@ -396,6 +402,14 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas)
 		if (rx_last_byte_valid && rx_last_byte_valid < 4)
 			rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid;
 	}
+
+	/* Clear out the FIFO and bail if nowhere to put it */
+	if (mas->cur_xfer == NULL) {
+		while (i++ < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word))
+			readl(se->base + SE_GENI_RX_FIFOn);
+		return;
+	}
+
 	if (mas->rx_rem_bytes < rx_bytes)
 		rx_bytes = mas->rx_rem_bytes;
 
-- 
2.29.2.684.gfbc64c5ab5-goog
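
For readers following along outside the kernel tree, the RX half of the
fix above can be sketched as a small userspace model.  Everything
prefixed model_ or MODEL_ is hypothetical and exists only for
illustration: the FIFO is a plain array rather than the
SE_GENI_RX_FIFOn register, and the byte order used to unpack a word is
an assumption.  It is not the driver code.  The point it shows is that
with no current transfer the words are still read out, so the FIFO ends
up empty, but nothing is stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round-up division, same idea as the kernel's DIV_ROUND_UP(). */
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define MODEL_FIFO_WORDS 16

/* Hypothetical stand-in for the RX FIFO: reading a word pops it. */
struct model_fifo {
	uint32_t words[MODEL_FIFO_WORDS];
	unsigned int head;	/* index of the next word to pop */
	unsigned int count;	/* words still queued */
};

struct model_xfer {
	uint8_t *rx_buf;
};

struct model_master {
	struct model_fifo fifo;
	struct model_xfer *cur_xfer;	/* NULL after timeout handling */
	unsigned int rx_rem_bytes;
};

static void model_fifo_push(struct model_fifo *f, uint32_t word)
{
	assert(f->head + f->count < MODEL_FIFO_WORDS);
	f->words[f->head + f->count] = word;
	f->count++;
}

static uint32_t model_fifo_read(struct model_fifo *f)
{
	assert(f->count > 0);
	f->count--;
	return f->words[f->head++];
}

/*
 * Model of the RX path.  Returns the number of bytes stored in the
 * transfer buffer.  bytes_per_fifo_word is assumed to be 1..4.
 */
static unsigned int model_handle_rx(struct model_master *mas,
				    unsigned int rx_bytes,
				    unsigned int bytes_per_fifo_word)
{
	unsigned int i, j, copied = 0;

	/* Nowhere to put the data: drain the FIFO and bail. */
	if (!mas->cur_xfer) {
		unsigned int words =
			MODEL_DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word);

		for (i = 0; i < words; i++)
			(void)model_fifo_read(&mas->fifo);
		return 0;
	}

	if (mas->rx_rem_bytes < rx_bytes)
		rx_bytes = mas->rx_rem_bytes;

	while (copied < rx_bytes) {
		uint32_t word = model_fifo_read(&mas->fifo);
		unsigned int n = rx_bytes - copied;

		if (n > bytes_per_fifo_word)
			n = bytes_per_fifo_word;
		/* Assumed byte order: least significant byte first. */
		for (j = 0; j < n; j++)
			mas->cur_xfer->rx_buf[copied++] =
				(uint8_t)(word >> (8 * j));
	}
	mas->rx_rem_bytes -= copied;
	return copied;
}
```

Calling model_handle_rx() with cur_xfer left NULL empties the model FIFO
and returns 0; attaching a transfer and calling it again copies the
bytes out as usual.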


* [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
  2020-12-16 22:41 [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
@ 2020-12-16 22:41 ` Douglas Anderson
  2020-12-17  4:21   ` Stephen Boyd
  2020-12-16 22:41 ` [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending Douglas Anderson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Douglas Anderson @ 2020-12-16 22:41 UTC (permalink / raw)
  To: Mark Brown
  Cc: msavaliy, akashast, Stephen Boyd, Roja Rani Yarubandi,
	Douglas Anderson, Alok Chauhan, Andy Gross, Bjorn Andersson,
	Dilip Kota, Girish Mahadevan, linux-arm-msm, linux-kernel,
	linux-spi

If we got a timeout when trying to send an abort command then it means
that we just got 3 timeouts in a row:

1. The original timeout that caused handle_fifo_timeout() to be
   called.
2. A one second timeout waiting for the cancel command to finish.
3. A one second timeout waiting for the abort command to finish.

SPI is clocked by the controller, so nothing (aside from a hardware
fault or a totally broken sequencer) should be causing the actual
commands to fail in hardware.  However, even though the hardware
itself is not expected to fail (and it'd be hard to predict how we
should handle things if it did), it's easy to hit the timeout case by
simply blocking our interrupt handler from running for a long period
of time.  Obviously the system is in pretty bad shape if an interrupt
handler is blocked for > 2 seconds, but there are certainly bugs (even
bugs in other unrelated drivers) that can make this happen.

Let's make things a bit more robust against this case.  If we fail to
abort we'll set a flag and then we'll block all future transfers until
we have no more interrupts pending.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- Make this just about the failed abort.

 drivers/spi/spi-geni-qcom.c | 56 +++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index bf55abbd39f1..d988463e606f 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -83,6 +83,7 @@ struct spi_geni_master {
 	spinlock_t lock;
 	int irq;
 	bool cs_flag;
+	bool abort_failed;
 };
 
 static int get_spi_clk_cfg(unsigned int speed_hz,
@@ -141,8 +142,46 @@ static void handle_fifo_timeout(struct spi_master *spi,
 	spin_unlock_irq(&mas->lock);
 
 	time_left = wait_for_completion_timeout(&mas->abort_done, HZ);
-	if (!time_left)
+	if (!time_left) {
 		dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
+
+		/*
+		 * No need for a lock since SPI core has a lock and we never
+		 * access this from an interrupt.
+		 */
+		mas->abort_failed = true;
+	}
+}
+
+static bool spi_geni_is_abort_still_pending(struct spi_geni_master *mas)
+{
+	struct geni_se *se = &mas->se;
+	u32 m_irq, m_irq_en;
+
+	if (!mas->abort_failed)
+		return false;
+
+	/*
+	 * The only known case where a transfer times out and then a cancel
+	 * times out then an abort times out is if something is blocking our
+	 * interrupt handler from running.  Avoid starting any new transfers
+	 * until that sorts itself out.
+	 */
+	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
+	m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
+	if (m_irq & m_irq_en) {
+		dev_err(mas->dev, "Interrupts pending after abort: %#010x\n",
+			m_irq & m_irq_en);
+		return true;
+	}
+
+	/*
+	 * If we're here the problem resolved itself so no need to check more
+	 * on future transfers.
+	 */
+	mas->abort_failed = false;
+
+	return false;
 }
 
 static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
@@ -158,9 +197,15 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	if (set_flag == mas->cs_flag)
 		return;
 
+	pm_runtime_get_sync(mas->dev);
+
+	if (spi_geni_is_abort_still_pending(mas)) {
+		dev_err(mas->dev, "Can't set chip select\n");
+		goto exit;
+	}
+
 	mas->cs_flag = set_flag;
 
-	pm_runtime_get_sync(mas->dev);
 	spin_lock_irq(&mas->lock);
 	reinit_completion(&mas->cs_done);
 	if (set_flag)
@@ -173,6 +218,7 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	if (!time_left)
 		handle_fifo_timeout(spi, NULL);
 
+exit:
 	pm_runtime_put(mas->dev);
 }
 
@@ -280,6 +326,9 @@ static int spi_geni_prepare_message(struct spi_master *spi,
 	int ret;
 	struct spi_geni_master *mas = spi_master_get_devdata(spi);
 
+	if (spi_geni_is_abort_still_pending(mas))
+		return -EBUSY;
+
 	ret = setup_fifo_params(spi_msg->spi, spi);
 	if (ret)
 		dev_err(mas->dev, "Couldn't select mode %d\n", ret);
@@ -509,6 +558,9 @@ static int spi_geni_transfer_one(struct spi_master *spi,
 {
 	struct spi_geni_master *mas = spi_master_get_devdata(spi);
 
+	if (spi_geni_is_abort_still_pending(mas))
+		return -EBUSY;
+
 	/* Terminate and return success for 0 byte length transfer */
 	if (!xfer->len)
 		return 0;
-- 
2.29.2.684.gfbc64c5ab5-goog
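
The latch behaviour of the abort_failed flag in the patch above can be
sketched in userspace as follows.  This is only a model under stated
assumptions: the model_regs struct is a hypothetical snapshot standing
in for the readl() calls on SE_GENI_M_IRQ_STATUS and SE_GENI_M_IRQ_EN,
and model_transfer_one() is an invented name for the gate added to
spi_geni_transfer_one().  It shows that the flag keeps transfers
blocked only while an enabled interrupt is still pending, and that it
clears itself once the status goes quiet.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register snapshot standing in for readl() results. */
struct model_regs {
	uint32_t m_irq_status;	/* models SE_GENI_M_IRQ_STATUS */
	uint32_t m_irq_en;	/* models SE_GENI_M_IRQ_EN */
};

struct model_master {
	struct model_regs regs;
	bool abort_failed;	/* set when the abort command timed out */
};

/* Same shape as spi_geni_is_abort_still_pending() in the patch. */
static bool model_abort_still_pending(struct model_master *mas)
{
	assert(mas);

	if (!mas->abort_failed)
		return false;

	/* Only interrupts that are both raised and enabled count. */
	if (mas->regs.m_irq_status & mas->regs.m_irq_en)
		return true;

	/* The backlog resolved itself; stop checking on later calls. */
	mas->abort_failed = false;
	return false;
}

/* Models the -EBUSY gate placed at the top of the transfer path. */
static int model_transfer_one(struct model_master *mas)
{
	if (model_abort_still_pending(mas))
		return -EBUSY;
	return 0;
}
```

Once the flag has cleared, a later pending interrupt no longer blocks
transfers in this model, which matches the patch: the check only
applies after a failed abort.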


* [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-16 22:41 [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
  2020-12-16 22:41 ` [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending Douglas Anderson
@ 2020-12-16 22:41 ` Douglas Anderson
  2020-12-17  4:25   ` Stephen Boyd
  2020-12-17  4:29   ` Stephen Boyd
  2020-12-16 22:41 ` [PATCH v2 4/4] spi: spi-geni-qcom: Print an error when we timeout setting the CS Douglas Anderson
  2020-12-17  3:41 ` [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
  3 siblings, 2 replies; 14+ messages in thread
From: Douglas Anderson @ 2020-12-16 22:41 UTC (permalink / raw)
  To: Mark Brown
  Cc: msavaliy, akashast, Stephen Boyd, Roja Rani Yarubandi,
	Douglas Anderson, Andy Gross, Bjorn Andersson, linux-arm-msm,
	linux-kernel, linux-spi

If we get a timeout sending then this happens:
* spi_transfer_wait() will get a timeout.
* We'll set the chip select
* We'll call handle_err() => handle_fifo_timeout().

Unfortunately that won't work so well on geni.  If we got a timeout
transferring then it's likely that our interrupt handler is blocked,
but we need that same interrupt handler to adjust the chip select.
Trying to set the chip select doesn't crash us but ends up confusing
our state machine and leads to messages like:
  Premature done. rx_rem = 32 bpw8

Let's just drop the chip select request in this case.  Sure, we might
leave the chip select in the wrong state but it's likely it was going
to fail anyway and this avoids getting the driver even more confused
about what it's doing.

The SPI core in general assumes that setting chip select is a simple
operation that doesn't fail.  Yet another reason to just reconfigure
the chip select line as GPIOs.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2.

 drivers/spi/spi-geni-qcom.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index d988463e606f..0e4fa52ac017 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -204,9 +204,14 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 		goto exit;
 	}
 
-	mas->cs_flag = set_flag;
-
 	spin_lock_irq(&mas->lock);
+	if (mas->cur_xfer) {
+		dev_err(mas->dev, "Can't set CS when prev xfter running\n");
+		spin_unlock_irq(&mas->lock);
+		goto exit;
+	}
+
+	mas->cs_flag = set_flag;
 	reinit_completion(&mas->cs_done);
 	if (set_flag)
 		geni_se_setup_m_cmd(se, SPI_CS_ASSERT, 0);
-- 
2.29.2.684.gfbc64c5ab5-goog


* [PATCH v2 4/4] spi: spi-geni-qcom: Print an error when we timeout setting the CS
  2020-12-16 22:41 [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
  2020-12-16 22:41 ` [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending Douglas Anderson
  2020-12-16 22:41 ` [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending Douglas Anderson
@ 2020-12-16 22:41 ` Douglas Anderson
  2020-12-17  3:41 ` [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
  3 siblings, 0 replies; 14+ messages in thread
From: Douglas Anderson @ 2020-12-16 22:41 UTC (permalink / raw)
  To: Mark Brown
  Cc: msavaliy, akashast, Stephen Boyd, Roja Rani Yarubandi,
	Douglas Anderson, Andy Gross, Bjorn Andersson, linux-arm-msm,
	linux-kernel, linux-spi

If we're using geni to manage the chip select line (don't do it--use a
GPIO!) and we happen to get a timeout waiting for the chip select
command to be completed, no errors are printed even though things
might not be in the best shape.  Let's add a print.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

Changes in v2:
- ("spi: spi-geni-qcom: Print an error when we timeout setting the CS") new for v2

 drivers/spi/spi-geni-qcom.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index 0e4fa52ac017..744009875762 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -220,8 +220,10 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	spin_unlock_irq(&mas->lock);
 
 	time_left = wait_for_completion_timeout(&mas->cs_done, HZ);
-	if (!time_left)
+	if (!time_left) {
+		dev_warn(mas->dev, "Timeout setting chip select\n");
 		handle_fifo_timeout(spi, NULL);
+	}
 
 exit:
 	pm_runtime_put(mas->dev);
-- 
2.29.2.684.gfbc64c5ab5-goog


* Re: [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
  2020-12-16 22:41 [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
                   ` (2 preceding siblings ...)
  2020-12-16 22:41 ` [PATCH v2 4/4] spi: spi-geni-qcom: Print an error when we timeout setting the CS Douglas Anderson
@ 2020-12-17  3:41 ` Stephen Boyd
  3 siblings, 0 replies; 14+ messages in thread
From: Stephen Boyd @ 2020-12-17  3:41 UTC (permalink / raw)
  To: Douglas Anderson, Mark Brown
  Cc: msavaliy, akashast, Roja Rani Yarubandi, Douglas Anderson,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Dilip Kota,
	Girish Mahadevan, linux-arm-msm, linux-kernel, linux-spi

Quoting Douglas Anderson (2020-12-16 14:41:49)
> In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
> variable") we changed handle_fifo_timeout() so that we set
> "mas->cur_xfer" to NULL to make absolutely sure that we don't mess
> with the buffers from the previous transfer in the timeout case.
> 
> Unfortunately, this caused the IRQ handler to dereference NULL in some
> cases.  One case:
> 
>   CPU0                           CPU1
>   ----                           ----
>                                  setup_fifo_xfer()
>                                   geni_se_setup_m_cmd()
>                                  <hardware starts transfer>
>                                  <transfer completes in hardware>
>                                  <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq>
>                                  ...
>                                  handle_fifo_timeout()
>                                   spin_lock_irq(mas->lock)
>                                   mas->cur_xfer = NULL
>                                   geni_se_cancel_m_cmd()
>                                   spin_unlock_irq(mas->lock)
> 
>   geni_spi_isr()
>    spin_lock(mas->lock)
>    if (m_irq & M_RX_FIFO_WATERMARK_EN)
>     geni_spi_handle_rx()
>      mas->cur_xfer NULL dereference!
> 
> tl;dr: Seriously delayed interrupts for RX/TX can lead to timeout
> handling setting mas->cur_xfer to NULL.
> 
> Let's check for the NULL transfer in the TX and RX cases and reset the
> watermark or clear out the fifo respectively to put the hardware back
> into a sane state.
> 
> NOTE: things still could get confused if we get timeouts all the way
> through handle_fifo_timeout() and then start a new transfer because
> interrupts from the old transfer / cancel / abort could still be
> pending.  A future patch will help this corner case.
> 
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Minor nits below.

Reviewed-by: Stephen Boyd <swboyd@chromium.org>

> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index 25810a7eef10..bf55abbd39f1 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -354,6 +354,12 @@ static bool geni_spi_handle_tx(struct spi_geni_master *mas)
>         unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas);
>         unsigned int i = 0;
>  
> +       /* Stop the watermark IRQ if nothing to send */
> +       if (!mas->cur_xfer) {
> +               writel(0, se->base + SE_GENI_TX_WATERMARK_REG);
> +               return false;
> +       }
> +
>         max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word;
>         if (mas->tx_rem_bytes < max_bytes)
>                 max_bytes = mas->tx_rem_bytes;
> @@ -396,6 +402,14 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas)
>                 if (rx_last_byte_valid && rx_last_byte_valid < 4)
>                         rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid;
>         }
> +
> +       /* Clear out the FIFO and bail if nowhere to put it */
> +       if (mas->cur_xfer == NULL) {

This is still == NULL though. :(

> +               while (i++ < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word))

A for loop is also fine, but I think it would push the 80 character
limit which is probably OK. I'm happy either way, but a for loop is
usually easier to reason about. I suggested while to fit within 80
characters :/

		for (i = 0; i < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word); i++)

> +                       readl(se->base + SE_GENI_RX_FIFOn);
> +               return;
> +       }

* Re: [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
  2020-12-16 22:41 ` [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending Douglas Anderson
@ 2020-12-17  4:21   ` Stephen Boyd
  2020-12-17 21:45     ` Doug Anderson
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Boyd @ 2020-12-17  4:21 UTC (permalink / raw)
  To: Douglas Anderson, Mark Brown
  Cc: msavaliy, akashast, Roja Rani Yarubandi, Douglas Anderson,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Dilip Kota,
	Girish Mahadevan, linux-arm-msm, linux-kernel, linux-spi

Quoting Douglas Anderson (2020-12-16 14:41:50)
> If we got a timeout when trying to send an abort command then it means
> that we just got 3 timeouts in a row:
> 
> 1. The original timeout that caused handle_fifo_timeout() to be
>    called.
> 2. A one second timeout waiting for the cancel command to finish.
> 3. A one second timeout waiting for the abort command to finish.
> 
> SPI is clocked by the controller, so nothing (aside from a hardware
> fault or a totally broken sequencer) should be causing the actual
> commands to fail in hardware.  However, even though the hardware
> itself is not expected to fail (and it'd be hard to predict how we
> should handle things if it did), it's easy to hit the timeout case by
> simply blocking our interrupt handler from running for a long period
> of time.  Obviously the system is in pretty bad shape if an interrupt
> handler is blocked for > 2 seconds, but there are certainly bugs (even
> bugs in other unrelated drivers) that can make this happen.
> 
> Let's make things a bit more robust against this case.  If we fail to
> abort we'll set a flag and then we'll block all future transfers until
> we have no more interrupts pending.

Why can't we forcibly roll the ball forward and clear the irq if it's a
cancel/abort that's pending? Basically tell the hardware that we
understand it did the job and canceled things out but our sad little CPU
didn't run that irq handler yet. Here have a cookie and get ready for
the next transfer.

	if (M_CMD_CANCEL_EN || M_CMD_ABORT_EN) /* but not the other irqs like CMD_DONE or refill fifos */
		writel(M_CMD_CANCEL_EN | M_CMD_ABORT_EN, se->base + SE_GENI_M_IRQ_CLEAR);

This would let us limp along and try to send another transfer in the
case that we timed out but the transfer went through by servicing our
own interrupt.
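
A slightly more explicit version of the idea above, written as a
userspace model rather than driver code.  The bit positions and the
write-1-to-clear behaviour of the clear register are assumptions made
for the sketch, and the model_ names are hypothetical.  Unlike the
two-line snippet, it tests the status value and acknowledges only the
cancel/abort bits that are actually set, leaving CMD_DONE and the FIFO
bits for the normal handler.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions; the real ones live in the GENI SE header. */
#define MODEL_M_CMD_DONE_EN	(1u << 0)
#define MODEL_M_CMD_CANCEL_EN	(1u << 4)
#define MODEL_M_CMD_ABORT_EN	(1u << 5)

struct model_regs {
	uint32_t m_irq_status;
};

/* Assumed write-1-to-clear semantics for the IRQ clear register. */
static void model_irq_clear(struct model_regs *regs, uint32_t bits)
{
	regs->m_irq_status &= ~bits;
}

/*
 * Acknowledge stale cancel/abort interrupts on behalf of a handler
 * that never got to run.  Returns the bits that were cleared.
 */
static uint32_t model_ack_stale_cancel_abort(struct model_regs *regs)
{
	uint32_t stale;

	assert(regs);
	stale = regs->m_irq_status &
		(MODEL_M_CMD_CANCEL_EN | MODEL_M_CMD_ABORT_EN);
	if (stale)
		model_irq_clear(regs, stale);
	return stale;
}
```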

> 
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
> Changes in v2:
> - Make this just about the failed abort.
> 
>  drivers/spi/spi-geni-qcom.c | 56 +++++++++++++++++++++++++++++++++++--
>  1 file changed, 54 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index bf55abbd39f1..d988463e606f 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -83,6 +83,7 @@ struct spi_geni_master {
>         spinlock_t lock;
>         int irq;
>         bool cs_flag;
> +       bool abort_failed;
>  };
>  
>  static int get_spi_clk_cfg(unsigned int speed_hz,
> @@ -141,8 +142,46 @@ static void handle_fifo_timeout(struct spi_master *spi,
>         spin_unlock_irq(&mas->lock);
>  
>         time_left = wait_for_completion_timeout(&mas->abort_done, HZ);
> -       if (!time_left)
> +       if (!time_left) {
>                 dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
> +
> +               /*
> +                * No need for a lock since SPI core has a lock and we never
> +                * access this from an interrupt.
> +                */
> +               mas->abort_failed = true;
> +       }
> +}
> +
> +static bool spi_geni_is_abort_still_pending(struct spi_geni_master *mas)
> +{
> +       struct geni_se *se = &mas->se;
> +       u32 m_irq, m_irq_en;
> +
> +       if (!mas->abort_failed)
> +               return false;
> +
> +       /*
> +        * The only known case where a transfer times out and then a cancel
> +        * times out then an abort times out is if something is blocking our
> +        * interrupt handler from running.  Avoid starting any new transfers
> +        * until that sorts itself out.
> +        */
> +       m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> +       m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);

I suppose this could race with the irq handler. Maybe we should grab the
irq lock around the register reads so we can synchronize with the irq
handler and save a fail?

> +       if (m_irq & m_irq_en) {
> +               dev_err(mas->dev, "Interrupts pending after abort: %#010x\n",
> +                       m_irq & m_irq_en);
> +               return true;
> +       }
> +

* Re: [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-16 22:41 ` [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending Douglas Anderson
@ 2020-12-17  4:25   ` Stephen Boyd
  2020-12-17 13:25     ` Mark Brown
  2020-12-17 21:21     ` Doug Anderson
  2020-12-17  4:29   ` Stephen Boyd
  1 sibling, 2 replies; 14+ messages in thread
From: Stephen Boyd @ 2020-12-17  4:25 UTC (permalink / raw)
  To: Douglas Anderson, Mark Brown
  Cc: msavaliy, akashast, Roja Rani Yarubandi, Douglas Anderson,
	Andy Gross, Bjorn Andersson, linux-arm-msm, linux-kernel,
	linux-spi

Quoting Douglas Anderson (2020-12-16 14:41:51)
> If we get a timeout sending then this happens:
> * spi_transfer_wait() will get a timeout.
> * We'll set the chip select
> * We'll call handle_err() => handle_fifo_timeout().
> 
> Unfortunately that won't work so well on geni.  If we got a timeout
> transferring then it's likely that our interrupt handler is blocked,
> but we need that same interrupt handler to adjust the chip select.
> Trying to set the chip select doesn't crash us but ends up confusing
> our state machine and leads to messages like:
>   Premature done. rx_rem = 32 bpw8
> 
> Let's just drop the chip select request in this case.  Sure, we might
> leave the chip select in the wrong state but it's likely it was going
> to fail anyway and this avoids getting the driver even more confused
> about what it's doing.
> 
> The SPI core in general assumes that setting chip select is a simple
> operation that doesn't fail.  Yet another reason to just reconfigure
> the chip select line as GPIOs.

Indeed.

> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
> Changes in v2:
> - ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2.
> 
>  drivers/spi/spi-geni-qcom.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index d988463e606f..0e4fa52ac017 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -204,9 +204,14 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
>                 goto exit;
>         }
>  
> -       mas->cs_flag = set_flag;
> -
>         spin_lock_irq(&mas->lock);
> +       if (mas->cur_xfer) {

How is it possible that cs change happens when cur_xfer is non-NULL?

> +               dev_err(mas->dev, "Can't set CS when prev xfter running\n");

xfer? or xfter?

> +               spin_unlock_irq(&mas->lock);
> +               goto exit;
> +       }
> +
> +       mas->cs_flag = set_flag;
>         reinit_completion(&mas->cs_done);

* Re: [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-16 22:41 ` [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending Douglas Anderson
  2020-12-17  4:25   ` Stephen Boyd
@ 2020-12-17  4:29   ` Stephen Boyd
  2020-12-17 21:35     ` Doug Anderson
  1 sibling, 1 reply; 14+ messages in thread
From: Stephen Boyd @ 2020-12-17  4:29 UTC (permalink / raw)
  To: Douglas Anderson, Mark Brown
  Cc: msavaliy, akashast, Roja Rani Yarubandi, Douglas Anderson,
	Andy Gross, Bjorn Andersson, linux-arm-msm, linux-kernel,
	linux-spi

Quoting Douglas Anderson (2020-12-16 14:41:51)
> If we get a timeout sending then this happens:
> * spi_transfer_wait() will get a timeout.
> * We'll set the chip select
> * We'll call handle_err() => handle_fifo_timeout().
> 
> Unfortunately that won't work so well on geni.  If we got a timeout
> transferring then it's likely that our interrupt handler is blocked,
> but we need that same interrupt handler to adjust the chip select.
> Trying to set the chip select doesn't crash us but ends up confusing
> our state machine and leads to messages like:
>   Premature done. rx_rem = 32 bpw8
> 
> Let's just drop the chip select request in this case.  Sure, we might
> leave the chip select in the wrong state but it's likely it was going
> to fail anyway and this avoids getting the driver even more confused
> about what it's doing.
> 
> The SPI core in general assumes that setting chip select is a simple
> operation that doesn't fail.  Yet another reason to just reconfigure
> the chip select line as GPIOs.

BTW, we could peek at the irq bit for the CS change and ignore the irq
handler entirely. That would be one way to make sure the cs change went
through, and would avoid an irq delay/scheduling problem for this simple
operation. Maybe using the irq path is worse in general here?
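
To make the polling idea concrete, here is a rough userspace sketch.
It is not taken from the driver: the model_hw struct, the latency
counter, and the bit position of the done flag are all invented for
illustration, and a real implementation would also have to keep the
interrupt handler from consuming the same status bit.  The sketch only
shows the bounded poll-then-acknowledge loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_M_CMD_DONE_EN	(1u << 0)	/* illustrative bit position */

/* Hypothetical hardware: DONE shows up after `latency` status reads. */
struct model_hw {
	uint32_t m_irq_status;
	unsigned int latency;
};

static uint32_t model_read_status(struct model_hw *hw)
{
	if (hw->latency == 0)
		hw->m_irq_status |= MODEL_M_CMD_DONE_EN;
	else
		hw->latency--;
	return hw->m_irq_status;
}

/*
 * Poll for completion of the chip select command without relying on
 * the interrupt handler.  Returns true and acknowledges the bit if
 * DONE was seen within max_polls reads, false on timeout.
 */
static bool model_poll_cs_done(struct model_hw *hw, unsigned int max_polls)
{
	unsigned int i;

	assert(hw);
	for (i = 0; i < max_polls; i++) {
		if (model_read_status(hw) & MODEL_M_CMD_DONE_EN) {
			hw->m_irq_status &= ~MODEL_M_CMD_DONE_EN;
			return true;
		}
	}
	return false;
}
```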

* Re: [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-17  4:25   ` Stephen Boyd
@ 2020-12-17 13:25     ` Mark Brown
  2020-12-17 21:21     ` Doug Anderson
  1 sibling, 0 replies; 14+ messages in thread
From: Mark Brown @ 2020-12-17 13:25 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Douglas Anderson, msavaliy, akashast, Roja Rani Yarubandi,
	Andy Gross, Bjorn Andersson, linux-arm-msm, linux-kernel,
	linux-spi

On Wed, Dec 16, 2020 at 08:25:18PM -0800, Stephen Boyd wrote:

> >         spin_lock_irq(&mas->lock);
> > +       if (mas->cur_xfer) {
> 
> How is it possible that cs change happens when cur_xfer is non-NULL?

We will set the initial chip select state during controller setup.

* Re: [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-17  4:25   ` Stephen Boyd
  2020-12-17 13:25     ` Mark Brown
@ 2020-12-17 21:21     ` Doug Anderson
  1 sibling, 0 replies; 14+ messages in thread
From: Doug Anderson @ 2020-12-17 21:21 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Andy Gross, Bjorn Andersson, linux-arm-msm, LKML, linux-spi

Hi,

On Wed, Dec 16, 2020 at 8:25 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2020-12-16 14:41:51)
> > If we get a timeout sending then this happens:
> > * spi_transfer_wait() will get a timeout.
> > * We'll set the chip select
> > * We'll call handle_err() => handle_fifo_timeout().
> >
> > Unfortunately that won't work so well on geni.  If we got a timeout
> > transferring then it's likely that our interrupt handler is blocked,
> > but we need that same interrupt handler to adjust the chip select.
> > Trying to set the chip select doesn't crash us but ends up confusing
> > our state machine and leads to messages like:
> >   Premature done. rx_rem = 32 bpw8
> >
> > Let's just drop the chip select request in this case.  Sure, we might
> > leave the chip select in the wrong state but it's likely it was going
> > to fail anyway and this avoids getting the driver even more confused
> > about what it's doing.
> >
> > The SPI core in general assumes that setting chip select is a simple
> > operation that doesn't fail.  Yet another reason to just reconfigure
> > the chip select line as GPIOs.
>
> Indeed.
>
> >
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
> >
> > Changes in v2:
> > - ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2.
> >
> >  drivers/spi/spi-geni-qcom.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> > index d988463e606f..0e4fa52ac017 100644
> > --- a/drivers/spi/spi-geni-qcom.c
> > +++ b/drivers/spi/spi-geni-qcom.c
> > @@ -204,9 +204,14 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
> >                 goto exit;
> >         }
> >
> > -       mas->cs_flag = set_flag;
> > -
> >         spin_lock_irq(&mas->lock);
> > +       if (mas->cur_xfer) {
>
> How is it possible that cs change happens when cur_xfer is non-NULL?

I'll add this to the commit message:

spi_transfer_one_message()
 ->transfer_one() AKA spi_geni_transfer_one()
  setup_fifo_xfer()
   mas->cur_xfer = non-NULL
 spi_transfer_wait() => TIMES OUT
 msg->status != -EINPROGRESS => goto out
 if (ret != 0 ...)
  spi_set_cs()
   ->set_cs AKA spi_geni_set_cs()
    # mas->cur_xfer is non-NULL

Specifically the place where cur_xfer is usually made NULL is in the
interrupt handler.  If that doesn't run then it will be non-NULL.
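
The same sequence as a tiny userspace model, for anyone who wants to
step through it.  All model_ names are hypothetical and the SPI core is
reduced to direct function calls; the only thing carried over from the
driver is the rule that cur_xfer is set when the transfer is queued and
cleared by the interrupt handler.  The bool return value is also an
invention of the model, since the real set_cs callback returns void.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_xfer {
	int len;
};

struct model_master {
	struct model_xfer *cur_xfer;
	bool cs_flag;
};

/* Queuing a transfer records it as the current one. */
static void model_setup_fifo_xfer(struct model_master *mas,
				  struct model_xfer *xfer)
{
	mas->cur_xfer = xfer;
}

/* On normal completion only the interrupt handler clears cur_xfer. */
static void model_isr_done(struct model_master *mas)
{
	mas->cur_xfer = NULL;
}

/* Returns true if the chip select request was accepted. */
static bool model_set_cs(struct model_master *mas, bool set_flag)
{
	assert(mas);

	if (set_flag == mas->cs_flag)
		return true;	/* already in the requested state */

	/* A transfer is still outstanding: drop the request. */
	if (mas->cur_xfer)
		return false;

	mas->cs_flag = set_flag;
	return true;
}
```

If model_isr_done() is never called between model_setup_fifo_xfer() and
model_set_cs(), the request is dropped and cs_flag keeps its old value,
which is the timeout path described above.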


> > +               dev_err(mas->dev, "Can't set CS when prev xfter running\n");
>
> xfer? or xfter?

Will fix.


-Doug

* Re: [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-17  4:29   ` Stephen Boyd
@ 2020-12-17 21:35     ` Doug Anderson
  2020-12-18  2:51       ` Stephen Boyd
  0 siblings, 1 reply; 14+ messages in thread
From: Doug Anderson @ 2020-12-17 21:35 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Andy Gross, Bjorn Andersson, linux-arm-msm, LKML, linux-spi

Hi,

On Wed, Dec 16, 2020 at 8:29 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> > The SPI core in general assumes that setting chip select is a simple
> > operation that doesn't fail.  Yet another reason to just reconfigure
> > the chip select line as GPIOs.
>
> BTW, we could peek at the irq bit for the CS change and ignore the irq
> handler entirely. That would be one way to make sure the cs change went
> through, and would avoid an irq delay/scheduling problem for this simple
> operation. Maybe using the irq path is worse in general here?

Yes, when I was in my SPI optimization phase I actually coded this up
and thought about it.  It can be made to work and is probably
marginally faster at the expense of more cpu cycles to poll the
interrupt line.  I also don't think it fixes this issue nor does it
simplify things...  :(

1. If there are already commands pending we still have to do something
about them.  Said another way: there's not a separate channel just for
setting the chip select, so if the single command channel is gummed up
then using polling mode to handle the chip select won't really un-gum
it up.

2. In order to use polling mode to set the chip select we have to do
_something_ to temporarily disable our interrupt handler.  If we don't
do that then the interrupt handler will fire for the "DONE" when we
send the chip select command.


If we wanted to truly make this driver super robust against ridiculous
interrupt latencies then, presumably, we could handle the SPI timeout
ourselves but before timing out we could check to see if the
interrupts were pending.  Then we could disable our interrupts,
synchronize our interrupt handler, handle the interrupt directly, and
then re-enable interrupts.  If we did this then transfers could
continue to eke their way through even if interrupts were completely
blocked.  IMO, it's not worth it.  I'm satisfied with not crashing and
not getting the state machine too out-of-whack.
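
To show the shape of the idea being declined here, a rough userspace
model follows.  Everything in it is hypothetical (fake_ctrl, fake_isr(),
fake_rescue_on_timeout(), the single placeholder status bit); in a real
driver the "disable" step would also have to wait for any running
handler to finish, which this sketch only notes in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_IRQ_DONE	(1u << 0)	/* placeholder bit, not a real register layout */

struct fake_ctrl {
	uint32_t irq_status;	/* latched interrupt bits not yet serviced */
	bool irq_enabled;	/* models the IRQ line being unmasked */
	unsigned int serviced;	/* how many times the handler body did work */
};

/* Handler body: acknowledge whatever is latched. */
static void fake_isr(struct fake_ctrl *c)
{
	if (c->irq_status) {
		c->irq_status = 0;
		c->serviced++;
	}
}

/*
 * Called when the transfer wait expires.  Returns true if the "timeout"
 * was really an unserviced interrupt that got handled by hand, false if
 * nothing was pending and the transfer genuinely timed out.
 */
static bool fake_rescue_on_timeout(struct fake_ctrl *c)
{
	assert(c != NULL);

	if (!c->irq_status)
		return false;

	c->irq_enabled = false;	/* a driver would disable and synchronize here */
	fake_isr(c);		/* run the handler body directly */
	c->irq_enabled = true;

	return true;
}
```

Even in this toy form the extra state and ordering requirements are
visible, which is consistent with the "not worth it" conclusion.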

-Doug

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
  2020-12-17  4:21   ` Stephen Boyd
@ 2020-12-17 21:45     ` Doug Anderson
  2020-12-18  2:53       ` Stephen Boyd
  0 siblings, 1 reply; 14+ messages in thread
From: Doug Anderson @ 2020-12-17 21:45 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Dilip Kota,
	Girish Mahadevan, linux-arm-msm, LKML, linux-spi

Hi,

On Wed, Dec 16, 2020 at 8:21 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2020-12-16 14:41:50)
> > If we got a timeout when trying to send an abort command then it means
> > that we just got 3 timeouts in a row:
> >
> > 1. The original timeout that caused handle_fifo_timeout() to be
> >    called.
> > 2. A one second timeout waiting for the cancel command to finish.
> > 3. A one second timeout waiting for the abort command to finish.
> >
> > SPI is clocked by the controller, so nothing (aside from a hardware
> > fault or a totally broken sequencer) should be causing the actual
> > commands to fail in hardware.  However, even though the hardware
> > itself is not expected to fail (and it'd be hard to predict how we
> > should handle things if it did), it's easy to hit the timeout case by
> > simply blocking our interrupt handler from running for a long period
> > of time.  Obviously the system is in pretty bad shape if an interrupt
> > handler is blocked for > 2 seconds, but there are certainly bugs (even
> > bugs in other unrelated drivers) that can make this happen.
> >
> > Let's make things a bit more robust against this case.  If we fail to
> > abort we'll set a flag and then we'll block all future transfers until
> > we have no more interrupts pending.
>
> Why can't we forcibly roll the ball forward and clear the irq if it's a
> cancel/abort that's pending? Basically tell the hardware that we
> understand it did the job and canceled things out but our sad little CPU
> didn't run that irq handler yet. Here have a cookie and get ready for
> the next transfer.
>
>         if (M_CMD_CANCEL_EN || M_CMD_ABORT_EN) /* but not the other irqs like CMD_DONE or refill fifos */
>                 writel(M_CMD_CANCEL_EN | M_CMD_ABORT_EN, se->base + SE_GENI_M_IRQ_CLEAR);
>
> This would let us limp along and try to send another transfer in the
> case that we timed out but the transfer went through by servicing our
> own interrupt.

A few problems:

1. The cancel and abort are commands and they generate a "done"
interrupt along with their "cancel" and/or "abort".  Clearing the
cancel/abort without the done will leave things in a much more
confused state.

2. If we timed out all the way down then there's probably _also_
interrupts for the previous transfer still pending.  Those would also
need to be cleared.  ...and we'd need to disable watermarks / read
pending data.

3. Even if we tried to solve all that, we're probably still in
terrible shape.  Sure, we could try to start another transfer, but if
the previous one failed because the interrupt handler was blocked then
the next one is just going to fail too so all the extra complexity we
just added to handle this was likely wasted.
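
Points 1 and 2 can be illustrated with a small userspace model of the
status register.  The bit positions below are placeholders chosen for
the sketch, not values from the hardware headers, and the
write-1-to-clear behaviour of the clear register is an assumption of
the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit positions, for illustration only. */
#define FAKE_M_CMD_DONE		(1u << 0)
#define FAKE_M_CMD_CANCEL	(1u << 1)
#define FAKE_M_CMD_ABORT	(1u << 2)
#define FAKE_M_RX_WATERMARK	(1u << 3)

struct fake_se {
	uint32_t irq_status;	/* models the latched main IRQ status */
};

/* Model of a write-1-to-clear register: each written bit is cleared. */
static void fake_irq_clear(struct fake_se *se, uint32_t clear_mask)
{
	assert(se != NULL);
	se->irq_status &= ~clear_mask;
}

/* True if any interrupt is still latched. */
static bool fake_irqs_pending(const struct fake_se *se)
{
	assert(se != NULL);
	return se->irq_status != 0;
}
```

Starting from a status of DONE | CANCEL | RX_WATERMARK and clearing only
CANCEL | ABORT leaves DONE and the watermark bit latched, so the next
transfer would begin with stale interrupts from the old one.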


The whole fact that you need to send little packets to the sequencer
(and wait for an interrupt to tell you that it got your packet) for
every last thing really doesn't work super well and it's just
especially bad for chip select.


> > +static bool spi_geni_is_abort_still_pending(struct spi_geni_master *mas)
> > +{
> > +       struct geni_se *se = &mas->se;
> > +       u32 m_irq, m_irq_en;
> > +
> > +       if (!mas->abort_failed)
> > +               return false;
> > +
> > +       /*
> > +        * The only known case where a transfer times out and then a cancel
> > +        * times out then an abort times out is if something is blocking our
> > +        * interrupt handler from running.  Avoid starting any new transfers
> > +        * until that sorts itself out.
> > +        */
> > +       m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> > +       m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
>
> I suppose this could race with the irq handler. Maybe we should grab the
> irq lock around the register reads so we can synchronize with the irq
> handler and save a fail?

I don't _think_ it'll matter a whole lot but it also won't hurt, so sure.
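
A userspace sketch of what taking the lock around the two reads could
look like.  All names are hypothetical: a C11 atomic_flag stands in for
the driver's spinlock and two plain fields stand in for the status and
enable registers.  The final check (any enabled interrupt still
latched) is an assumption about what follows the reads, since the
quoted hunk stops before that point.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fake_master {
	atomic_flag lock;	/* stands in for the driver's spinlock */
	uint32_t irq_status;	/* models SE_GENI_M_IRQ_STATUS */
	uint32_t irq_en;	/* models SE_GENI_M_IRQ_EN */
	bool abort_failed;	/* set when an abort command timed out */
};

static void fake_lock(struct fake_master *mas)
{
	while (atomic_flag_test_and_set_explicit(&mas->lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void fake_unlock(struct fake_master *mas)
{
	atomic_flag_clear_explicit(&mas->lock, memory_order_release);
}

static bool fake_is_abort_still_pending(struct fake_master *mas)
{
	uint32_t m_irq, m_irq_en;

	assert(mas != NULL);

	if (!mas->abort_failed)
		return false;

	/* Read both values under the lock so they form one snapshot. */
	fake_lock(mas);
	m_irq = mas->irq_status;
	m_irq_en = mas->irq_en;
	fake_unlock(mas);

	/* Assumed check: still pending if an enabled interrupt is latched. */
	return (m_irq & m_irq_en) != 0;
}
```

With the lock held, a handler that also takes the lock cannot clear the
status between the two reads, so the pair is always consistent.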

-Doug

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
  2020-12-17 21:35     ` Doug Anderson
@ 2020-12-18  2:51       ` Stephen Boyd
  0 siblings, 0 replies; 14+ messages in thread
From: Stephen Boyd @ 2020-12-18  2:51 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Andy Gross, Bjorn Andersson, linux-arm-msm, LKML, linux-spi

Quoting Doug Anderson (2020-12-17 13:35:08)
> 
> If we wanted to truly make this driver super robust against ridiculous
> interrupt latencies then, presumably, we could handle the SPI timeout
> ourselves but before timing out we could check to see if the
> interrupts were pending.  Then we could disable our interrupts,
> synchronize our interrupt handler, handle the interrupt directly, and
> then re-enable interrupts.  If we did this then transfers could
> continue to eke their way through even if interrupts were completely
> blocked.  IMO, it's not worth it.  I'm satisfied with not crashing and
> not getting the state machine too out-of-whack.
> 

Ok that's fair. If it's not worth the effort then let's drop this idea.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
  2020-12-17 21:45     ` Doug Anderson
@ 2020-12-18  2:53       ` Stephen Boyd
  0 siblings, 0 replies; 14+ messages in thread
From: Stephen Boyd @ 2020-12-18  2:53 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Dilip Kota,
	Girish Mahadevan, linux-arm-msm, LKML, linux-spi

Quoting Doug Anderson (2020-12-17 13:45:18)
> 
> On Wed, Dec 16, 2020 at 8:21 PM Stephen Boyd <swboyd@chromium.org> wrote:
> >
> >         if (M_CMD_CANCEL_EN || M_CMD_ABORT_EN) /* but not the other irqs like CMD_DONE or refill fifos */
> >                 writel(M_CMD_CANCEL_EN | M_CMD_ABORT_EN, se->base + SE_GENI_M_IRQ_CLEAR);
> >
> > This would let us limp along and try to send another transfer in the
> > case that we timed out but the transfer went through by servicing our
> > own interrupt.
> 
> A few problems:
> 
> 1. The cancel and abort are commands and they generate a "done"
> interrupt along with their "cancel" and/or "abort".  Clearing the
> cancel/abort without the done will leave things in a much more
> confused state.

Ah, I didn't know that the DONE bit was set even for cancel/abort, but
that makes sense given that it's a sequencer and everything that goes
into the sequencer eventually gets "DONE". I agree that a DONE bit left
hanging around would really confuse things, so it's best to drop the
idea and figure out why interrupt latencies are leading to this problem.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-12-18  2:54 UTC | newest]

Thread overview: 14+ messages
2020-12-16 22:41 [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
2020-12-16 22:41 ` [PATCH v2 2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending Douglas Anderson
2020-12-17  4:21   ` Stephen Boyd
2020-12-17 21:45     ` Doug Anderson
2020-12-18  2:53       ` Stephen Boyd
2020-12-16 22:41 ` [PATCH v2 3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending Douglas Anderson
2020-12-17  4:25   ` Stephen Boyd
2020-12-17 13:25     ` Mark Brown
2020-12-17 21:21     ` Doug Anderson
2020-12-17  4:29   ` Stephen Boyd
2020-12-17 21:35     ` Doug Anderson
2020-12-18  2:51       ` Stephen Boyd
2020-12-16 22:41 ` [PATCH v2 4/4] spi: spi-geni-qcom: Print an error when we timeout setting the CS Douglas Anderson
2020-12-17  3:41 ` [PATCH v2 1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
