linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
@ 2020-12-15  0:30 Douglas Anderson
  2020-12-15  0:30 ` [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one Douglas Anderson
  2020-12-15  2:29 ` [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
  0 siblings, 2 replies; 10+ messages in thread
From: Douglas Anderson @ 2020-12-15  0:30 UTC (permalink / raw)
  To: Mark Brown
  Cc: msavaliy, akashast, Stephen Boyd, Roja Rani Yarubandi,
	Douglas Anderson, Alok Chauhan, Andy Gross, Bjorn Andersson,
	Girish Mahadevan, linux-arm-msm, linux-kernel, linux-spi

In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
variable") we changed handle_fifo_timeout() so that we set
"mas->cur_xfer" to NULL to make absolutely sure that we don't mess
with the buffers from the previous transfer in the timeout case.

Unfortunately, this caused the IRQ handler to dereference NULL in some
cases.  One case:

 CPU0                           CPU1
 ----                           ----
                                setup_fifo_xfer()
                                 ...
                                 geni_se_setup_m_cmd()
                                 <hardware starts transfer>
 <unrelated interrupt storm>     spin_unlock_irq()
 <continued interrupt storm>    <time passes>
 <continued interrupt storm>    <transfer complets in hardware>
 <continued interrupt storm>    <hardware sets M_RX_FIFO_WATERMARK_EN>
 <continued interrupt storm>    <time passes>
 <continued interrupt storm>    handle_fifo_timeout()
 <continued interrupt storm>     spin_lock_irq()
 <continued interrupt storm>     mas->cur_xfer = NULL
 <continued interrupt storm>     geni_se_cancel_m_cmd()
 <continued interrupt storm>     spin_unlock_irq()
 <continued interrupt storm>     wait_for_completion_timeout() => timeout
 <continued interrupt storm>     spin_lock_irq()
 <continued interrupt storm>     geni_se_abort_m_cmd()
 <continued interrupt storm>     spin_unlock_irq()
 <continued interrupt storm>     wait_for_completion_timeout() => timeout
 <interrupt storm ends>
 geni_spi_isr()
  spin_lock()
  if (m_irq & M_RX_FIFO_WATERMARK_EN)
   geni_spi_handle_rx()
    mas->cur_xfer NULL derefrence

Specifically it should be noted that the RX/TX interrupts are still
shown asserted even when a CANCEL/ABORT interrupt has asserted.

Let's check for the NULL transfer in the TX and RX cases.

NOTE: things still could get confused if we get timeouts all the way
through handle_fifo_timeout(), meaning that interrupts are still
pending.  A future patch will help these corner cases.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/spi/spi-geni-qcom.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index 25810a7eef10..6f736e94e9f4 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -354,6 +354,12 @@ static bool geni_spi_handle_tx(struct spi_geni_master *mas)
 	unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas);
 	unsigned int i = 0;
 
+	/* Stop the watermark IRQ if nothing to send */
+	if (mas->cur_xfer == NULL) {
+		writel(0, se->base + SE_GENI_TX_WATERMARK_REG);
+		return false;
+	}
+
 	max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word;
 	if (mas->tx_rem_bytes < max_bytes)
 		max_bytes = mas->tx_rem_bytes;
@@ -396,6 +402,17 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas)
 		if (rx_last_byte_valid && rx_last_byte_valid < 4)
 			rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid;
 	}
+
+	/* Clear out the FIFO and bail if nowhere to put it */
+	if (mas->cur_xfer == NULL) {
+		unsigned int words = DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word);
+
+		for (i = 0; i < words; i++)
+			readl(se->base + SE_GENI_RX_FIFOn);
+
+		return;
+	}
+
 	if (mas->rx_rem_bytes < rx_bytes)
 		rx_bytes = mas->rx_rem_bytes;
 
-- 
2.29.2.684.gfbc64c5ab5-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-15  0:30 [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
@ 2020-12-15  0:30 ` Douglas Anderson
  2020-12-15  2:57   ` Stephen Boyd
  2020-12-15  2:29 ` [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
  1 sibling, 1 reply; 10+ messages in thread
From: Douglas Anderson @ 2020-12-15  0:30 UTC (permalink / raw)
  To: Mark Brown
  Cc: msavaliy, akashast, Stephen Boyd, Roja Rani Yarubandi,
	Douglas Anderson, Alok Chauhan, Andy Gross, Bjorn Andersson,
	Dilip Kota, Girish Mahadevan, linux-arm-msm, linux-kernel,
	linux-spi

In commit 2ee471a1e28e ("spi: spi-geni-qcom: Mo' betta locking") we
added a dance in setup_fifo_xfer() to make sure that the previous
transfer was really done before we setup the next one.  However, it
wasn't enough.  Specifically, if we had a timeout it's possible that
the previous transfer could still be pending.  This could happen if
our interrupt handler was blocked for a long while (interrupt storm or
someone disablng IRQs for a while).  This pending interrupt could
throw off our logic.

Let's really make sure that the previous interrupt isn't still pending
before we start the next transfer.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/spi/spi-geni-qcom.c | 69 ++++++++++++++++++++++++++++---------
 1 file changed, 53 insertions(+), 16 deletions(-)

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index 6f736e94e9f4..5ef2e9f38ac9 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -145,12 +145,49 @@ static void handle_fifo_timeout(struct spi_master *spi,
 		dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
 }
 
+static int spi_geni_check_busy(struct spi_geni_master *mas)
+{
+	struct geni_se *se = &mas->se;
+	u32 m_irq, m_irq_en;
+
+	/*
+	 * We grab the spinlock so that if we raced really fast and the IRQ
+	 * handler is still actually running we'll wait for it to exit.  This
+	 * can happen because the IRQ handler may signal in the middle of the
+	 * function and the next transfer can kick off right away.
+	 *
+	 * Once we have the spinlock, if we're starting a new transfer we
+	 * expect nothing is pending.  We check this to handle the case where
+	 * the previous transfer timed out and then handle_fifo_timeout() timed
+	 * out.  This can happen if the interrupt handler was blocked for
+	 * a long time and we don't want to start any new transfers until it's
+	 * all done.
+	 *
+	 * We are OK releasing the spinlock after we're done here since (if
+	 * we're returning 0 and going ahead with the transfer) we know that
+	 * the SPI controller must be in a quiet state.
+	 */
+	spin_lock_irq(&mas->lock);
+	m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
+	m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
+	spin_unlock_irq(&mas->lock);
+
+	if (m_irq & m_irq_en) {
+		dev_err(mas->dev, "Busy, IRQs pending %#010x\n",
+			m_irq & m_irq_en);
+		return -EBUSY;
+	}
+
+	return 0;
+}
+
 static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 {
 	struct spi_geni_master *mas = spi_master_get_devdata(slv->master);
 	struct spi_master *spi = dev_get_drvdata(mas->dev);
 	struct geni_se *se = &mas->se;
 	unsigned long time_left;
+	int ret;
 
 	if (!(slv->mode & SPI_CS_HIGH))
 		set_flag = !set_flag;
@@ -158,6 +195,12 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
 	if (set_flag == mas->cs_flag)
 		return;
 
+	ret = spi_geni_check_busy(mas);
+	if (ret) {
+		dev_err(mas->dev, "Can't set chip select\n");
+		return;
+	}
+
 	mas->cs_flag = set_flag;
 
 	pm_runtime_get_sync(mas->dev);
@@ -277,8 +320,12 @@ static int setup_fifo_params(struct spi_device *spi_slv,
 static int spi_geni_prepare_message(struct spi_master *spi,
 					struct spi_message *spi_msg)
 {
-	int ret;
 	struct spi_geni_master *mas = spi_master_get_devdata(spi);
+	int ret;
+
+	ret = spi_geni_check_busy(mas);
+	if (ret)
+		return ret;
 
 	ret = setup_fifo_params(spi_msg->spi, spi);
 	if (ret)
@@ -440,21 +487,6 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
 	struct geni_se *se = &mas->se;
 	int ret;
 
-	/*
-	 * Ensure that our interrupt handler isn't still running from some
-	 * prior command before we start messing with the hardware behind
-	 * its back.  We don't need to _keep_ the lock here since we're only
-	 * worried about racing with out interrupt handler.  The SPI core
-	 * already handles making sure that we're not trying to do two
-	 * transfers at once or setting a chip select and doing a transfer
-	 * concurrently.
-	 *
-	 * NOTE: we actually _can't_ hold the lock here because possibly we
-	 * might call clk_set_rate() which needs to be able to sleep.
-	 */
-	spin_lock_irq(&mas->lock);
-	spin_unlock_irq(&mas->lock);
-
 	if (xfer->bits_per_word != mas->cur_bits_per_word) {
 		spi_setup_word_len(mas, mode, xfer->bits_per_word);
 		mas->cur_bits_per_word = xfer->bits_per_word;
@@ -511,6 +543,11 @@ static int spi_geni_transfer_one(struct spi_master *spi,
 				struct spi_transfer *xfer)
 {
 	struct spi_geni_master *mas = spi_master_get_devdata(spi);
+	int ret;
+
+	ret = spi_geni_check_busy(mas);
+	if (ret)
+		return ret;
 
 	/* Terminate and return success for 0 byte length transfer */
 	if (!xfer->len)
-- 
2.29.2.684.gfbc64c5ab5-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
  2020-12-15  0:30 [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
  2020-12-15  0:30 ` [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one Douglas Anderson
@ 2020-12-15  2:29 ` Stephen Boyd
  2020-12-16 22:42   ` Doug Anderson
  1 sibling, 1 reply; 10+ messages in thread
From: Stephen Boyd @ 2020-12-15  2:29 UTC (permalink / raw)
  To: Douglas Anderson, Mark Brown
  Cc: msavaliy, akashast, Roja Rani Yarubandi, Douglas Anderson,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Girish Mahadevan,
	linux-arm-msm, linux-kernel, linux-spi

Quoting Douglas Anderson (2020-12-14 16:30:18)
> In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
> variable") we changed handle_fifo_timeout() so that we set
> "mas->cur_xfer" to NULL to make absolutely sure that we don't mess
> with the buffers from the previous transfer in the timeout case.
> 
> Unfortunately, this caused the IRQ handler to dereference NULL in some
> cases.  One case:
> 
>  CPU0                           CPU1
>  ----                           ----
>                                 setup_fifo_xfer()
>                                  ...
>                                  geni_se_setup_m_cmd()
>                                  <hardware starts transfer>
>  <unrelated interrupt storm>     spin_unlock_irq()
>  <continued interrupt storm>    <time passes>

Use ... for "time passes"

>  <continued interrupt storm>    <transfer complets in hardware>

s/complets/completes/

>  <continued interrupt storm>    <hardware sets M_RX_FIFO_WATERMARK_EN>

I'd rather just say handle_irq() or something instead of have <continued
interrupt storm> over here. Would make it easier to read and we can then
just assume that the geni_spi_isr() hasn't run. Or nothing at all and
just indicate that the irq for geni_spi_isr() comes in after the timeout
handling code.

>  <continued interrupt storm>    <time passes>
>  <continued interrupt storm>    handle_fifo_timeout()
>  <continued interrupt storm>     spin_lock_irq()
>  <continued interrupt storm>     mas->cur_xfer = NULL

From here

>  <continued interrupt storm>     geni_se_cancel_m_cmd()
>  <continued interrupt storm>     spin_unlock_irq()
>  <continued interrupt storm>     wait_for_completion_timeout() => timeout
>  <continued interrupt storm>     spin_lock_irq()
>  <continued interrupt storm>     geni_se_abort_m_cmd()
>  <continued interrupt storm>     spin_unlock_irq()
>  <continued interrupt storm>     wait_for_completion_timeout() => timeout

to here, these lines can be left out?

>  <interrupt storm ends>
>  geni_spi_isr()
>   spin_lock()
>   if (m_irq & M_RX_FIFO_WATERMARK_EN)
>    geni_spi_handle_rx()
>     mas->cur_xfer NULL derefrence

s/derefrence/dereference/

Here's a shortened version:

  CPU0                           CPU1
  ----                           ----
                                 setup_fifo_xfer()
                                  geni_se_setup_m_cmd()
                                 <hardware starts transfer>
                                 <transfer completes in hardware>
                                 <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq>
				 ...
                                 handle_fifo_timeout()
                                  spin_lock_irq(mas->lock)
                                  mas->cur_xfer = NULL
                                  geni_se_cancel_m_cmd()
                                  spin_unlock_irq(mas->lock)

  geni_spi_isr()
   spin_lock(mas->lock)
   if (m_irq & M_RX_FIFO_WATERMARK_EN)
    geni_spi_handle_rx()
     mas->cur_xfer NULL dereference!

Two CPUs also don't really matter but I guess that's fine.

> 
> Specifically it should be noted that the RX/TX interrupts are still
> shown asserted even when a CANCEL/ABORT interrupt has asserted.

Can we have 'TL;DR: Seriously delayed interrupts for RX/TX can lead to
timeout handling setting mas->cur_xfer to NULL.'?

> 
> Let's check for the NULL transfer in the TX and RX cases.

and reset the watermark or clear out the fifo respectively to put the
hardware back into a sane state.

> 
> NOTE: things still could get confused if we get timeouts all the way
> through handle_fifo_timeout(), meaning that interrupts are still
> pending.  A future patch will help these corner cases.
> 
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> 
>  drivers/spi/spi-geni-qcom.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index 25810a7eef10..6f736e94e9f4 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -354,6 +354,12 @@ static bool geni_spi_handle_tx(struct spi_geni_master *mas)
>         unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas);
>         unsigned int i = 0;
>  
> +       /* Stop the watermark IRQ if nothing to send */
> +       if (mas->cur_xfer == NULL) {
> +               writel(0, se->base + SE_GENI_TX_WATERMARK_REG);
> +               return false;
> +       }
> +
>         max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word;
>         if (mas->tx_rem_bytes < max_bytes)
>                 max_bytes = mas->tx_rem_bytes;
> @@ -396,6 +402,17 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas)
>                 if (rx_last_byte_valid && rx_last_byte_valid < 4)
>                         rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid;
>         }
> +
> +       /* Clear out the FIFO and bail if nowhere to put it */
> +       if (mas->cur_xfer == NULL) {

I think if (!mas->cur_xfer) is more kernel idiomatic, but sure.

> +               unsigned int words = DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word);

Any chance to move this define up to the start of the function instead
of putting it here inside the if? Or just stick it into the for loop.
It's to avoid shadow variables.

> +
> +               for (i = 0; i < words; i++)

		while (i++ < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word))
			readl(se->base + SE_GENI_RX_FIFOn);

> +
> +               return;
> +       }
> +
>         if (mas->rx_rem_bytes < rx_bytes)
>                 rx_bytes = mas->rx_rem_bytes;
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-15  0:30 ` [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one Douglas Anderson
@ 2020-12-15  2:57   ` Stephen Boyd
  2020-12-15 17:25     ` Doug Anderson
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Boyd @ 2020-12-15  2:57 UTC (permalink / raw)
  To: Douglas Anderson, Mark Brown
  Cc: msavaliy, akashast, Roja Rani Yarubandi, Douglas Anderson,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Dilip Kota,
	Girish Mahadevan, linux-arm-msm, linux-kernel, linux-spi

Quoting Douglas Anderson (2020-12-14 16:30:19)
> diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> index 6f736e94e9f4..5ef2e9f38ac9 100644
> --- a/drivers/spi/spi-geni-qcom.c
> +++ b/drivers/spi/spi-geni-qcom.c
> @@ -145,12 +145,49 @@ static void handle_fifo_timeout(struct spi_master *spi,
>                 dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
>  }
>  
> +static int spi_geni_check_busy(struct spi_geni_master *mas)

Maybe spi_geni_is_busy() and return bool?

> +{
> +       struct geni_se *se = &mas->se;
> +       u32 m_irq, m_irq_en;
> +
> +       /*
> +        * We grab the spinlock so that if we raced really fast and the IRQ
> +        * handler is still actually running we'll wait for it to exit.  This
> +        * can happen because the IRQ handler may signal in the middle of the
> +        * function and the next transfer can kick off right away.
> +        *
> +        * Once we have the spinlock, if we're starting a new transfer we
> +        * expect nothing is pending.  We check this to handle the case where
> +        * the previous transfer timed out and then handle_fifo_timeout() timed
> +        * out.  This can happen if the interrupt handler was blocked for
> +        * a long time and we don't want to start any new transfers until it's
> +        * all done.
> +        *
> +        * We are OK releasing the spinlock after we're done here since (if
> +        * we're returning 0 and going ahead with the transfer) we know that
> +        * the SPI controller must be in a quiet state.
> +        */
> +       spin_lock_irq(&mas->lock);
> +       m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> +       m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
> +       spin_unlock_irq(&mas->lock);
> +
> +       if (m_irq & m_irq_en) {

Is this really "busy" though? If we canceled something out then maybe
the irq has fired but what if it's to tell us that we have some
available space in the TX fifo? Does that really matter? It seems like
if we have an RX irq when we're starting a transfer that might be bad
too but we could forcibly clear that by acking it here and then setting
the fifo word count that we're expecting for rx?

Put another way, why isn't this driver looking at the TX and RX fifo
status registers more than in one place?

> +               dev_err(mas->dev, "Busy, IRQs pending %#010x\n",
> +                       m_irq & m_irq_en);
> +               return -EBUSY;
> +       }
> +
> +       return 0;
> +}
> +
>  static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
>  {
>         struct spi_geni_master *mas = spi_master_get_devdata(slv->master);
>         struct spi_master *spi = dev_get_drvdata(mas->dev);
>         struct geni_se *se = &mas->se;
>         unsigned long time_left;
> +       int ret;
>  
>         if (!(slv->mode & SPI_CS_HIGH))
>                 set_flag = !set_flag;
> @@ -158,6 +195,12 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag)
>         if (set_flag == mas->cs_flag)
>                 return;
>  
> +       ret = spi_geni_check_busy(mas);
> +       if (ret) {

	if (spi_geni_is_busy())

> +               dev_err(mas->dev, "Can't set chip select\n");
> +               return;
> +       }
> +
>         mas->cs_flag = set_flag;
>  
>         pm_runtime_get_sync(mas->dev);
> @@ -277,8 +320,12 @@ static int setup_fifo_params(struct spi_device *spi_slv,
>  static int spi_geni_prepare_message(struct spi_master *spi,
>                                         struct spi_message *spi_msg)
>  {
> -       int ret;
>         struct spi_geni_master *mas = spi_master_get_devdata(spi);
> +       int ret;
> +
> +       ret = spi_geni_check_busy(mas);
> +       if (ret)

	if (spi_geni_is_busy())
		return -EBUSY;

> +               return ret;
>  
>         ret = setup_fifo_params(spi_msg->spi, spi);
>         if (ret)
> @@ -440,21 +487,6 @@ static void setup_fifo_xfer(struct spi_transfer *xfer,
>         struct geni_se *se = &mas->se;
>         int ret;
>  
> -       /*
> -        * Ensure that our interrupt handler isn't still running from some
> -        * prior command before we start messing with the hardware behind
> -        * its back.  We don't need to _keep_ the lock here since we're only
> -        * worried about racing with out interrupt handler.  The SPI core
> -        * already handles making sure that we're not trying to do two
> -        * transfers at once or setting a chip select and doing a transfer
> -        * concurrently.
> -        *
> -        * NOTE: we actually _can't_ hold the lock here because possibly we
> -        * might call clk_set_rate() which needs to be able to sleep.
> -        */
> -       spin_lock_irq(&mas->lock);
> -       spin_unlock_irq(&mas->lock);
> -
>         if (xfer->bits_per_word != mas->cur_bits_per_word) {
>                 spi_setup_word_len(mas, mode, xfer->bits_per_word);
>                 mas->cur_bits_per_word = xfer->bits_per_word;
> @@ -511,6 +543,11 @@ static int spi_geni_transfer_one(struct spi_master *spi,
>                                 struct spi_transfer *xfer)
>  {
>         struct spi_geni_master *mas = spi_master_get_devdata(spi);
> +       int ret;
> +
> +       ret = spi_geni_check_busy(mas);
> +       if (ret)
> +               return ret;

	if (spi_geni_is_busy())
		return -EBUSY;
>  
>         /* Terminate and return success for 0 byte length transfer */
>         if (!xfer->len)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-15  2:57   ` Stephen Boyd
@ 2020-12-15 17:25     ` Doug Anderson
  2020-12-15 22:25       ` Stephen Boyd
  0 siblings, 1 reply; 10+ messages in thread
From: Doug Anderson @ 2020-12-15 17:25 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Dilip Kota,
	Girish Mahadevan, linux-arm-msm, LKML, linux-spi

Hi,

On Mon, Dec 14, 2020 at 6:57 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2020-12-14 16:30:19)
> > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> > index 6f736e94e9f4..5ef2e9f38ac9 100644
> > --- a/drivers/spi/spi-geni-qcom.c
> > +++ b/drivers/spi/spi-geni-qcom.c
> > @@ -145,12 +145,49 @@ static void handle_fifo_timeout(struct spi_master *spi,
> >                 dev_err(mas->dev, "Failed to cancel/abort m_cmd\n");
> >  }
> >
> > +static int spi_geni_check_busy(struct spi_geni_master *mas)
>
> Maybe spi_geni_is_busy() and return bool?

Yeah, that's cleaner, thanks.


> > +       spin_lock_irq(&mas->lock);
> > +       m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> > +       m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
> > +       spin_unlock_irq(&mas->lock);
> > +
> > +       if (m_irq & m_irq_en) {
>
> Is this really "busy" though? If we canceled something out then maybe
> the irq has fired but what if it's to tell us that we have some
> available space in the TX fifo? Does that really matter? It seems like
> if we have an RX irq when we're starting a transfer that might be bad
> too but we could forcibly clear that by acking it here and then setting
> the fifo word count that we're expecting for rx?
>
> Put another way, why isn't this driver looking at the TX and RX fifo
> status registers more than in one place?

I'm not sure I understand all your concerns.  Can you clarify?  In
case it helps, I'll add a few thoughts here:

1. SPI is a controller clocked protocol and this is the driver for the
controller.  There is no way to get a RX IRQ unless we initiate it.

2. The code always takes care to make sure that when we're done with a
transfer that we disable the TX watermark.  This means we won't get
any more interrupts.

The only time an interrupt could still be pending when we start a new
transfer is:

a) If the interrupt handler is still running on another CPU.  In that
case it will have the spinlock and won't release it until it clears
the interrupts.

b) If we had a timeout on the previous transfer and then got timeouts
sending the cancel and abort.

In general when we're starting a new transfer we assume that we can
program the hardware willy-nilly.  If there's some chance something
else is happening (or our interrupt could go off) then it breaks that
whole model.

I've addressed all your other concerns and I'm ready to send out v2,
but I'll hold off until you confirm that the above explanation
satisfies your questions.  If you can think of any extra comments
somewhere that would help document that better I'm happy to add them
into the commit or commit message.


-Doug

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-15 17:25     ` Doug Anderson
@ 2020-12-15 22:25       ` Stephen Boyd
  2020-12-15 23:34         ` Doug Anderson
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Boyd @ 2020-12-15 22:25 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, linux-arm-msm, LKML,
	linux-spi

Quoting Doug Anderson (2020-12-15 09:25:51)
> On Mon, Dec 14, 2020 at 6:57 PM Stephen Boyd <swboyd@chromium.org> wrote:
> >
> > Quoting Douglas Anderson (2020-12-14 16:30:19)
> > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> > > index 6f736e94e9f4..5ef2e9f38ac9 100644
> > > --- a/drivers/spi/spi-geni-qcom.c
> > > +++ b/drivers/spi/spi-geni-qcom.c
> 
> > > +       spin_lock_irq(&mas->lock);
> > > +       m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> > > +       m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
> > > +       spin_unlock_irq(&mas->lock);
> > > +
> > > +       if (m_irq & m_irq_en) {
> >
> > Is this really "busy" though? If we canceled something out then maybe
> > the irq has fired but what if it's to tell us that we have some
> > available space in the TX fifo? Does that really matter? It seems like
> > if we have an RX irq when we're starting a transfer that might be bad
> > too but we could forcibly clear that by acking it here and then setting
> > the fifo word count that we're expecting for rx?
> >
> > Put another way, why isn't this driver looking at the TX and RX fifo
> > status registers more than in one place?
> 
> I'm not sure I understand all your concerns.  Can you clarify?  In
> case it helps, I'll add a few thoughts here:
> 
> 1. SPI is a controller clocked protocol and this is the driver for the
> controller.  There is no way to get a RX IRQ unless we initiate it.
> 
> 2. The code always takes care to make sure that when we're done with a
> transfer that we disable the TX watermark.  This means we won't get
> any more interrupts.
> 
> The only time an interrupt could still be pending when we start a new
> transfer is:
> 
> a) If the interrupt handler is still running on another CPU.  In that
> case it will have the spinlock and won't release it until it clears
> the interrupts.
> 
> b) If we had a timeout on the previous transfer and then got timeouts
> sending the cancel and abort.
> 
> In general when we're starting a new transfer we assume that we can
> program the hardware willy-nilly.  If there's some chance something
> else is happening (or our interrupt could go off) then it breaks that
> whole model.

Right. I thought this patch was making sure that the hardware wasn't in
the process of doing something else when we setup the transfer. I'm
saying that only checking the irq misses the fact that maybe the
transfer hasn't completed yet or a pending irq hasn't come in yet, but
the fifo status would tell us that the fifo is transferring something or
receiving something. If an RX can't happen, then the code should clearly
show that an RX irq isn't expected, and mask out that bit so it is
ignored or explicitly check for it and call WARN_ON() if the bit is set.

I'm wondering why we don't check the FIFO status and the irq bits to
make sure that some previous cancelled operation isn't still pending
either in the FIFO or as an irq. While this patch will fix the scenario
where the irq is delayed but pending in the hardware it won't cover the
case that the hardware itself is wedged, for example because the
sequencer just decided to stop working entirely.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-15 22:25       ` Stephen Boyd
@ 2020-12-15 23:34         ` Doug Anderson
  2020-12-16  1:17           ` Stephen Boyd
  0 siblings, 1 reply; 10+ messages in thread
From: Doug Anderson @ 2020-12-15 23:34 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, linux-arm-msm, LKML,
	linux-spi

Hi,

On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Doug Anderson (2020-12-15 09:25:51)
> > On Mon, Dec 14, 2020 at 6:57 PM Stephen Boyd <swboyd@chromium.org> wrote:
> > >
> > > Quoting Douglas Anderson (2020-12-14 16:30:19)
> > > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
> > > > index 6f736e94e9f4..5ef2e9f38ac9 100644
> > > > --- a/drivers/spi/spi-geni-qcom.c
> > > > +++ b/drivers/spi/spi-geni-qcom.c
> >
> > > > +       spin_lock_irq(&mas->lock);
> > > > +       m_irq = readl(se->base + SE_GENI_M_IRQ_STATUS);
> > > > +       m_irq_en = readl(se->base + SE_GENI_M_IRQ_EN);
> > > > +       spin_unlock_irq(&mas->lock);
> > > > +
> > > > +       if (m_irq & m_irq_en) {
> > >
> > > Is this really "busy" though? If we canceled something out then maybe
> > > the irq has fired but what if it's to tell us that we have some
> > > available space in the TX fifo? Does that really matter? It seems like
> > > if we have an RX irq when we're starting a transfer that might be bad
> > > too but we could forcibly clear that by acking it here and then setting
> > > the fifo word count that we're expecting for rx?
> > >
> > > Put another way, why isn't this driver looking at the TX and RX fifo
> > > status registers more than in one place?
> >
> > I'm not sure I understand all your concerns.  Can you clarify?  In
> > case it helps, I'll add a few thoughts here:
> >
> > 1. SPI is a controller clocked protocol and this is the driver for the
> > controller.  There is no way to get a RX IRQ unless we initiate it.
> >
> > 2. The code always takes care to make sure that when we're done with a
> > transfer that we disable the TX watermark.  This means we won't get
> > any more interrupts.
> >
> > The only time an interrupt could still be pending when we start a new
> > transfer is:
> >
> > a) If the interrupt handler is still running on another CPU.  In that
> > case it will have the spinlock and won't release it until it clears
> > the interrupts.
> >
> > b) If we had a timeout on the previous transfer and then got timeouts
> > sending the cancel and abort.
> >
> > In general when we're starting a new transfer we assume that we can
> > program the hardware willy-nilly.  If there's some chance something
> > else is happening (or our interrupt could go off) then it breaks that
> > whole model.
>
> Right. I thought this patch was making sure that the hardware wasn't in
> the process of doing something else when we setup the transfer. I'm
> saying that only checking the irq misses the fact that maybe the
> transfer hasn't completed yet or a pending irq hasn't come in yet, but
> the fifo status would tell us that the fifo is transferring something or
> receiving something. If an RX can't happen, then the code should clearly
> show that an RX irq isn't expected, and mask out that bit so it is
> ignored or explicitly check for it and call WARN_ON() if the bit is set.
>
> I'm wondering why we don't check the FIFO status and the irq bits to
> make sure that some previous cancelled operation isn't still pending
> either in the FIFO or as an irq. While this patch will fix the scenario
> where the irq is delayed but pending in the hardware it won't cover the
> case that the hardware itself is wedged, for example because the
> sequencer just decided to stop working entirely.

It also won't catch the case where the SoC decided that all GPIOs are
inverted and starts reporting highs for lows and lows for highs, nor
does it handle the case where the CPU suddenly switches to Big Endian
mode for no reason.  :-P

...by that, I mean I'm not trying to catch the case where the hardware
itself is behaving in a totally unexpected way.  I have seen no
instances where the hardware wedges nor where the sequencer stops
working and until I see them happen I'm not inclined to add code for
them.  Without seeing them actually happen I'm not really sure what
the right way to recover would be.  We've already tried "cancel" and
"abort" and then waited at least 1 second.  If you know of some sort
of magic "unwedge" then we should add it into handle_fifo_timeout().

However, super delayed interrupts due to software not servicing the
interrupt in time is something that really happens, if rarely.  Adding
code to account for that seems worth it and is easy to test...

Did I convince you?  ;-)

-Doug

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-15 23:34         ` Doug Anderson
@ 2020-12-16  1:17           ` Stephen Boyd
  2020-12-16 22:42             ` Doug Anderson
  0 siblings, 1 reply; 10+ messages in thread
From: Stephen Boyd @ 2020-12-16  1:17 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, linux-arm-msm, LKML,
	linux-spi

Quoting Doug Anderson (2020-12-15 15:34:59)
> On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@chromium.org> wrote:
> >
> > Quoting Doug Anderson (2020-12-15 09:25:51)
> > > In general when we're starting a new transfer we assume that we can
> > > program the hardware willy-nilly.  If there's some chance something
> > > else is happening (or our interrupt could go off) then it breaks that
> > > whole model.
> >
> > Right. I thought this patch was making sure that the hardware wasn't in
> > the process of doing something else when we setup the transfer. I'm
> > saying that only checking the irq misses the fact that maybe the
> > transfer hasn't completed yet or a pending irq hasn't come in yet, but
> > the fifo status would tell us that the fifo is transferring something or
> > receiving something. If an RX can't happen, then the code should clearly
> > show that an RX irq isn't expected, and mask out that bit so it is
> > ignored or explicitly check for it and call WARN_ON() if the bit is set.
> >
> > I'm wondering why we don't check the FIFO status and the irq bits to
> > make sure that some previous cancelled operation isn't still pending
> > either in the FIFO or as an irq. While this patch will fix the scenario
> > where the irq is delayed but pending in the hardware it won't cover the
> > case that the hardware itself is wedged, for example because the
> > sequencer just decided to stop working entirely.
> 
> It also won't catch the case where the SoC decided that all GPIOs are
> inverted and starts reporting highs for lows and lows for highs, nor
> does it handle the case where the CPU suddenly switches to Big Endian
> mode for no reason.  :-P
> 
> ...by that, I mean I'm not trying to catch the case where the hardware
> itself is behaving in a totally unexpected way.  I have seen no
> instances where the hardware wedges nor where the sequencer stops
> working and until I see them happen I'm not inclined to add code for
> them.  Without seeing them actually happen I'm not really sure what
> the right way to recover would be.  We've already tried "cancel" and
> "abort" and then waited at least 1 second.  If you know of some sort
> of magic "unwedge" then we should add it into handle_fifo_timeout().

I am not aware of an "unwedge" command. Presumably the cancel/abort
stuff makes the FIFO state "sane" so there's nothing to see in the FIFO
status registers. I wonder if we should keep around some "did we cancel
last time?" flag and only check the isr if we canceled out and timed
out to boot? That would be a cheap and easy check to make sure that we
don't check this each transaction.

> 
> However, super delayed interrupts due to software not servicing the
> interrupt in time is something that really happens, if rarely.  Adding
> code to account for that seems worth it and is easy to test...
> 

Agreed. The function name is wrong then as the device is not "busy". So
maybe spi_geni_isr_pending()? That would clearly describe what's being
checked.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one
  2020-12-16  1:17           ` Stephen Boyd
@ 2020-12-16 22:42             ` Doug Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Anderson @ 2020-12-16 22:42 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, linux-arm-msm, LKML,
	linux-spi

Hi,

On Tue, Dec 15, 2020 at 5:18 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Doug Anderson (2020-12-15 15:34:59)
> > On Tue, Dec 15, 2020 at 2:25 PM Stephen Boyd <swboyd@chromium.org> wrote:
> > >
> > > Quoting Doug Anderson (2020-12-15 09:25:51)
> > > > In general when we're starting a new transfer we assume that we can
> > > > program the hardware willy-nilly.  If there's some chance something
> > > > else is happening (or our interrupt could go off) then it breaks that
> > > > whole model.
> > >
> > > Right. I thought this patch was making sure that the hardware wasn't in
> > > the process of doing something else when we setup the transfer. I'm
> > > saying that only checking the irq misses the fact that maybe the
> > > transfer hasn't completed yet or a pending irq hasn't come in yet, but
> > > the fifo status would tell us that the fifo is transferring something or
> > > receiving something. If an RX can't happen, then the code should clearly
> > > show that an RX irq isn't expected, and mask out that bit so it is
> > > ignored or explicitly check for it and call WARN_ON() if the bit is set.
> > >
> > > I'm wondering why we don't check the FIFO status and the irq bits to
> > > make sure that some previous cancelled operation isn't still pending
> > > either in the FIFO or as an irq. While this patch will fix the scenario
> > > where the irq is delayed but pending in the hardware it won't cover the
> > > case that the hardware itself is wedged, for example because the
> > > sequencer just decided to stop working entirely.
> >
> > It also won't catch the case where the SoC decided that all GPIOs are
> > inverted and starts reporting highs for lows and lows for highs, nor
> > does it handle the case where the CPU suddenly switches to Big Endian
> > mode for no reason.  :-P
> >
> > ...by that, I mean I'm not trying to catch the case where the hardware
> > itself is behaving in a totally unexpected way.  I have seen no
> > instances where the hardware wedges nor where the sequencer stops
> > working and until I see them happen I'm not inclined to add code for
> > them.  Without seeing them actually happen I'm not really sure what
> > the right way to recover would be.  We've already tried "cancel" and
> > "abort" and then waited at least 1 second.  If you know of some sort
> > of magic "unwedge" then we should add it into handle_fifo_timeout().
>
> I am not aware of an "unwedge" command. Presumably the cancel/abort
> stuff makes the FIFO state "sane" so there's nothing to see in the FIFO
> status registers. I wonder if we should keep around some "did we cancel
> last time?" flag and only check the isr if we canceled out and timed
> out to boot? That would be a cheap and easy check to make sure that we
> don't check this each transaction.

Sure.  I guess technically it would be a "did we fail to cancel last time".


> > However, super delayed interrupts due to software not servicing the
> > interrupt in time is something that really happens, if rarely.  Adding
> > code to account for that seems worth it and is easy to test...
> >
>
> Agreed. The function name is wrong then as the device is not "busy". So
> maybe spi_geni_isr_pending()? That would clearly describe what's being
> checked.

I changed this to just be about the abort.  See if v2 looks better to you.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
  2020-12-15  2:29 ` [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
@ 2020-12-16 22:42   ` Doug Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Anderson @ 2020-12-16 22:42 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: Mark Brown, msavaliy, Akash Asthana, Roja Rani Yarubandi,
	Alok Chauhan, Andy Gross, Bjorn Andersson, Girish Mahadevan,
	linux-arm-msm, LKML, linux-spi

Hi,

On Mon, Dec 14, 2020 at 6:29 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Here's a shortened version:
>
>   CPU0                           CPU1
>   ----                           ----
>                                  setup_fifo_xfer()
>                                   geni_se_setup_m_cmd()
>                                  <hardware starts transfer>
>                                  <transfer completes in hardware>
>                                  <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq>
>                                  ...
>                                  handle_fifo_timeout()
>                                   spin_lock_irq(mas->lock)
>                                   mas->cur_xfer = NULL
>                                   geni_se_cancel_m_cmd()
>                                   spin_unlock_irq(mas->lock)
>
>   geni_spi_isr()
>    spin_lock(mas->lock)
>    if (m_irq & M_RX_FIFO_WATERMARK_EN)
>     geni_spi_handle_rx()
>      mas->cur_xfer NULL dereference!
>
> Two CPUs also don't really matter but I guess that's fine.

OK, replaced it with your version.


> > Specifically it should be noted that the RX/TX interrupts are still
> > shown asserted even when a CANCEL/ABORT interrupt has asserted.
>
> Can we have 'TL;DR: Seriously delayed interrupts for RX/TX can lead to
> timeout handling setting mas->cur_xfer to NULL.'?

Sure, added this.  ...but made the super important change that "tl;dr"
is more conventionally lower case.  :-P


> > Let's check for the NULL transfer in the TX and RX cases.
>
> and reset the watermark or clear out the fifo respectively to put the
> hardware back into a sane state.

Sure.


> > @@ -396,6 +402,17 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas)
> >                 if (rx_last_byte_valid && rx_last_byte_valid < 4)
> >                         rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid;
> >         }
> > +
> > +       /* Clear out the FIFO and bail if nowhere to put it */
> > +       if (mas->cur_xfer == NULL) {
>
> I think if (!mas->cur_xfer) is more kernel idiomatic, but sure.

I've been yelled at both ways, but changed it to your way here.


> > +               for (i = 0; i < words; i++)
>
>                 while (i++ < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word))
>                         readl(se->base + SE_GENI_RX_FIFOn);

Sure, that's fine.  I was marginally worried that the compiler
wouldn't know it could optimize the test and would do the divide every
time, but I guess that's pretty dang unlikely and also not a place we
really care about optimizing a lot.  I'm also not a huge fan of
relying on loop counters being initted at the start of the function,
but I guess it's OK.  Changed to your syntax.



-Doug

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-12-16 22:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-15  0:30 [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Douglas Anderson
2020-12-15  0:30 ` [PATCH 2/2] spi: spi-geni-qcom: Really ensure the previous xfer is done before new one Douglas Anderson
2020-12-15  2:57   ` Stephen Boyd
2020-12-15 17:25     ` Doug Anderson
2020-12-15 22:25       ` Stephen Boyd
2020-12-15 23:34         ` Doug Anderson
2020-12-16  1:17           ` Stephen Boyd
2020-12-16 22:42             ` Doug Anderson
2020-12-15  2:29 ` [PATCH 1/2] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case Stephen Boyd
2020-12-16 22:42   ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).