All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition
@ 2024-01-12 23:03 ` Douglas Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Douglas Anderson @ 2024-01-12 23:03 UTC (permalink / raw)
  To: Bjorn Andersson, Konrad Dybcio, Greg Kroah-Hartman
  Cc: linux-arm-kernel, Stephen Boyd, linux-serial, linux-arm-msm,
	Jiri Slaby, Douglas Anderson, linux-kernel

According to the docs I have, bit 21 of the status register is
asserted when the FIFO is _not_ empty. Add the definition.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 include/linux/soc/qcom/geni-se.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/soc/qcom/geni-se.h b/include/linux/soc/qcom/geni-se.h
index 29e06905bc1f..0f038a1a0330 100644
--- a/include/linux/soc/qcom/geni-se.h
+++ b/include/linux/soc/qcom/geni-se.h
@@ -178,6 +178,7 @@ struct geni_se {
 #define M_GP_IRQ_3_EN			BIT(12)
 #define M_GP_IRQ_4_EN			BIT(13)
 #define M_GP_IRQ_5_EN			BIT(14)
+#define M_TX_FIFO_NOT_EMPTY_EN		BIT(21)
 #define M_IO_DATA_DEASSERT_EN		BIT(22)
 #define M_IO_DATA_ASSERT_EN		BIT(23)
 #define M_RX_FIFO_RD_ERR_EN		BIT(24)
-- 
2.43.0.275.g3460e3d667-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition
@ 2024-01-12 23:03 ` Douglas Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Douglas Anderson @ 2024-01-12 23:03 UTC (permalink / raw)
  To: Bjorn Andersson, Konrad Dybcio, Greg Kroah-Hartman
  Cc: linux-arm-kernel, Stephen Boyd, linux-serial, linux-arm-msm,
	Jiri Slaby, Douglas Anderson, linux-kernel

According to the docs I have, bit 21 of the status register is
asserted when the FIFO is _not_ empty. Add the definition.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 include/linux/soc/qcom/geni-se.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/soc/qcom/geni-se.h b/include/linux/soc/qcom/geni-se.h
index 29e06905bc1f..0f038a1a0330 100644
--- a/include/linux/soc/qcom/geni-se.h
+++ b/include/linux/soc/qcom/geni-se.h
@@ -178,6 +178,7 @@ struct geni_se {
 #define M_GP_IRQ_3_EN			BIT(12)
 #define M_GP_IRQ_4_EN			BIT(13)
 #define M_GP_IRQ_5_EN			BIT(14)
+#define M_TX_FIFO_NOT_EMPTY_EN		BIT(21)
 #define M_IO_DATA_DEASSERT_EN		BIT(22)
 #define M_IO_DATA_ASSERT_EN		BIT(23)
 #define M_RX_FIFO_RD_ERR_EN		BIT(24)
-- 
2.43.0.275.g3460e3d667-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock
  2024-01-12 23:03 ` Douglas Anderson
@ 2024-01-12 23:03   ` Douglas Anderson
  -1 siblings, 0 replies; 10+ messages in thread
From: Douglas Anderson @ 2024-01-12 23:03 UTC (permalink / raw)
  To: Bjorn Andersson, Konrad Dybcio, Greg Kroah-Hartman
  Cc: linux-arm-kernel, Stephen Boyd, linux-serial, linux-arm-msm,
	Jiri Slaby, Douglas Anderson, linux-kernel

As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and
IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
enabled then we'll use it to stop CPUs at panic time. This is nice,
but it does mean that there's a pretty good chance that we'll end up
stopping a CPU while it holds the port lock for the console
UART. Specifically, I see a CPU get stopped while holding the port
lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
enabling the "buddy" hardlockup detector and then doing:

  sysctl -w kernel.hardlockup_all_cpu_backtrace=1
  sysctl -w kernel.hardlockup_panic=1
  echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT

UART drivers are _supposed_ to handle this case OK and this is why
UART drivers check "oops_in_progress" and only do a "trylock" in that
case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
very well-tested situation.

Now that we're testing the situation a lot, it can be seen that the
Qualcomm GENI UART driver is pretty broken. Specifically, when I run
my test case and look at the console output I just see a bunch of
garbled output like:

  [  201.069084] NMI backtrace[  201.069084] NM[  201.069087] CPU: 6
  PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
  #1 01112b9f14923cbd0b[  201.069090] Hardware name: Google Lazor
  ([  201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
  201.069095] pc : smp_call_function_man[  201.069099]

That's obviously not so great. This happens because each call to the
console driver exits after the data has been written to the FIFO but
before it's actually been flushed out of the serial port. When we have
multiple calls into the console one after the other then (if we can't
get the lock) each call tells the UART to throw away any data in the
FIFO that hadn't been transferred yet.

I've posted up a patch to change the arm64 core to avoid this
situation most of the time [1] much like x86 seems to do, but even if
that patch lands the GENI driver should still be fixed.

From testing, it appears that we can just delete the cancel/abort in
the case where we weren't able to get the UART lock and the output
looks good. It makes sense that we'd be able to do this since that
means we'll just call into __qcom_geni_serial_console_write() and
__qcom_geni_serial_console_write() looks much like
qcom_geni_serial_poll_put_char() but with a loop. However, it seems
safest to poll the FIFO and make sure it's empty before our
transfer. This should reliably make sure that we're not
interrupting/clobbering any existing transfers.

As part of this change, we'll also avoid re-setting up a TX at the end
of the console write function if we weren't able to get the lock,
since accessing "port->tx_remaining" without the lock is not
safe. This is only needed to re-start userspace initiated transfers.

[1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 7e78f97e8f43..06ebe62f99bc 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
 
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 
-	/* Cancel the current write to log the fault */
 	if (!locked) {
-		geni_se_cancel_m_cmd(&port->se);
-		if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-						M_CMD_CANCEL_EN, true)) {
-			geni_se_abort_m_cmd(&port->se);
-			qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-							M_CMD_ABORT_EN, true);
-			writel(M_CMD_ABORT_EN, uport->membase +
-							SE_GENI_M_IRQ_CLEAR);
-		}
-		writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+		/*
+		 * We can only get here if an oops is in progress then we were
+		 * unable to get the lock. This means we can't safely access
+		 * our state variables like tx_remaining. About the best we
+		 * can do is wait for the FIFO to be empty before we start our
+		 * transfer, so we'll do that.
+		 */
+		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+					  M_TX_FIFO_NOT_EMPTY_EN, false);
 	} else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
 		/*
 		 * It seems we can't interrupt existing transfers if all data
@@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
 
 	__qcom_geni_serial_console_write(uport, s, count);
 
-	if (port->tx_remaining)
-		qcom_geni_serial_setup_tx(uport, port->tx_remaining);
 
-	if (locked)
+	if (locked) {
+		if (port->tx_remaining)
+			qcom_geni_serial_setup_tx(uport, port->tx_remaining);
 		uart_port_unlock_irqrestore(uport, flags);
+	}
 }
 
 static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
-- 
2.43.0.275.g3460e3d667-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock
@ 2024-01-12 23:03   ` Douglas Anderson
  0 siblings, 0 replies; 10+ messages in thread
From: Douglas Anderson @ 2024-01-12 23:03 UTC (permalink / raw)
  To: Bjorn Andersson, Konrad Dybcio, Greg Kroah-Hartman
  Cc: linux-arm-kernel, Stephen Boyd, linux-serial, linux-arm-msm,
	Jiri Slaby, Douglas Anderson, linux-kernel

As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and
IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
enabled then we'll use it to stop CPUs at panic time. This is nice,
but it does mean that there's a pretty good chance that we'll end up
stopping a CPU while it holds the port lock for the console
UART. Specifically, I see a CPU get stopped while holding the port
lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
enabling the "buddy" hardlockup detector and then doing:

  sysctl -w kernel.hardlockup_all_cpu_backtrace=1
  sysctl -w kernel.hardlockup_panic=1
  echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT

UART drivers are _supposed_ to handle this case OK and this is why
UART drivers check "oops_in_progress" and only do a "trylock" in that
case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
very well-tested situation.

Now that we're testing the situation a lot, it can be seen that the
Qualcomm GENI UART driver is pretty broken. Specifically, when I run
my test case and look at the console output I just see a bunch of
garbled output like:

  [  201.069084] NMI backtrace[  201.069084] NM[  201.069087] CPU: 6
  PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
  #1 01112b9f14923cbd0b[  201.069090] Hardware name: Google Lazor
  ([  201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
  201.069095] pc : smp_call_function_man[  201.069099]

That's obviously not so great. This happens because each call to the
console driver exits after the data has been written to the FIFO but
before it's actually been flushed out of the serial port. When we have
multiple calls into the console one after the other then (if we can't
get the lock) each call tells the UART to throw away any data in the
FIFO that hadn't been transferred yet.

I've posted up a patch to change the arm64 core to avoid this
situation most of the time [1] much like x86 seems to do, but even if
that patch lands the GENI driver should still be fixed.

From testing, it appears that we can just delete the cancel/abort in
the case where we weren't able to get the UART lock and the output
looks good. It makes sense that we'd be able to do this since that
means we'll just call into __qcom_geni_serial_console_write() and
__qcom_geni_serial_console_write() looks much like
qcom_geni_serial_poll_put_char() but with a loop. However, it seems
safest to poll the FIFO and make sure it's empty before our
transfer. This should reliably make sure that we're not
interrupting/clobbering any existing transfers.

As part of this change, we'll also avoid re-setting up a TX at the end
of the console write function if we weren't able to get the lock,
since accessing "port->tx_remaining" without the lock is not
safe. This is only needed to re-start userspace initiated transfers.

[1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 7e78f97e8f43..06ebe62f99bc 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
 
 	geni_status = readl(uport->membase + SE_GENI_STATUS);
 
-	/* Cancel the current write to log the fault */
 	if (!locked) {
-		geni_se_cancel_m_cmd(&port->se);
-		if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-						M_CMD_CANCEL_EN, true)) {
-			geni_se_abort_m_cmd(&port->se);
-			qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
-							M_CMD_ABORT_EN, true);
-			writel(M_CMD_ABORT_EN, uport->membase +
-							SE_GENI_M_IRQ_CLEAR);
-		}
-		writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+		/*
+		 * We can only get here if an oops is in progress then we were
+		 * unable to get the lock. This means we can't safely access
+		 * our state variables like tx_remaining. About the best we
+		 * can do is wait for the FIFO to be empty before we start our
+		 * transfer, so we'll do that.
+		 */
+		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
+					  M_TX_FIFO_NOT_EMPTY_EN, false);
 	} else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
 		/*
 		 * It seems we can't interrupt existing transfers if all data
@@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
 
 	__qcom_geni_serial_console_write(uport, s, count);
 
-	if (port->tx_remaining)
-		qcom_geni_serial_setup_tx(uport, port->tx_remaining);
 
-	if (locked)
+	if (locked) {
+		if (port->tx_remaining)
+			qcom_geni_serial_setup_tx(uport, port->tx_remaining);
 		uart_port_unlock_irqrestore(uport, flags);
+	}
 }
 
 static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
-- 
2.43.0.275.g3460e3d667-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition
  2024-01-12 23:03 ` Douglas Anderson
@ 2024-01-12 23:21   ` Konrad Dybcio
  -1 siblings, 0 replies; 10+ messages in thread
From: Konrad Dybcio @ 2024-01-12 23:21 UTC (permalink / raw)
  To: Douglas Anderson, Bjorn Andersson, Greg Kroah-Hartman
  Cc: linux-arm-kernel, Stephen Boyd, linux-serial, linux-arm-msm,
	Jiri Slaby, linux-kernel

On 13.01.2024 00:03, Douglas Anderson wrote:
> According to the docs I have, bit 21 of the status register is
> asserted when the FIFO is _not_ empty. Add the definition.
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Looks like.

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition
@ 2024-01-12 23:21   ` Konrad Dybcio
  0 siblings, 0 replies; 10+ messages in thread
From: Konrad Dybcio @ 2024-01-12 23:21 UTC (permalink / raw)
  To: Douglas Anderson, Bjorn Andersson, Greg Kroah-Hartman
  Cc: linux-arm-kernel, Stephen Boyd, linux-serial, linux-arm-msm,
	Jiri Slaby, linux-kernel

On 13.01.2024 00:03, Douglas Anderson wrote:
> According to the docs I have, bit 21 of the status register is
> asserted when the FIFO is _not_ empty. Add the definition.
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Looks like.

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition
  2024-01-12 23:03 ` Douglas Anderson
@ 2024-01-28  3:10   ` Bjorn Andersson
  -1 siblings, 0 replies; 10+ messages in thread
From: Bjorn Andersson @ 2024-01-28  3:10 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Konrad Dybcio, Greg Kroah-Hartman, linux-arm-kernel,
	Stephen Boyd, linux-serial, linux-arm-msm, Jiri Slaby,
	linux-kernel

On Fri, Jan 12, 2024 at 03:03:07PM -0800, Douglas Anderson wrote:
> According to the docs I have, bit 21 of the status register is
> asserted when the FIFO is _not_ empty. Add the definition.
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Acked-by: Bjorn Andersson <andersson@kernel.org>

Regards,
Bjorn

> ---
> 
>  include/linux/soc/qcom/geni-se.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/soc/qcom/geni-se.h b/include/linux/soc/qcom/geni-se.h
> index 29e06905bc1f..0f038a1a0330 100644
> --- a/include/linux/soc/qcom/geni-se.h
> +++ b/include/linux/soc/qcom/geni-se.h
> @@ -178,6 +178,7 @@ struct geni_se {
>  #define M_GP_IRQ_3_EN			BIT(12)
>  #define M_GP_IRQ_4_EN			BIT(13)
>  #define M_GP_IRQ_5_EN			BIT(14)
> +#define M_TX_FIFO_NOT_EMPTY_EN		BIT(21)
>  #define M_IO_DATA_DEASSERT_EN		BIT(22)
>  #define M_IO_DATA_ASSERT_EN		BIT(23)
>  #define M_RX_FIFO_RD_ERR_EN		BIT(24)
> -- 
> 2.43.0.275.g3460e3d667-goog
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition
@ 2024-01-28  3:10   ` Bjorn Andersson
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Andersson @ 2024-01-28  3:10 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Konrad Dybcio, Greg Kroah-Hartman, linux-arm-kernel,
	Stephen Boyd, linux-serial, linux-arm-msm, Jiri Slaby,
	linux-kernel

On Fri, Jan 12, 2024 at 03:03:07PM -0800, Douglas Anderson wrote:
> According to the docs I have, bit 21 of the status register is
> asserted when the FIFO is _not_ empty. Add the definition.
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Acked-by: Bjorn Andersson <andersson@kernel.org>

Regards,
Bjorn

> ---
> 
>  include/linux/soc/qcom/geni-se.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/soc/qcom/geni-se.h b/include/linux/soc/qcom/geni-se.h
> index 29e06905bc1f..0f038a1a0330 100644
> --- a/include/linux/soc/qcom/geni-se.h
> +++ b/include/linux/soc/qcom/geni-se.h
> @@ -178,6 +178,7 @@ struct geni_se {
>  #define M_GP_IRQ_3_EN			BIT(12)
>  #define M_GP_IRQ_4_EN			BIT(13)
>  #define M_GP_IRQ_5_EN			BIT(14)
> +#define M_TX_FIFO_NOT_EMPTY_EN		BIT(21)
>  #define M_IO_DATA_DEASSERT_EN		BIT(22)
>  #define M_IO_DATA_ASSERT_EN		BIT(23)
>  #define M_RX_FIFO_RD_ERR_EN		BIT(24)
> -- 
> 2.43.0.275.g3460e3d667-goog
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock
  2024-01-12 23:03   ` Douglas Anderson
@ 2024-01-28  3:10     ` Bjorn Andersson
  -1 siblings, 0 replies; 10+ messages in thread
From: Bjorn Andersson @ 2024-01-28  3:10 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Konrad Dybcio, Greg Kroah-Hartman, linux-arm-kernel,
	Stephen Boyd, linux-serial, linux-arm-msm, Jiri Slaby,
	linux-kernel

On Fri, Jan 12, 2024 at 03:03:08PM -0800, Douglas Anderson wrote:
> As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and
> IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
> enabled then we'll use it to stop CPUs at panic time. This is nice,
> but it does mean that there's a pretty good chance that we'll end up
> stopping a CPU while it holds the port lock for the console
> UART. Specifically, I see a CPU get stopped while holding the port
> lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
> enabling the "buddy" hardlockup detector and then doing:
> 
>   sysctl -w kernel.hardlockup_all_cpu_backtrace=1
>   sysctl -w kernel.hardlockup_panic=1
>   echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> 
> UART drivers are _supposed_ to handle this case OK and this is why
> UART drivers check "oops_in_progress" and only do a "trylock" in that
> case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
> very well-tested situation.
> 
> Now that we're testing the situation a lot, it can be seen that the
> Qualcomm GENI UART driver is pretty broken. Specifically, when I run
> my test case and look at the console output I just see a bunch of
> garbled output like:
> 
>   [  201.069084] NMI backtrace[  201.069084] NM[  201.069087] CPU: 6
>   PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
>   #1 01112b9f14923cbd0b[  201.069090] Hardware name: Google Lazor
>   ([  201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
>   201.069095] pc : smp_call_function_man[  201.069099]
> 
> That's obviously not so great. This happens because each call to the
> console driver exits after the data has been written to the FIFO but
> before it's actually been flushed out of the serial port. When we have
> multiple calls into the console one after the other then (if we can't
> get the lock) each call tells the UART to throw away any data in the
> FIFO that hadn't been transferred yet.
> 
> I've posted up a patch to change the arm64 core to avoid this
> situation most of the time [1] much like x86 seems to do, but even if
> that patch lands the GENI driver should still be fixed.
> 
> From testing, it appears that we can just delete the cancel/abort in
> the case where we weren't able to get the UART lock and the output
> looks good. It makes sense that we'd be able to do this since that
> means we'll just call into __qcom_geni_serial_console_write() and
> __qcom_geni_serial_console_write() looks much like
> qcom_geni_serial_poll_put_char() but with a loop. However, it seems
> safest to poll the FIFO and make sure it's empty before our
> transfer. This should reliably make sure that we're not
> interrupting/clobbering any existing transfers.
> 
> As part of this change, we'll also avoid re-setting up a TX at the end
> of the console write function if we weren't able to get the lock,
> since accessing "port->tx_remaining" without the lock is not
> safe. This is only needed to re-start userspace initiated transfers.
> 
> [1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Reviewed-by: Bjorn Andersson <andersson@kernel.org>

Regards,
Bjorn

> ---
> 
>  drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 7e78f97e8f43..06ebe62f99bc 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>  
>  	geni_status = readl(uport->membase + SE_GENI_STATUS);
>  
> -	/* Cancel the current write to log the fault */
>  	if (!locked) {
> -		geni_se_cancel_m_cmd(&port->se);
> -		if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> -						M_CMD_CANCEL_EN, true)) {
> -			geni_se_abort_m_cmd(&port->se);
> -			qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> -							M_CMD_ABORT_EN, true);
> -			writel(M_CMD_ABORT_EN, uport->membase +
> -							SE_GENI_M_IRQ_CLEAR);
> -		}
> -		writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> +		/*
> +		 * We can only get here if an oops is in progress then we were
> +		 * unable to get the lock. This means we can't safely access
> +		 * our state variables like tx_remaining. About the best we
> +		 * can do is wait for the FIFO to be empty before we start our
> +		 * transfer, so we'll do that.
> +		 */
> +		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> +					  M_TX_FIFO_NOT_EMPTY_EN, false);
>  	} else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
>  		/*
>  		 * It seems we can't interrupt existing transfers if all data
> @@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>  
>  	__qcom_geni_serial_console_write(uport, s, count);
>  
> -	if (port->tx_remaining)
> -		qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>  
> -	if (locked)
> +	if (locked) {
> +		if (port->tx_remaining)
> +			qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>  		uart_port_unlock_irqrestore(uport, flags);
> +	}
>  }
>  
>  static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
> -- 
> 2.43.0.275.g3460e3d667-goog
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock
@ 2024-01-28  3:10     ` Bjorn Andersson
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Andersson @ 2024-01-28  3:10 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Konrad Dybcio, Greg Kroah-Hartman, linux-arm-kernel,
	Stephen Boyd, linux-serial, linux-arm-msm, Jiri Slaby,
	linux-kernel

On Fri, Jan 12, 2024 at 03:03:08PM -0800, Douglas Anderson wrote:
> As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and
> IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
> enabled then we'll use it to stop CPUs at panic time. This is nice,
> but it does mean that there's a pretty good chance that we'll end up
> stopping a CPU while it holds the port lock for the console
> UART. Specifically, I see a CPU get stopped while holding the port
> lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
> enabling the "buddy" hardlockup detector and then doing:
> 
>   sysctl -w kernel.hardlockup_all_cpu_backtrace=1
>   sysctl -w kernel.hardlockup_panic=1
>   echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> 
> UART drivers are _supposed_ to handle this case OK and this is why
> UART drivers check "oops_in_progress" and only do a "trylock" in that
> case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
> very well-tested situation.
> 
> Now that we're testing the situation a lot, it can be seen that the
> Qualcomm GENI UART driver is pretty broken. Specifically, when I run
> my test case and look at the console output I just see a bunch of
> garbled output like:
> 
>   [  201.069084] NMI backtrace[  201.069084] NM[  201.069087] CPU: 6
>   PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
>   #1 01112b9f14923cbd0b[  201.069090] Hardware name: Google Lazor
>   ([  201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
>   201.069095] pc : smp_call_function_man[  201.069099]
> 
> That's obviously not so great. This happens because each call to the
> console driver exits after the data has been written to the FIFO but
> before it's actually been flushed out of the serial port. When we have
> multiple calls into the console one after the other then (if we can't
> get the lock) each call tells the UART to throw away any data in the
> FIFO that hadn't been transferred yet.
> 
> I've posted up a patch to change the arm64 core to avoid this
> situation most of the time [1] much like x86 seems to do, but even if
> that patch lands the GENI driver should still be fixed.
> 
> From testing, it appears that we can just delete the cancel/abort in
> the case where we weren't able to get the UART lock and the output
> looks good. It makes sense that we'd be able to do this since that
> means we'll just call into __qcom_geni_serial_console_write() and
> __qcom_geni_serial_console_write() looks much like
> qcom_geni_serial_poll_put_char() but with a loop. However, it seems
> safest to poll the FIFO and make sure it's empty before our
> transfer. This should reliably make sure that we're not
> interrupting/clobbering any existing transfers.
> 
> As part of this change, we'll also avoid re-setting up a TX at the end
> of the console write function if we weren't able to get the lock,
> since accessing "port->tx_remaining" without the lock is not
> safe. This is only needed to re-start userspace initiated transfers.
> 
> [1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>

Reviewed-by: Bjorn Andersson <andersson@kernel.org>

Regards,
Bjorn

> ---
> 
>  drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 7e78f97e8f43..06ebe62f99bc 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>  
>  	geni_status = readl(uport->membase + SE_GENI_STATUS);
>  
> -	/* Cancel the current write to log the fault */
>  	if (!locked) {
> -		geni_se_cancel_m_cmd(&port->se);
> -		if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> -						M_CMD_CANCEL_EN, true)) {
> -			geni_se_abort_m_cmd(&port->se);
> -			qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> -							M_CMD_ABORT_EN, true);
> -			writel(M_CMD_ABORT_EN, uport->membase +
> -							SE_GENI_M_IRQ_CLEAR);
> -		}
> -		writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> +		/*
> +		 * We can only get here if an oops is in progress then we were
> +		 * unable to get the lock. This means we can't safely access
> +		 * our state variables like tx_remaining. About the best we
> +		 * can do is wait for the FIFO to be empty before we start our
> +		 * transfer, so we'll do that.
> +		 */
> +		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> +					  M_TX_FIFO_NOT_EMPTY_EN, false);
>  	} else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
>  		/*
>  		 * It seems we can't interrupt existing transfers if all data
> @@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>  
>  	__qcom_geni_serial_console_write(uport, s, count);
>  
> -	if (port->tx_remaining)
> -		qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>  
> -	if (locked)
> +	if (locked) {
> +		if (port->tx_remaining)
> +			qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>  		uart_port_unlock_irqrestore(uport, flags);
> +	}
>  }
>  
>  static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
> -- 
> 2.43.0.275.g3460e3d667-goog
> 

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-01-28  3:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-12 23:03 [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition Douglas Anderson
2024-01-12 23:03 ` Douglas Anderson
2024-01-12 23:03 ` [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock Douglas Anderson
2024-01-12 23:03   ` Douglas Anderson
2024-01-28  3:10   ` Bjorn Andersson
2024-01-28  3:10     ` Bjorn Andersson
2024-01-12 23:21 ` [PATCH 1/2] soc: qcom: geni-se: Add M_TX_FIFO_NOT_EMPTY bit definition Konrad Dybcio
2024-01-12 23:21   ` Konrad Dybcio
2024-01-28  3:10 ` Bjorn Andersson
2024-01-28  3:10   ` Bjorn Andersson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.