regressions.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
@ 2023-03-14 10:20 Hans de Goede
  2023-03-14 10:55 ` Ilpo Järvinen
  0 siblings, 1 reply; 10+ messages in thread
From: Hans de Goede @ 2023-03-14 10:20 UTC (permalink / raw)
  To: Ilpo Järvinen, Greg Kroah-Hartman, Jiri Slaby
  Cc: open list:SERIAL DRIVERS, regressions

Hi Ilpo,

I have spend the last couple of days debugging a problem with Bluetooth
adapters (HCIs) connected over an UART connection on Intel Bay Trail
and Cherry Trail devices.

After much debugging I found out that sometimes the first byte of
a received packet (data[0]) would be overwritten with a 0 byte.

E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:

04 0e 0a 01 79 fc 00 54 fe ff ff 00 00

would sometimes turn into:

00 0e 0a 01 79 fc 00 54 fe ff ff 00 00

Further investigation revealed that this goes away if I stop
the dw_dmac module from loading, leading to:
"dw-apb-uart 80860F0A:00: failed to request DMA"
and the UART working without DMA support.

Testing various kernels showed that this problem was introduced
in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:

e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")

And reverting that on top of v6.3-rc2 indeed solves the problem.

So it seems that that commit somehow interferes with DMA based
data receiving, causing the first byte of a received data transfer
to get replaced by 0.

The issue has been seen on and the revert has been tested on
the following HW:

Asus T100TA
SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3

Lenovo Yoga Tablet 2 1051L
SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3

Lenovo Yoga Book X91F
Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2

I have more hw which I believe is affected but these are the models
where I have done detailed testing.

I would be happy to test any patches, or run a kernel with some extra
debugging added, just let me know what you need to help fixing this.

Regards,

Hans


p.s.

I believe that these changes were made to improve the timing of
disabling the transmitter on tx completion when the UART is used
for a RS485 bus. So one option to workaround this might be to only
enable the new behavior when RS485 mode is used ?



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 10:20 [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx Hans de Goede
@ 2023-03-14 10:55 ` Ilpo Järvinen
  2023-03-14 11:11   ` Hans de Goede
  0 siblings, 1 reply; 10+ messages in thread
From: Ilpo Järvinen @ 2023-03-14 10:55 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

On Tue, 14 Mar 2023, Hans de Goede wrote:

> Hi Ilpo,

Hi,

Thanks for the report.

> I have spend the last couple of days debugging a problem with Bluetooth
> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> and Cherry Trail devices.
> 
> After much debugging I found out that sometimes the first byte of
> a received packet (data[0]) would be overwritten with a 0 byte.
> 
> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> 
> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> 
> would sometimes turn into:
> 
> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>
> Further investigation revealed that this goes away if I stop
> the dw_dmac module from loading, leading to:
> "dw-apb-uart 80860F0A:00: failed to request DMA"
> and the UART working without DMA support.
> 
> Testing various kernels showed that this problem was introduced
> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> 
> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> 
> And reverting that on top of v6.3-rc2 indeed solves the problem.

You did something else too than just that because you cannot cleanly 
revert just e8ffbb71f783. Please indicate what happened to:
  f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")

I guess you reverted that too and forgot to mention about it but I just 
want to be sure we're on the same page?

> So it seems that that commit somehow interferes with DMA based
> data receiving, causing the first byte of a received data transfer
> to get replaced by 0.

Okay, and you're sure the problem/corruption occurs on the receiving side? 

Maybe the the extra interrupt that the tx side change will trigger somehow 
causes the confusion on the rx side. So that would be an extra call into 
handle_rx_dma() that could either do an extra flush or start DMA Rx that 
would not occur w/o that tx side change.

> The issue has been seen on and the revert has been tested on
> the following HW:
> 
> Asus T100TA
> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> 
> Lenovo Yoga Tablet 2 1051L
> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> 
> Lenovo Yoga Book X91F
> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> 
> I have more hw which I believe is affected but these are the models
> where I have done detailed testing.
> 
> I would be happy to test any patches, or run a kernel with some extra
> debugging added, just let me know what you need to help fixing this.

How easy this is to trigger in general? (Mainly trying to gauge how 
easy it will be to find the read and/or the irq that related to the 
corrupted payload).

> Regards,
> 
> Hans
> 
> 
> p.s.
> 
> I believe that these changes were made to improve the timing of
> disabling the transmitter on tx completion when the UART is used
> for a RS485 bus. So one option to workaround this might be to only
> enable the new behavior when RS485 mode is used ?

That change also removed what IMO was a too early serial8250_rpm_put_tx()
which is not just RS485 related but yes, the main reason was to have the
half-duplex RS485 behave correctly.


-- 
 i.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 10:55 ` Ilpo Järvinen
@ 2023-03-14 11:11   ` Hans de Goede
  2023-03-14 11:48     ` Ilpo Järvinen
  0 siblings, 1 reply; 10+ messages in thread
From: Hans de Goede @ 2023-03-14 11:11 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

[-- Attachment #1: Type: text/plain, Size: 4280 bytes --]

Hi,

On 3/14/23 11:55, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
> 
>> Hi Ilpo,
> 
> Hi,
> 
> Thanks for the report.
> 
>> I have spend the last couple of days debugging a problem with Bluetooth
>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>> and Cherry Trail devices.
>>
>> After much debugging I found out that sometimes the first byte of
>> a received packet (data[0]) would be overwritten with a 0 byte.
>>
>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>
>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>
>> would sometimes turn into:
>>
>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>
>> Further investigation revealed that this goes away if I stop
>> the dw_dmac module from loading, leading to:
>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>> and the UART working without DMA support.
>>
>> Testing various kernels showed that this problem was introduced
>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>
>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>
>> And reverting that on top of v6.3-rc2 indeed solves the problem.
> 
> You did something else too than just that because you cannot cleanly 
> revert just e8ffbb71f783. Please indicate what happened to:
>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
> 
> I guess you reverted that too and forgot to mention about it but I just 
> want to be sure we're on the same page?

I manually fixed up the revert, effectively
dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
I did not revert f8d6e9d3ca5c in its entirety.

I've attached my manual fixed up revert as a patch here.

>> So it seems that that commit somehow interferes with DMA based
>> data receiving, causing the first byte of a received data transfer
>> to get replaced by 0.
> 
> Okay, and you're sure the problem/corruption occurs on the receiving side? 

What I am sure is that the first byte of a packet has been replaced by 0
by the time drivers/tty/tty_buffer.c: receive_buf() gets called.

I did not dive into the serial-port code side of this problem since
I'm unfamiliar with that.

> Maybe the the extra interrupt that the tx side change will trigger somehow 
> causes the confusion on the rx side. So that would be an extra call into 
> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> would not occur w/o that tx side change.

That sounds like a likely candidate for causing this yes, as said
I'm unfamiliar with the serial-port code, but I did already suspect
that the change was causing some extra interrupt which somehow
interfered with the rx side.

>> The issue has been seen on and the revert has been tested on
>> the following HW:
>>
>> Asus T100TA
>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>
>> Lenovo Yoga Tablet 2 1051L
>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>
>> Lenovo Yoga Book X91F
>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>
>> I have more hw which I believe is affected but these are the models
>> where I have done detailed testing.
>>
>> I would be happy to test any patches, or run a kernel with some extra
>> debugging added, just let me know what you need to help fixing this.
> 
> How easy this is to trigger in general? (Mainly trying to gauge how 
> easy it will be to find the read and/or the irq that related to the 
> corrupted payload).

For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
calls, re-initializing and re-uploading the BT firmware 10 times and
then looked for "Frame reassembly failed" errors (+ more error caused
by this error) from the drivers/bluetooth/hci_bcm.c code.

This would usually trigger at least twice with the 10 rmmod + modprobe
calls and typically more often then twice.

With the revert (and with 5.18.y) I never see a single error even
with more rmmod + modprobe calls. Note it is not just the driver's
probe() which fails, sometimes there are also later errors, this
is just a somewhat convenient way to reproduce.

So this is somewhat easy to trigger, not trivial to trigger,
but it also does not take waiting many hours before hitting it
once.

Regards,

Hans


[-- Attachment #2: 0001-Revert-serial-8250-use-THRE-__stop_tx-also-with-DMA.patch --]
[-- Type: text/x-patch, Size: 1887 bytes --]

From 5343ba4a6fab8d91021403315d36a175a097c5d8 Mon Sep 17 00:00:00 2001
From: Hans de Goede <hdegoede@redhat.com>
Date: Mon, 13 Mar 2023 22:32:19 +0100
Subject: [PATCH] Revert "serial: 8250: use THRE & __stop_tx also with DMA"

This reverts commit e8ffbb71f783f577b24c25bd4dd1c7119d344924.
---
 drivers/tty/serial/8250/8250_dma.c  | 3 ++-
 drivers/tty/serial/8250/8250_port.c | 9 +++------
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
index 7fa66501792d..80c70a5cfa4c 100644
--- a/drivers/tty/serial/8250/8250_dma.c
+++ b/drivers/tty/serial/8250/8250_dma.c
@@ -32,7 +32,7 @@ static void __dma_tx_complete(void *param)
 		uart_write_wakeup(&p->port);
 
 	ret = serial8250_tx_dma(p);
-	if (ret || !dma->tx_running)
+	if (ret)
 		serial8250_set_THRI(p);
 
 	spin_unlock_irqrestore(&p->port.lock, flags);
@@ -104,6 +104,7 @@ int serial8250_tx_dma(struct uart_8250_port *p)
 
 	if (uart_tx_stopped(&p->port) || uart_circ_empty(xmit)) {
 		/* We have been called from __dma_tx_complete() */
+		serial8250_rpm_put_tx(p);
 		return 0;
 	}
 
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index fa43df05342b..4e3605a494e8 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1950,12 +1950,9 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 			status = serial8250_rx_chars(up, status);
 	}
 	serial8250_modem_status(up);
-	if ((status & UART_LSR_THRE) && (up->ier & UART_IER_THRI)) {
-		if (!up->dma || up->dma->tx_err)
-			serial8250_tx_chars(up);
-		else if (!up->dma->tx_running)
-			__stop_tx(up);
-	}
+	if ((!up->dma || up->dma->tx_err) && (status & UART_LSR_THRE) &&
+		(up->ier & UART_IER_THRI))
+		serial8250_tx_chars(up);
 
 	uart_unlock_and_check_sysrq_irqrestore(port, flags);
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 11:11   ` Hans de Goede
@ 2023-03-14 11:48     ` Ilpo Järvinen
  2023-03-14 16:04       ` Hans de Goede
  0 siblings, 1 reply; 10+ messages in thread
From: Ilpo Järvinen @ 2023-03-14 11:48 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

[-- Attachment #1: Type: text/plain, Size: 7584 bytes --]

On Tue, 14 Mar 2023, Hans de Goede wrote:
> On 3/14/23 11:55, Ilpo Järvinen wrote:
> > On Tue, 14 Mar 2023, Hans de Goede wrote:
> > 
> >> I have spend the last couple of days debugging a problem with Bluetooth
> >> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> >> and Cherry Trail devices.
> >>
> >> After much debugging I found out that sometimes the first byte of
> >> a received packet (data[0]) would be overwritten with a 0 byte.
> >>
> >> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> >>
> >> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>
> >> would sometimes turn into:
> >>
> >> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>
> >> Further investigation revealed that this goes away if I stop
> >> the dw_dmac module from loading, leading to:
> >> "dw-apb-uart 80860F0A:00: failed to request DMA"
> >> and the UART working without DMA support.
> >>
> >> Testing various kernels showed that this problem was introduced
> >> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> >>
> >> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> >>
> >> And reverting that on top of v6.3-rc2 indeed solves the problem.
> > 
> > You did something else too than just that because you cannot cleanly 
> > revert just e8ffbb71f783. Please indicate what happened to:
> >   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
> > 
> > I guess you reverted that too and forgot to mention about it but I just 
> > want to be sure we're on the same page?
> 
> I manually fixed up the revert, effectively
> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
> I did not revert f8d6e9d3ca5c in its entirety.
> 
> I've attached my manual fixed up revert as a patch here.
> 
> >> So it seems that that commit somehow interferes with DMA based
> >> data receiving, causing the first byte of a received data transfer
> >> to get replaced by 0.
> > 
> > Okay, and you're sure the problem/corruption occurs on the receiving side? 
> 
> What I am sure is that the first byte of a packet has been replaced by 0
> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
> 
> I did not dive into the serial-port code side of this problem since
> I'm unfamiliar with that.
> 
> > Maybe the the extra interrupt that the tx side change will trigger somehow 
> > causes the confusion on the rx side. So that would be an extra call into 
> > handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> > would not occur w/o that tx side change.
> 
> That sounds like a likely candidate for causing this yes, as said
> I'm unfamiliar with the serial-port code, but I did already suspect
> that the change was causing some extra interrupt which somehow
> interfered with the rx side.
> 
> >> The issue has been seen on and the revert has been tested on
> >> the following HW:
> >>
> >> Asus T100TA
> >> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>
> >> Lenovo Yoga Tablet 2 1051L
> >> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>
> >> Lenovo Yoga Book X91F
> >> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> >>
> >> I have more hw which I believe is affected but these are the models
> >> where I have done detailed testing.
> >>
> >> I would be happy to test any patches, or run a kernel with some extra
> >> debugging added, just let me know what you need to help fixing this.
> > 
> > How easy this is to trigger in general? (Mainly trying to gauge how 
> > easy it will be to find the read and/or the irq that related to the 
> > corrupted payload).
> 
> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
> calls, re-initializing and re-uploading the BT firmware 10 times and
> then looked for "Frame reassembly failed" errors (+ more error caused
> by this error) from the drivers/bluetooth/hci_bcm.c code.
> 
> This would usually trigger at least twice with the 10 rmmod + modprobe
> calls and typically more often then twice.
> 
> With the revert (and with 5.18.y) I never see a single error even
> with more rmmod + modprobe calls. Note it is not just the driver's
> probe() which fails, sometimes there are also later errors, this
> is just a somewhat convenient way to reproduce.
> 
> So this is somewhat easy to trigger, not trivial to trigger,
> but it also does not take waiting many hours before hitting it
> once.

Okay, maybe the patch below will manage to filter the interesting part of 
it (not tested beyond compiling). It should capture a few irq events after 
the THRI is set when DMA Tx completes.

What I'd expect is to have iir value variation between the working and not 
working case. Especially, if it's UART_IIR_THRI that shows up that might 
be usable as a condition to return from handle_rx_dma() without touching
DMA Rx.

From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Subject: [PATCH] DEBUG: DMA Rx problem

A DEBUG PATCH, not intended for upstream inclusing.

---
 drivers/tty/serial/8250/8250_dma.c  |  4 +++-
 drivers/tty/serial/8250/8250_port.c | 13 +++++++++++++
 include/linux/serial_8250.h         |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
index 7fa66501792d..20d54abdded2 100644
--- a/drivers/tty/serial/8250/8250_dma.c
+++ b/drivers/tty/serial/8250/8250_dma.c
@@ -32,8 +32,10 @@ static void __dma_tx_complete(void *param)
 		uart_write_wakeup(&p->port);
 
 	ret = serial8250_tx_dma(p);
-	if (ret || !dma->tx_running)
+	if (ret || !dma->tx_running) {
+		p->irq_debug = -5;
 		serial8250_set_THRI(p);
+	}
 
 	spin_unlock_irqrestore(&p->port.lock, flags);
 }
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index fa43df05342b..193c08b1888a 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1923,6 +1923,9 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 	struct uart_8250_port *up = up_to_u8250p(port);
 	bool skip_rx = false;
 	unsigned long flags;
+	long irq_debug;
+	unsigned int tx_running = 0, rx_running = 0, tx_err = 0;
+	unsigned int ier;
 	u16 status;
 
 	if (iir & UART_IIR_NO_INT)
@@ -1931,6 +1934,12 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 	spin_lock_irqsave(&port->lock, flags);
 
 	status = serial_lsr_in(up);
+	if (up->dma) {
+		rx_running = up->dma->rx_running;
+		tx_running = up->dma->tx_running;
+		tx_err = up->dma->tx_err;
+	}
+	ier = up->ier;
 
 	/*
 	 * If port is stopped and there are no error conditions in the
@@ -1957,7 +1966,11 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 			__stop_tx(up);
 	}
 
+	irq_debug = up->irq_debug++;
 	uart_unlock_and_check_sysrq_irqrestore(port, flags);
+	if (irq_debug < 0)
+		pr_err("8250irq: iir=%02x lsr+saved=%02x ier=%02x dma_t/rx/err=%u/%u/%u skiprx=%d cnt=%ld\n",
+		       iir, status, ier, tx_running, rx_running, tx_err, skip_rx, irq_debug);
 
 	return 1;
 }
diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h
index 741ed4807a9c..8d9d7de6728d 100644
--- a/include/linux/serial_8250.h
+++ b/include/linux/serial_8250.h
@@ -107,6 +107,7 @@ struct uart_8250_port {
 	unsigned char		mcr;
 	unsigned char		cur_iotype;	/* Running I/O type */
 	unsigned int		rpm_tx_active;
+	long			irq_debug;
 	unsigned char		canary;		/* non-zero during system sleep
 						 *   if no_console_suspend
 						 */

-- 
tg: (571079f5ba93..) debug/dma-rx-problem (depends on: tty-next)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 11:48     ` Ilpo Järvinen
@ 2023-03-14 16:04       ` Hans de Goede
  2023-03-14 16:55         ` Ilpo Järvinen
  0 siblings, 1 reply; 10+ messages in thread
From: Hans de Goede @ 2023-03-14 16:04 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

Hi,

On 3/14/23 12:48, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>
>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>> and Cherry Trail devices.
>>>>
>>>> After much debugging I found out that sometimes the first byte of
>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>
>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>
>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>
>>>> would sometimes turn into:
>>>>
>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>
>>>> Further investigation revealed that this goes away if I stop
>>>> the dw_dmac module from loading, leading to:
>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>> and the UART working without DMA support.
>>>>
>>>> Testing various kernels showed that this problem was introduced
>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>
>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>
>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>
>>> You did something else too than just that because you cannot cleanly 
>>> revert just e8ffbb71f783. Please indicate what happened to:
>>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
>>>
>>> I guess you reverted that too and forgot to mention about it but I just 
>>> want to be sure we're on the same page?
>>
>> I manually fixed up the revert, effectively
>> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
>> I did not revert f8d6e9d3ca5c in its entirety.
>>
>> I've attached my manual fixed up revert as a patch here.
>>
>>>> So it seems that that commit somehow interferes with DMA based
>>>> data receiving, causing the first byte of a received data transfer
>>>> to get replaced by 0.
>>>
>>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
>>
>> What I am sure is that the first byte of a packet has been replaced by 0
>> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
>>
>> I did not dive into the serial-port code side of this problem since
>> I'm unfamiliar with that.
>>
>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
>>> causes the confusion on the rx side. So that would be an extra call into 
>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
>>> would not occur w/o that tx side change.
>>
>> That sounds like a likely candidate for causing this yes, as said
>> I'm unfamiliar with the serial-port code, but I did already suspect
>> that the change was causing some extra interrupt which somehow
>> interfered with the rx side.
>>
>>>> The issue has been seen on and the revert has been tested on
>>>> the following HW:
>>>>
>>>> Asus T100TA
>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>
>>>> Lenovo Yoga Tablet 2 1051L
>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>
>>>> Lenovo Yoga Book X91F
>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>
>>>> I have more hw which I believe is affected but these are the models
>>>> where I have done detailed testing.
>>>>
>>>> I would be happy to test any patches, or run a kernel with some extra
>>>> debugging added, just let me know what you need to help fixing this.
>>>
>>> How easy this is to trigger in general? (Mainly trying to gauge how 
>>> easy it will be to find the read and/or the irq that related to the 
>>> corrupted payload).
>>
>> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
>> calls, re-initializing and re-uploading the BT firmware 10 times and
>> then looked for "Frame reassembly failed" errors (+ more error caused
>> by this error) from the drivers/bluetooth/hci_bcm.c code.
>>
>> This would usually trigger at least twice with the 10 rmmod + modprobe
>> calls and typically more often then twice.
>>
>> With the revert (and with 5.18.y) I never see a single error even
>> with more rmmod + modprobe calls. Note it is not just the driver's
>> probe() which fails, sometimes there are also later errors, this
>> is just a somewhat convenient way to reproduce.
>>
>> So this is somewhat easy to trigger, not trivial to trigger,
>> but it also does not take waiting many hours before hitting it
>> once.
> 
> Okay, maybe the patch below will manage to filter the interesting part of 
> it (not tested beyond compiling). It should capture a few irq events after 
> the THRI is set when DMA Tx completes.
> 
> What I'd expect is to have iir value variation between the working and not 
> working case. Especially, if it's UART_IIR_THRI that shows up that might 
> be usable as a condition to return from handle_rx_dma() without touching
> DMA Rx.

Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
the revert and with your dmesg patch.

This is (part of) the HCI init without issues:

[  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
[  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
[  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  734.081492] Bluetooth: hci0: BCM: chip id 101

Here is another *successful* log with slightly different timing.

[  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
[  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
[  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  738.050142] Bluetooth: hci0: BCM: chip id 101

And here is the same part of the init sequence with an error:

[  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
[  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
[  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
[  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
[  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
[  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)

Here is a full dmesg from several rmmod + modprobes, for the errors look for
"Frame reassembly failed".

Note this is from a different boot, I had to add log_buf_len=50M to be able
to grab a full dmesg:

https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue

Regards,

Hans




> 
> From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Subject: [PATCH] DEBUG: DMA Rx problem
> 
> A DEBUG PATCH, not intended for upstream inclusing.
> 
> ---
>  drivers/tty/serial/8250/8250_dma.c  |  4 +++-
>  drivers/tty/serial/8250/8250_port.c | 13 +++++++++++++
>  include/linux/serial_8250.h         |  1 +
>  3 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
> index 7fa66501792d..20d54abdded2 100644
> --- a/drivers/tty/serial/8250/8250_dma.c
> +++ b/drivers/tty/serial/8250/8250_dma.c
> @@ -32,8 +32,10 @@ static void __dma_tx_complete(void *param)
>  		uart_write_wakeup(&p->port);
>  
>  	ret = serial8250_tx_dma(p);
> -	if (ret || !dma->tx_running)
> +	if (ret || !dma->tx_running) {
> +		p->irq_debug = -5;
>  		serial8250_set_THRI(p);
> +	}
>  
>  	spin_unlock_irqrestore(&p->port.lock, flags);
>  }
> diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> index fa43df05342b..193c08b1888a 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1923,6 +1923,9 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
>  	struct uart_8250_port *up = up_to_u8250p(port);
>  	bool skip_rx = false;
>  	unsigned long flags;
> +	long irq_debug;
> +	unsigned int tx_running = 0, rx_running = 0, tx_err = 0;
> +	unsigned int ier;
>  	u16 status;
>  
>  	if (iir & UART_IIR_NO_INT)
> @@ -1931,6 +1934,12 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
>  	spin_lock_irqsave(&port->lock, flags);
>  
>  	status = serial_lsr_in(up);
> +	if (up->dma) {
> +		rx_running = up->dma->rx_running;
> +		tx_running = up->dma->tx_running;
> +		tx_err = up->dma->tx_err;
> +	}
> +	ier = up->ier;
>  
>  	/*
>  	 * If port is stopped and there are no error conditions in the
> @@ -1957,7 +1966,11 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
>  			__stop_tx(up);
>  	}
>  
> +	irq_debug = up->irq_debug++;
>  	uart_unlock_and_check_sysrq_irqrestore(port, flags);
> +	if (irq_debug < 0)
> +		pr_err("8250irq: iir=%02x lsr+saved=%02x ier=%02x dma_t/rx/err=%u/%u/%u skiprx=%d cnt=%ld\n",
> +		       iir, status, ier, tx_running, rx_running, tx_err, skip_rx, irq_debug);
>  
>  	return 1;
>  }
> diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h
> index 741ed4807a9c..8d9d7de6728d 100644
> --- a/include/linux/serial_8250.h
> +++ b/include/linux/serial_8250.h
> @@ -107,6 +107,7 @@ struct uart_8250_port {
>  	unsigned char		mcr;
>  	unsigned char		cur_iotype;	/* Running I/O type */
>  	unsigned int		rpm_tx_active;
> +	long			irq_debug;
>  	unsigned char		canary;		/* non-zero during system sleep
>  						 *   if no_console_suspend
>  						 */
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 16:04       ` Hans de Goede
@ 2023-03-14 16:55         ` Ilpo Järvinen
  2023-03-14 19:02           ` Hans de Goede
  0 siblings, 1 reply; 10+ messages in thread
From: Ilpo Järvinen @ 2023-03-14 16:55 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

[-- Attachment #1: Type: text/plain, Size: 10356 bytes --]

On Tue, 14 Mar 2023, Hans de Goede wrote:
> On 3/14/23 12:48, Ilpo Järvinen wrote:
> > On Tue, 14 Mar 2023, Hans de Goede wrote:
> >> On 3/14/23 11:55, Ilpo Järvinen wrote:
> >>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>
> >>>> I have spend the last couple of days debugging a problem with Bluetooth
> >>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> >>>> and Cherry Trail devices.
> >>>>
> >>>> After much debugging I found out that sometimes the first byte of
> >>>> a received packet (data[0]) would be overwritten with a 0 byte.
> >>>>
> >>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> >>>>
> >>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>
> >>>> would sometimes turn into:
> >>>>
> >>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>
> >>>> Further investigation revealed that this goes away if I stop
> >>>> the dw_dmac module from loading, leading to:
> >>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
> >>>> and the UART working without DMA support.
> >>>>
> >>>> Testing various kernels showed that this problem was introduced
> >>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> >>>>
> >>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> >>>>
> >>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
> >>>
> >>> You did something else too than just that because you cannot cleanly 
> >>> revert just e8ffbb71f783. Please indicate what happened to:
> >>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
> >>>
> >>> I guess you reverted that too and forgot to mention about it but I just 
> >>> want to be sure we're on the same page?
> >>
> >> I manually fixed up the revert, effectively
> >> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
> >> I did not revert f8d6e9d3ca5c in its entirety.
> >>
> >> I've attached my manual fixed up revert as a patch here.
> >>
> >>>> So it seems that that commit somehow interferes with DMA based
> >>>> data receiving, causing the first byte of a received data transfer
> >>>> to get replaced by 0.
> >>>
> >>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
> >>
> >> What I am sure is that the first byte of a packet has been replaced by 0
> >> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
> >>
> >> I did not dive into the serial-port code side of this problem since
> >> I'm unfamiliar with that.
> >>
> >>> Maybe the the extra interrupt that the tx side change will trigger somehow 
> >>> causes the confusion on the rx side. So that would be an extra call into 
> >>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> >>> would not occur w/o that tx side change.
> >>
> >> That sounds like a likely candidate for causing this yes, as said
> >> I'm unfamiliar with the serial-port code, but I did already suspect
> >> that the change was causing some extra interrupt which somehow
> >> interfered with the rx side.
> >>
> >>>> The issue has been seen on and the revert has been tested on
> >>>> the following HW:
> >>>>
> >>>> Asus T100TA
> >>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>
> >>>> Lenovo Yoga Tablet 2 1051L
> >>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>
> >>>> Lenovo Yoga Book X91F
> >>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> >>>>
> >>>> I have more hw which I believe is affected but these are the models
> >>>> where I have done detailed testing.
> >>>>
> >>>> I would be happy to test any patches, or run a kernel with some extra
> >>>> debugging added, just let me know what you need to help fixing this.
> >>>
> >>> How easy this is to trigger in general? (Mainly trying to gauge how 
> >>> easy it will be to find the read and/or the irq that related to the 
> >>> corrupted payload).
> >>
> >> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
> >> calls, re-initializing and re-uploading the BT firmware 10 times and
> >> then looked for "Frame reassembly failed" errors (+ more error caused
> >> by this error) from the drivers/bluetooth/hci_bcm.c code.
> >>
> >> This would usually trigger at least twice with the 10 rmmod + modprobe
> >> calls and typically more often then twice.
> >>
> >> With the revert (and with 5.18.y) I never see a single error even
> >> with more rmmod + modprobe calls. Note it is not just the driver's
> >> probe() which fails, sometimes there are also later errors, this
> >> is just a somewhat convenient way to reproduce.
> >>
> >> So this is somewhat easy to trigger, not trivial to trigger,
> >> but it also does not take waiting many hours before hitting it
> >> once.
> > 
> > Okay, maybe the patch below will manage to filter the interesting part of 
> > it (not tested beyond compiling). It should capture a few irq events after 
> > the THRI is set when DMA Tx completes.
> > 
> > What I'd expect is to have iir value variation between the working and not 
> > working case. Especially, if it's UART_IIR_THRI that shows up that might 
> > be usable as a condition to return from handle_rx_dma() without touching
> > DMA Rx.
> 
> Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
> the revert and with your dmesg patch.
> 
> This is (part of) the HCI init without issues:
> 
> [  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
> [  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
> [  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  734.081492] Bluetooth: hci0: BCM: chip id 101
> 
> Here is another *successful* log with slightly different timing.
> 
> [  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
> [  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
> [  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  738.050142] Bluetooth: hci0: BCM: chip id 101
> 
> And here is the same part of the init sequence with an error:
> 
> [  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
> [  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
> [  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
> [  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
> [  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
> [  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)
> 
> Here is a full dmesg from several rmmod + modprobes, for the errors look for
> "Frame reassembly failed".
> 
> Note this is from a different boot, I had to add log_buf_len=50M to be able
> to grab a full dmesg:
> 
> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue

Thanks. I'll try to come up with another debug patch tomorrow to further 
inspect the payload that will be read after that point.

In the meantime you could try if the following patch is enough 
to solve the issue.


From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt

Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
connection failed due corrupted Rx payload. The problem was narrowed
down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
despite LSR having DR bit set, which is precondition for attempting to 
start DMA Rx.

This problem became apparent after e8ffbb71f783 ("serial: 8250: use
THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
using THRI interrupt.

Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
interrupt which has Rx related IIR value.

Reported-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

---
 drivers/tty/serial/8250/8250_port.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index fa43df05342b..a3ef2f42a06b 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1903,6 +1903,8 @@ EXPORT_SYMBOL_GPL(serial8250_modem_status);
 static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
 {
 	switch (iir & 0x3f) {
+	case UART_IIR_THRI:
+		return false;
 	case UART_IIR_RDI:
 		if (!up->dma->rx_running)
 			break;

-- 
tg: (571079f5ba93..) fix/dma-rx-startup (depends on: tty-next)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 16:55         ` Ilpo Järvinen
@ 2023-03-14 19:02           ` Hans de Goede
  2023-03-15 14:47             ` Ilpo Järvinen
  0 siblings, 1 reply; 10+ messages in thread
From: Hans de Goede @ 2023-03-14 19:02 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

Hi,

On 3/14/23 17:55, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 12:48, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>>>
>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>>>> and Cherry Trail devices.
>>>>>>
>>>>>> After much debugging I found out that sometimes the first byte of
>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>>>
>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>>>
>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> would sometimes turn into:
>>>>>>
>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> Further investigation revealed that this goes away if I stop
>>>>>> the dw_dmac module from loading, leading to:
>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>>>> and the UART working without DMA support.
>>>>>>
>>>>>> Testing various kernels showed that this problem was introduced
>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>>>
>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>>>
>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>>>
>>>>> You did something else too than just that because you cannot cleanly 
>>>>> revert just e8ffbb71f783. Please indicate what happened to:
>>>>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
>>>>>
>>>>> I guess you reverted that too and forgot to mention about it but I just 
>>>>> want to be sure we're on the same page?
>>>>
>>>> I manually fixed up the revert, effectively
>>>> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
>>>> I did not revert f8d6e9d3ca5c in its entirety.
>>>>
>>>> I've attached my manual fixed up revert as a patch here.
>>>>
>>>>>> So it seems that that commit somehow interferes with DMA based
>>>>>> data receiving, causing the first byte of a received data transfer
>>>>>> to get replaced by 0.
>>>>>
>>>>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
>>>>
>>>> What I am sure is that the first byte of a packet has been replaced by 0
>>>> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
>>>>
>>>> I did not dive into the serial-port code side of this problem since
>>>> I'm unfamiliar with that.
>>>>
>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
>>>>> causes the confusion on the rx side. So that would be an extra call into 
>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
>>>>> would not occur w/o that tx side change.
>>>>
>>>> That sounds like a likely candidate for causing this yes, as said
>>>> I'm unfamiliar with the serial-port code, but I did already suspect
>>>> that the change was causing some extra interrupt which somehow
>>>> interfered with the rx side.
>>>>
>>>>>> The issue has been seen on and the revert has been tested on
>>>>>> the following HW:
>>>>>>
>>>>>> Asus T100TA
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Tablet 2 1051L
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Book X91F
>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>>>
>>>>>> I have more hw which I believe is affected but these are the models
>>>>>> where I have done detailed testing.
>>>>>>
>>>>>> I would be happy to test any patches, or run a kernel with some extra
>>>>>> debugging added, just let me know what you need to help fixing this.
>>>>>
>>>>> How easy this is to trigger in general? (Mainly trying to gauge how 
>>>>> easy it will be to find the read and/or the irq that related to the 
>>>>> corrupted payload).
>>>>
>>>> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
>>>> calls, re-initializing and re-uploading the BT firmware 10 times and
>>>> then looked for "Frame reassembly failed" errors (+ more error caused
>>>> by this error) from the drivers/bluetooth/hci_bcm.c code.
>>>>
>>>> This would usually trigger at least twice with the 10 rmmod + modprobe
>>>> calls and typically more often then twice.
>>>>
>>>> With the revert (and with 5.18.y) I never see a single error even
>>>> with more rmmod + modprobe calls. Note it is not just the driver's
>>>> probe() which fails, sometimes there are also later errors, this
>>>> is just a somewhat convenient way to reproduce.
>>>>
>>>> So this is somewhat easy to trigger, not trivial to trigger,
>>>> but it also does not take waiting many hours before hitting it
>>>> once.
>>>
>>> Okay, maybe the patch below will manage to filter the interesting part of 
>>> it (not tested beyond compiling). It should capture a few irq events after 
>>> the THRI is set when DMA Tx completes.
>>>
>>> What I'd expect is to have iir value variation between the working and not 
>>> working case. Especially, if it's UART_IIR_THRI that shows up that might 
>>> be usable as a condition to return from handle_rx_dma() without touching
>>> DMA Rx.
>>
>> Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
>> the revert and with your dmesg patch.
>>
>> This is (part of) the HCI init without issues:
>>
>> [  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  734.081492] Bluetooth: hci0: BCM: chip id 101
>>
>> Here is another *successful* log with slightly different timing.
>>
>> [  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  738.050142] Bluetooth: hci0: BCM: chip id 101
>>
>> And here is the same part of the init sequence with an error:
>>
>> [  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
>> [  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)
>>
>> Here is a full dmesg from several rmmod + modprobes, for the errors look for
>> "Frame reassembly failed".
>>
>> Note this is from a different boot, I had to add log_buf_len=50M to be able
>> to grab a full dmesg:
>>
>> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue
> 
> Thanks. I'll try to come up with another debug patch tomorrow to further 
> inspect the payload that will be read after that point.
> 
> In the meantime you could try if the following patch is enough 
> to solve the issue.
> 
> 
> From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
> 
> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> connection failed due corrupted Rx payload. The problem was narrowed
> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
> despite LSR having DR bit set, which is precondition for attempting to 
> start DMA Rx.
> 
> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> using THRI interrupt.
> 
> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> interrupt which has Rx related IIR value.
> 
> Reported-by: Hans de Goede <hdegoede@redhat.com>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

I can confirm that this fixes things for me:

Tested-by: Hans de Goede <hdegoede@redhat.com>

Regards,

Hans



> 
> ---
>  drivers/tty/serial/8250/8250_port.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> index fa43df05342b..a3ef2f42a06b 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1903,6 +1903,8 @@ EXPORT_SYMBOL_GPL(serial8250_modem_status);
>  static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
>  {
>  	switch (iir & 0x3f) {
> +	case UART_IIR_THRI:
> +		return false;
>  	case UART_IIR_RDI:
>  		if (!up->dma->rx_running)
>  			break;
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-14 19:02           ` Hans de Goede
@ 2023-03-15 14:47             ` Ilpo Järvinen
  2023-03-16 20:42               ` Hans de Goede
  0 siblings, 1 reply; 10+ messages in thread
From: Ilpo Järvinen @ 2023-03-15 14:47 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

[-- Attachment #1: Type: text/plain, Size: 9175 bytes --]

On Tue, 14 Mar 2023, Hans de Goede wrote:
> On 3/14/23 17:55, Ilpo Järvinen wrote:
> > On Tue, 14 Mar 2023, Hans de Goede wrote:
> >> On 3/14/23 12:48, Ilpo Järvinen wrote:
> >>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
> >>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>>>
> >>>>>> I have spend the last couple of days debugging a problem with Bluetooth
> >>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> >>>>>> and Cherry Trail devices.
> >>>>>>
> >>>>>> After much debugging I found out that sometimes the first byte of
> >>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
> >>>>>>
> >>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> >>>>>>
> >>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>>>
> >>>>>> would sometimes turn into:
> >>>>>>
> >>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>>>
> >>>>>> Further investigation revealed that this goes away if I stop
> >>>>>> the dw_dmac module from loading, leading to:
> >>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
> >>>>>> and the UART working without DMA support.
> >>>>>>
> >>>>>> Testing various kernels showed that this problem was introduced
> >>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> >>>>>>
> >>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> >>>>>>
> >>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
> >>>>>
> >>>>>> So it seems that that commit somehow interferes with DMA based
> >>>>
> >>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
> >>>>> causes the confusion on the rx side. So that would be an extra call into 
> >>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> >>>>> would not occur w/o that tx side change.
> >>>>
> >>>> That sounds like a likely candidate for causing this yes, as said
> >>>> I'm unfamiliar with the serial-port code, but I did already suspect
> >>>> that the change was causing some extra interrupt which somehow
> >>>> interfered with the rx side.
> >>>>
> >>>>>> The issue has been seen on and the revert has been tested on
> >>>>>> the following HW:
> >>>>>>
> >>>>>> Asus T100TA
> >>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>>>
> >>>>>> Lenovo Yoga Tablet 2 1051L
> >>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>>>
> >>>>>> Lenovo Yoga Book X91F
> >>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> >>>>>>
> >>>>>> I have more hw which I believe is affected but these are the models
> >>>>>> where I have done detailed testing.
> >>>>>>
> >>>>>> I would be happy to test any patches, or run a kernel with some extra
> >>>>>> debugging added, just let me know what you need to help fixing this.

> > From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> > Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
> > 
> > Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> > connection failed due corrupted Rx payload. The problem was narrowed
> > down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
> > despite LSR having DR bit set, which is precondition for attempting to 
> > start DMA Rx.
> > 
> > This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> > THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> > using THRI interrupt.
> > 
> > Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> > interrupt which has Rx related IIR value.
> > 
> > Reported-by: Hans de Goede <hdegoede@redhat.com>
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> 
> I can confirm that this fixes things for me:
> 
> Tested-by: Hans de Goede <hdegoede@redhat.com>

Okay, thanks for testing.

Here's is a slightly improved debug patch which will count how many 
characters are received by DMA and non-DMA rx. It should be tested 
WITHOUT the fix.

I'm mostly interested in knowing if it's purely DMA Rx issue or whether 
the non-DMA rx is muddling the waters too. While investigating other 
issues I've seen UART_IIR_TIMEOUT (inter-character timeout) to often come 
so early from UART that the tail of characters is left there to be handled 
by the non-DMA path.


From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Subject: [PATCH] DEBUG: Dma rx-problem

A DEBUG PATCH, not intended for upstream inclusing.

---
 drivers/tty/serial/8250/8250_dma.c  | 17 ++++++++++++-----
 drivers/tty/serial/8250/8250_port.c | 19 +++++++++++++++++++
 include/linux/serial_8250.h         |  3 +++
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/drivers/tty/serial/8250/8250_dma.c b/drivers/tty/serial/8250/8250_dma.c
index 7fa66501792d..705c26941429 100644
--- a/drivers/tty/serial/8250/8250_dma.c
+++ b/drivers/tty/serial/8250/8250_dma.c
@@ -32,13 +32,15 @@ static void __dma_tx_complete(void *param)
 		uart_write_wakeup(&p->port);
 
 	ret = serial8250_tx_dma(p);
-	if (ret || !dma->tx_running)
+	if (ret || !dma->tx_running) {
+		p->irq_debug = -5;
 		serial8250_set_THRI(p);
+	}
 
 	spin_unlock_irqrestore(&p->port.lock, flags);
 }
 
-static void __dma_rx_complete(struct uart_8250_port *p)
+static int __dma_rx_complete(struct uart_8250_port *p)
 {
 	struct uart_8250_dma	*dma = p->dma;
 	struct tty_port		*tty_port = &p->port.state->port;
@@ -53,7 +55,7 @@ static void __dma_rx_complete(struct uart_8250_port *p)
 	 */
 	dma_status = dmaengine_tx_status(dma->rxchan, dma->rx_cookie, &state);
 	if (dma_status == DMA_IN_PROGRESS)
-		return;
+		return -1;
 
 	count = dma->rx_size - state.residue;
 
@@ -62,6 +64,8 @@ static void __dma_rx_complete(struct uart_8250_port *p)
 	dma->rx_running = 0;
 
 	tty_flip_buffer_push(tty_port);
+
+	return count;
 }
 
 static void dma_rx_complete(void *param)
@@ -69,10 +73,11 @@ static void dma_rx_complete(void *param)
 	struct uart_8250_port *p = param;
 	struct uart_8250_dma *dma = p->dma;
 	unsigned long flags;
+	int count = -1;
 
 	spin_lock_irqsave(&p->port.lock, flags);
 	if (dma->rx_running)
-		__dma_rx_complete(p);
+		count = __dma_rx_complete(p);
 
 	/*
 	 * Cannot be combined with the previous check because __dma_rx_complete()
@@ -81,6 +86,8 @@ static void dma_rx_complete(void *param)
 	if (!dma->rx_running && (serial_lsr_in(p) & UART_LSR_DR))
 		p->dma->rx_dma(p);
 	spin_unlock_irqrestore(&p->port.lock, flags);
+
+	pr_err("8250dma: rx complete %d\n", count);
 }
 
 int serial8250_tx_dma(struct uart_8250_port *p)
@@ -172,7 +179,7 @@ void serial8250_rx_dma_flush(struct uart_8250_port *p)
 
 	if (dma->rx_running) {
 		dmaengine_pause(dma->rxchan);
-		__dma_rx_complete(p);
+		p->dma_rx_count = __dma_rx_complete(p);
 		dmaengine_terminate_async(dma->rxchan);
 	}
 }
diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
index fa43df05342b..0c950ff7a8ee 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1811,6 +1811,7 @@ u16 serial8250_rx_chars(struct uart_8250_port *up, u16 lsr)
 	} while (lsr & (UART_LSR_DR | UART_LSR_BI));
 
 	tty_flip_buffer_push(&port->state->port);
+	up->rx_chars_count = 256 - max_count;
 	return lsr;
 }
 EXPORT_SYMBOL_GPL(serial8250_rx_chars);
@@ -1923,6 +1924,10 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 	struct uart_8250_port *up = up_to_u8250p(port);
 	bool skip_rx = false;
 	unsigned long flags;
+	long irq_debug;
+	unsigned int tx_running = 0, rx_running = 0, tx_err = 0;
+	unsigned int ier;
+	int rx_count, dma_count;
 	u16 status;
 
 	if (iir & UART_IIR_NO_INT)
@@ -1931,6 +1936,14 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 	spin_lock_irqsave(&port->lock, flags);
 
 	status = serial_lsr_in(up);
+	up->dma_rx_count = 0;
+	up->rx_chars_count = 0;
+	if (up->dma) {
+		rx_running = up->dma->rx_running;
+		tx_running = up->dma->tx_running;
+		tx_err = up->dma->tx_err;
+	}
+	ier = up->ier;
 
 	/*
 	 * If port is stopped and there are no error conditions in the
@@ -1957,7 +1970,13 @@ int serial8250_handle_irq(struct uart_port *port, unsigned int iir)
 			__stop_tx(up);
 	}
 
+	irq_debug = up->irq_debug++;
+	rx_count = up->rx_chars_count;
+	dma_count = up->dma_rx_count;
 	uart_unlock_and_check_sysrq_irqrestore(port, flags);
+	if (irq_debug < 0)
+		pr_err("8250irq: iir=%02x lsr+saved=%02x received=%d/%d ier=%02x dma_t/rx/err=%u/%u/%u cnt=%ld\n",
+		       iir, status, dma_count, rx_count, ier, tx_running, rx_running, tx_err, irq_debug);
 
 	return 1;
 }
diff --git a/include/linux/serial_8250.h b/include/linux/serial_8250.h
index 741ed4807a9c..4a0c0d549d7b 100644
--- a/include/linux/serial_8250.h
+++ b/include/linux/serial_8250.h
@@ -107,6 +107,9 @@ struct uart_8250_port {
 	unsigned char		mcr;
 	unsigned char		cur_iotype;	/* Running I/O type */
 	unsigned int		rpm_tx_active;
+	long			irq_debug;
+	int			rx_chars_count;
+	int			dma_rx_count;
 	unsigned char		canary;		/* non-zero during system sleep
 						 *   if no_console_suspend
 						 */

-- 
tg: (571079f5ba93..) debug/dma-rx-problem (depends on: tty-next)

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-15 14:47             ` Ilpo Järvinen
@ 2023-03-16 20:42               ` Hans de Goede
  2023-03-17 10:38                 ` Ilpo Järvinen
  0 siblings, 1 reply; 10+ messages in thread
From: Hans de Goede @ 2023-03-16 20:42 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

Hi,

On 3/15/23 15:47, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 17:55, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>> On 3/14/23 12:48, Ilpo Järvinen wrote:
>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>>>>>
>>>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>>>>>> and Cherry Trail devices.
>>>>>>>>
>>>>>>>> After much debugging I found out that sometimes the first byte of
>>>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>>>>>
>>>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>>>>>
>>>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>>>
>>>>>>>> would sometimes turn into:
>>>>>>>>
>>>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>>>
>>>>>>>> Further investigation revealed that this goes away if I stop
>>>>>>>> the dw_dmac module from loading, leading to:
>>>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>>>>>> and the UART working without DMA support.
>>>>>>>>
>>>>>>>> Testing various kernels showed that this problem was introduced
>>>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>>>>>
>>>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>>>>>
>>>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>>>>>
>>>>>>>> So it seems that that commit somehow interferes with DMA based
>>>>>>
>>>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
>>>>>>> causes the confusion on the rx side. So that would be an extra call into 
>>>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
>>>>>>> would not occur w/o that tx side change.
>>>>>>
>>>>>> That sounds like a likely candidate for causing this yes, as said
>>>>>> I'm unfamiliar with the serial-port code, but I did already suspect
>>>>>> that the change was causing some extra interrupt which somehow
>>>>>> interfered with the rx side.
>>>>>>
>>>>>>>> The issue has been seen on and the revert has been tested on
>>>>>>>> the following HW:
>>>>>>>>
>>>>>>>> Asus T100TA
>>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>>>
>>>>>>>> Lenovo Yoga Tablet 2 1051L
>>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>>>
>>>>>>>> Lenovo Yoga Book X91F
>>>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>>>>>
>>>>>>>> I have more hw which I believe is affected but these are the models
>>>>>>>> where I have done detailed testing.
>>>>>>>>
>>>>>>>> I would be happy to test any patches, or run a kernel with some extra
>>>>>>>> debugging added, just let me know what you need to help fixing this.
> 
>>> From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
>>> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
>>>
>>> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
>>> connection failed due corrupted Rx payload. The problem was narrowed
>>> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
>>> despite LSR having DR bit set, which is precondition for attempting to 
>>> start DMA Rx.
>>>
>>> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
>>> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
>>> using THRI interrupt.
>>>
>>> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
>>> interrupt which has Rx related IIR value.
>>>
>>> Reported-by: Hans de Goede <hdegoede@redhat.com>
>>> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
>>
>> I can confirm that this fixes things for me:
>>
>> Tested-by: Hans de Goede <hdegoede@redhat.com>
> 
> Okay, thanks for testing.
> 
> Here's is a slightly improved debug patch which will count how many 
> characters are received by DMA and non-DMA rx. It should be tested 
> WITHOUT the fix.
> 
> I'm mostly interested in knowing if it's purely DMA Rx issue or whether 
> the non-DMA rx is muddling the waters too. While investigating other 
> issues I've seen UART_IIR_TIMEOUT (inter-character timeout) to often come 
> so early from UART that the tail of characters is left there to be handled 
> by the non-DMA path.

Ok, here is a new log:

https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue-2

With one successful hci_uart probe and one failed. The failed one looks interesting:

[  576.808776] 8250irq: iir=cc lsr+saved=60 received=1/12 ier=0d dma_t/rx/err=0/1/0 cnt=-4
[  576.808870] Bluetooth: hci0: Frame reassembly failed (-84)

This is the only time that debug message shows anything but 0 for data received over dma. It looks like for all these small packets dma simply never kicks in and the extra IRQ from the THRI interrupt somehow starts a DMA transfer for the one byte and for some reason that DMA transfer always reads the byte as 0.

Note that the amount of received bytes is still correct, so the 1 byte transfer by DMA replaces one byte which would be otherwise read from MMIO (I guess its MMIO?), but it has the wrong (all bits 0) content.

Regards,

Hans



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
  2023-03-16 20:42               ` Hans de Goede
@ 2023-03-17 10:38                 ` Ilpo Järvinen
  0 siblings, 0 replies; 10+ messages in thread
From: Ilpo Järvinen @ 2023-03-17 10:38 UTC (permalink / raw)
  To: Hans de Goede
  Cc: Greg Kroah-Hartman, Jiri Slaby, open list:SERIAL DRIVERS, regressions

[-- Attachment #1: Type: text/plain, Size: 6469 bytes --]

On Thu, 16 Mar 2023, Hans de Goede wrote:
> On 3/15/23 15:47, Ilpo Järvinen wrote:
> > On Tue, 14 Mar 2023, Hans de Goede wrote:
> >> On 3/14/23 17:55, Ilpo Järvinen wrote:
> >>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>> On 3/14/23 12:48, Ilpo Järvinen wrote:
> >>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
> >>>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
> >>>>>>>
> >>>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
> >>>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
> >>>>>>>> and Cherry Trail devices.
> >>>>>>>>
> >>>>>>>> After much debugging I found out that sometimes the first byte of
> >>>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
> >>>>>>>>
> >>>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
> >>>>>>>>
> >>>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>>>>>
> >>>>>>>> would sometimes turn into:
> >>>>>>>>
> >>>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
> >>>>>>>>
> >>>>>>>> Further investigation revealed that this goes away if I stop
> >>>>>>>> the dw_dmac module from loading, leading to:
> >>>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
> >>>>>>>> and the UART working without DMA support.
> >>>>>>>>
> >>>>>>>> Testing various kernels showed that this problem was introduced
> >>>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
> >>>>>>>>
> >>>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
> >>>>>>>>
> >>>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
> >>>>>>>
> >>>>>>>> So it seems that that commit somehow interferes with DMA based
> >>>>>>
> >>>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
> >>>>>>> causes the confusion on the rx side. So that would be an extra call into 
> >>>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
> >>>>>>> would not occur w/o that tx side change.
> >>>>>>
> >>>>>> That sounds like a likely candidate for causing this yes, as said
> >>>>>> I'm unfamiliar with the serial-port code, but I did already suspect
> >>>>>> that the change was causing some extra interrupt which somehow
> >>>>>> interfered with the rx side.
> >>>>>>
> >>>>>>>> The issue has been seen on and the revert has been tested on
> >>>>>>>> the following HW:
> >>>>>>>>
> >>>>>>>> Asus T100TA
> >>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>>>>>
> >>>>>>>> Lenovo Yoga Tablet 2 1051L
> >>>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
> >>>>>>>>
> >>>>>>>> Lenovo Yoga Book X91F
> >>>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
> >>>>>>>>
> >>>>>>>> I have more hw which I believe is affected but these are the models
> >>>>>>>> where I have done detailed testing.
> >>>>>>>>
> >>>>>>>> I would be happy to test any patches, or run a kernel with some extra
> >>>>>>>> debugging added, just let me know what you need to help fixing this.
> > 
> >>> From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> >>> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
> >>>
> >>> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> >>> connection failed due corrupted Rx payload. The problem was narrowed
> >>> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
> >>> despite LSR having DR bit set, which is precondition for attempting to 
> >>> start DMA Rx.
> >>>
> >>> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> >>> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> >>> using THRI interrupt.
> >>>
> >>> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> >>> interrupt which has Rx related IIR value.
> >>>
> >>> Reported-by: Hans de Goede <hdegoede@redhat.com>
> >>> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> >>
> >> I can confirm that this fixes things for me:
> >>
> >> Tested-by: Hans de Goede <hdegoede@redhat.com>
> > 
> > Okay, thanks for testing.
> > 
> > Here's is a slightly improved debug patch which will count how many 
> > characters are received by DMA and non-DMA rx. It should be tested 
> > WITHOUT the fix.
> > 
> > I'm mostly interested in knowing if it's purely DMA Rx issue or whether 
> > the non-DMA rx is muddling the waters too. While investigating other 
> > issues I've seen UART_IIR_TIMEOUT (inter-character timeout) to often come 
> > so early from UART that the tail of characters is left there to be handled 
> > by the non-DMA path.
> 
> Ok, here is a new log:
> 
> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue-2
> 
> With one successful hci_uart probe and one failed. The failed one looks interesting:
> 
> [  576.808776] 8250irq: iir=cc lsr+saved=60 received=1/12 ier=0d dma_t/rx/err=0/1/0 cnt=-4
> [  576.808870] Bluetooth: hci0: Frame reassembly failed (-84)
> 
> This is the only time that debug message shows anything but 0 for data 
> received over dma.

For the record, there were plenty of =1/6 and a few =224/31.

> It looks like for all these small packets dma simply 
> never kicks in and the extra IRQ from the THRI interrupt somehow starts 
> a DMA transfer for the one byte and for some reason that DMA transfer 
> always reads the byte as 0. 

It would be kind of interesting to know if handle_rx_dma() would never 
call up->dma->rx_dma(up) but return true instead to see if non-DMA Rx gets
a corrupted byte too in the very same scenario (it's not the same test as 
your no DMA test you did initially because only DMA Rx gets disabled so it
hopefully hits the same timing, etc. by keeping DMA Tx on).

> Note that the amount of received bytes is still correct, so the 1 byte 
> transfer by DMA replaces one byte which would be otherwise read from 
> MMIO (I guess its MMIO?), but it has the wrong (all bits 0) content.

Yes, it's certainly is a corruption which shouldn't occur.

I was contemplating whether to apply the fix to all 8250 or just add 
HW specific UART_BUG_xx for this. In the end (especially while looking 
into these most recent debug logs) I think it's generally useful to _not_ 
start DMA on UART_IIR_THRI because it's impossible to make an informed 
decision with it whether to do DMA or not.

I've just sent the cleaned up fix patch out.

Thanks a lot for testing.


-- 
 i.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-03-17 10:38 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-14 10:20 [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx Hans de Goede
2023-03-14 10:55 ` Ilpo Järvinen
2023-03-14 11:11   ` Hans de Goede
2023-03-14 11:48     ` Ilpo Järvinen
2023-03-14 16:04       ` Hans de Goede
2023-03-14 16:55         ` Ilpo Järvinen
2023-03-14 19:02           ` Hans de Goede
2023-03-15 14:47             ` Ilpo Järvinen
2023-03-16 20:42               ` Hans de Goede
2023-03-17 10:38                 ` Ilpo Järvinen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).