From: Hans de Goede <hdegoede@redhat.com>
To: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Jiri Slaby <jirislaby@kernel.org>,
"open list:SERIAL DRIVERS" <linux-serial@vger.kernel.org>,
"regressions@lists.linux.dev" <regressions@lists.linux.dev>
Subject: Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
Date: Tue, 14 Mar 2023 20:02:57 +0100 [thread overview]
Message-ID: <00de13b2-0ed9-efe2-e8d8-c9370c04e80b@redhat.com> (raw)
In-Reply-To: <f729bfb4-cde1-9859-f1ff-58716772dfc4@linux.intel.com>
Hi,
On 3/14/23 17:55, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 12:48, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>>>
>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>>>> and Cherry Trail devices.
>>>>>>
>>>>>> After much debugging I found out that sometimes the first byte of
>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>>>
>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>>>
>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> would sometimes turn into:
>>>>>>
>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> Further investigation revealed that this goes away if I stop
>>>>>> the dw_dmac module from loading, leading to:
>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>>>> and the UART working without DMA support.
>>>>>>
>>>>>> Testing various kernels showed that this problem was introduced
>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>>>
>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>>>
>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>>>
>>>>> You did something else too than just that because you cannot cleanly
>>>>> revert just e8ffbb71f783. Please indicate what happened to:
>>>>> f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
>>>>>
>>>>> I guess you reverted that too and forgot to mention about it but I just
>>>>> want to be sure we're on the same page?
>>>>
>>>> I manually fixed up the revert, effectively
>>>> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
>>>> I did not revert f8d6e9d3ca5c in its entirety.
>>>>
>>>> I've attached my manual fixed up revert as a patch here.
>>>>
>>>>>> So it seems that that commit somehow interferes with DMA based
>>>>>> data receiving, causing the first byte of a received data transfer
>>>>>> to get replaced by 0.
>>>>>
>>>>> Okay, and you're sure the problem/corruption occurs on the receiving side?
>>>>
>>>> What I am sure is that the first byte of a packet has been replaced by 0
>>>> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
>>>>
>>>> I did not dive into the serial-port code side of this problem since
>>>> I'm unfamiliar with that.
>>>>
>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow
>>>>> causes the confusion on the rx side. So that would be an extra call into
>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that
>>>>> would not occur w/o that tx side change.
>>>>
>>>> That sounds like a likely candidate for causing this yes, as said
>>>> I'm unfamiliar with the serial-port code, but I did already suspect
>>>> that the change was causing some extra interrupt which somehow
>>>> interfered with the rx side.
>>>>
>>>>>> The issue has been seen on and the revert has been tested on
>>>>>> the following HW:
>>>>>>
>>>>>> Asus T100TA
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Tablet 2 1051L
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Book X91F
>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>>>
>>>>>> I have more hw which I believe is affected but these are the models
>>>>>> where I have done detailed testing.
>>>>>>
>>>>>> I would be happy to test any patches, or run a kernel with some extra
>>>>>> debugging added, just let me know what you need to help fixing this.
>>>>>
>>>>> How easy this is to trigger in general? (Mainly trying to gauge how
>>>>> easy it will be to find the read and/or the irq that related to the
>>>>> corrupted payload).
>>>>
>>>> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
>>>> calls, re-initializing and re-uploading the BT firmware 10 times and
>>>> then looked for "Frame reassembly failed" errors (+ more error caused
>>>> by this error) from the drivers/bluetooth/hci_bcm.c code.
>>>>
>>>> This would usually trigger at least twice with the 10 rmmod + modprobe
>>>> calls and typically more often then twice.
>>>>
>>>> With the revert (and with 5.18.y) I never see a single error even
>>>> with more rmmod + modprobe calls. Note it is not just the driver's
>>>> probe() which fails, sometimes there are also later errors, this
>>>> is just a somewhat convenient way to reproduce.
>>>>
>>>> So this is somewhat easy to trigger, not trivial to trigger,
>>>> but it also does not take waiting many hours before hitting it
>>>> once.
>>>
>>> Okay, maybe the patch below will manage to filter the interesting part of
>>> it (not tested beyond compiling). It should capture a few irq events after
>>> the THRI is set when DMA Tx completes.
>>>
>>> What I'd expect is to have iir value variation between the working and not
>>> working case. Especially, if it's UART_IIR_THRI that shows up that might
>>> be usable as a condition to return from handle_rx_dma() without touching
>>> DMA Rx.
>>
>> Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
>> the revert and with your dmesg patch.
>>
>> This is (part of) the HCI init without issues:
>>
>> [ 733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [ 733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [ 733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 734.081492] Bluetooth: hci0: BCM: chip id 101
>>
>> Here is another *successful* log with slightly different timing.
>>
>> [ 737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [ 737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [ 737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 738.050142] Bluetooth: hci0: BCM: chip id 101
>>
>> And here is the same part of the init sequence with an error:
>>
>> [ 771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [ 771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [ 771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [ 771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [ 771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
>> [ 771.632568] Bluetooth: hci0: Frame reassembly failed (-84)
>>
>> Here is a full dmesg from several rmmod + modprobes, for the errors look for
>> "Frame reassembly failed".
>>
>> Note this is from a different boot, I had to add log_buf_len=50M to be able
>> to grab a full dmesg:
>>
>> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue
>
> Thanks. I'll try to come up with another debug patch tomorrow to further
> inspect the payload that will be read after that point.
>
> In the meantime you could try if the following patch is enough
> to solve the issue.
>
>
> From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
>
> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> connection failed due corrupted Rx payload. The problem was narrowed
> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs
> despite LSR having DR bit set, which is precondition for attempting to
> start DMA Rx.
>
> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> using THRI interrupt.
>
> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> interrupt which has Rx related IIR value.
>
> Reported-by: Hans de Goede <hdegoede@redhat.com>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
I can confirm that this fixes things for me:
Tested-by: Hans de Goede <hdegoede@redhat.com>
Regards,
Hans
>
> ---
> drivers/tty/serial/8250/8250_port.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> index fa43df05342b..a3ef2f42a06b 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1903,6 +1903,8 @@ EXPORT_SYMBOL_GPL(serial8250_modem_status);
> static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
> {
> switch (iir & 0x3f) {
> + case UART_IIR_THRI:
> + return false;
> case UART_IIR_RDI:
> if (!up->dma->rx_running)
> break;
>
next prev parent reply other threads:[~2023-03-14 19:03 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-14 10:20 [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx Hans de Goede
2023-03-14 10:55 ` Ilpo Järvinen
2023-03-14 11:11 ` Hans de Goede
2023-03-14 11:48 ` Ilpo Järvinen
2023-03-14 16:04 ` Hans de Goede
2023-03-14 16:55 ` Ilpo Järvinen
2023-03-14 19:02 ` Hans de Goede [this message]
2023-03-15 14:47 ` Ilpo Järvinen
2023-03-16 20:42 ` Hans de Goede
2023-03-17 10:38 ` Ilpo Järvinen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=00de13b2-0ed9-efe2-e8d8-c9370c04e80b@redhat.com \
--to=hdegoede@redhat.com \
--cc=gregkh@linuxfoundation.org \
--cc=ilpo.jarvinen@linux.intel.com \
--cc=jirislaby@kernel.org \
--cc=linux-serial@vger.kernel.org \
--cc=regressions@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).