regressions.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Hans de Goede <hdegoede@redhat.com>
To: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jiri Slaby <jirislaby@kernel.org>,
	"open list:SERIAL DRIVERS" <linux-serial@vger.kernel.org>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>
Subject: Re: [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx
Date: Tue, 14 Mar 2023 20:02:57 +0100	[thread overview]
Message-ID: <00de13b2-0ed9-efe2-e8d8-c9370c04e80b@redhat.com> (raw)
In-Reply-To: <f729bfb4-cde1-9859-f1ff-58716772dfc4@linux.intel.com>

Hi,

On 3/14/23 17:55, Ilpo Järvinen wrote:
> On Tue, 14 Mar 2023, Hans de Goede wrote:
>> On 3/14/23 12:48, Ilpo Järvinen wrote:
>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>> On 3/14/23 11:55, Ilpo Järvinen wrote:
>>>>> On Tue, 14 Mar 2023, Hans de Goede wrote:
>>>>>
>>>>>> I have spend the last couple of days debugging a problem with Bluetooth
>>>>>> adapters (HCIs) connected over an UART connection on Intel Bay Trail
>>>>>> and Cherry Trail devices.
>>>>>>
>>>>>> After much debugging I found out that sometimes the first byte of
>>>>>> a received packet (data[0]) would be overwritten with a 0 byte.
>>>>>>
>>>>>> E.g. this packet received during init of a BCM4324B3 (1) Bluetooth HCI:
>>>>>>
>>>>>> 04 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> would sometimes turn into:
>>>>>>
>>>>>> 00 0e 0a 01 79 fc 00 54 fe ff ff 00 00
>>>>>>
>>>>>> Further investigation revealed that this goes away if I stop
>>>>>> the dw_dmac module from loading, leading to:
>>>>>> "dw-apb-uart 80860F0A:00: failed to request DMA"
>>>>>> and the UART working without DMA support.
>>>>>>
>>>>>> Testing various kernels showed that this problem was introduced
>>>>>> in v5.19, v5.18 - v5.18.19 are fine. An a git bisect points to:
>>>>>>
>>>>>> e8ffbb71f783 ("serial: 8250: use THRE & __stop_tx also with DMA")
>>>>>>
>>>>>> And reverting that on top of v6.3-rc2 indeed solves the problem.
>>>>>
>>>>> You did something else too than just that because you cannot cleanly 
>>>>> revert just e8ffbb71f783. Please indicate what happened to:
>>>>>   f8d6e9d3ca5c ("serial: 8250: Fix __stop_tx() & DMA Tx restart races")
>>>>>
>>>>> I guess you reverted that too and forgot to mention about it but I just 
>>>>> want to be sure we're on the same page?
>>>>
>>>> I manually fixed up the revert, effectively
>>>> dropping the drivers/tty/serial/8250/8250_port.c part of f8d6e9d3ca5c
>>>> I did not revert f8d6e9d3ca5c in its entirety.
>>>>
>>>> I've attached my manual fixed up revert as a patch here.
>>>>
>>>>>> So it seems that that commit somehow interferes with DMA based
>>>>>> data receiving, causing the first byte of a received data transfer
>>>>>> to get replaced by 0.
>>>>>
>>>>> Okay, and you're sure the problem/corruption occurs on the receiving side? 
>>>>
>>>> What I am sure is that the first byte of a packet has been replaced by 0
>>>> by the time drivers/tty/tty_buffer.c: receive_buf() gets called.
>>>>
>>>> I did not dive into the serial-port code side of this problem since
>>>> I'm unfamiliar with that.
>>>>
>>>>> Maybe the the extra interrupt that the tx side change will trigger somehow 
>>>>> causes the confusion on the rx side. So that would be an extra call into 
>>>>> handle_rx_dma() that could either do an extra flush or start DMA Rx that 
>>>>> would not occur w/o that tx side change.
>>>>
>>>> That sounds like a likely candidate for causing this yes, as said
>>>> I'm unfamiliar with the serial-port code, but I did already suspect
>>>> that the change was causing some extra interrupt which somehow
>>>> interfered with the rx side.
>>>>
>>>>>> The issue has been seen on and the revert has been tested on
>>>>>> the following HW:
>>>>>>
>>>>>> Asus T100TA
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Tablet 2 1051L
>>>>>> SoC: Bay Trail UART: 80860F0A WIFI: brcmfmac43241b4-sdio BT: BCM4324B3
>>>>>>
>>>>>> Lenovo Yoga Book X91F
>>>>>> Soc: Cherry Trail UART: 8086228A WIFI: brcmfmac4356-pcie BT: BCM4356A2
>>>>>>
>>>>>> I have more hw which I believe is affected but these are the models
>>>>>> where I have done detailed testing.
>>>>>>
>>>>>> I would be happy to test any patches, or run a kernel with some extra
>>>>>> debugging added, just let me know what you need to help fixing this.
>>>>>
>>>>> How easy this is to trigger in general? (Mainly trying to gauge how 
>>>>> easy it will be to find the read and/or the irq that related to the 
>>>>> corrupted payload).
>>>>
>>>> For the git bisect I did 10 consecutive "rmmod hci_uart; modprobe hci_uart"
>>>> calls, re-initializing and re-uploading the BT firmware 10 times and
>>>> then looked for "Frame reassembly failed" errors (+ more error caused
>>>> by this error) from the drivers/bluetooth/hci_bcm.c code.
>>>>
>>>> This would usually trigger at least twice with the 10 rmmod + modprobe
>>>> calls and typically more often then twice.
>>>>
>>>> With the revert (and with 5.18.y) I never see a single error even
>>>> with more rmmod + modprobe calls. Note it is not just the driver's
>>>> probe() which fails, sometimes there are also later errors, this
>>>> is just a somewhat convenient way to reproduce.
>>>>
>>>> So this is somewhat easy to trigger, not trivial to trigger,
>>>> but it also does not take waiting many hours before hitting it
>>>> once.
>>>
>>> Okay, maybe the patch below will manage to filter the interesting part of 
>>> it (not tested beyond compiling). It should capture a few irq events after 
>>> the THRI is set when DMA Tx completes.
>>>
>>> What I'd expect is to have iir value variation between the working and not 
>>> working case. Especially, if it's UART_IIR_THRI that shows up that might 
>>> be usable as a condition to return from handle_rx_dma() without touching
>>> DMA Rx.
>>
>> Here are 2 dmesg snippets both are with a 6.3-rc2 kernel *without*
>> the revert and with your dmesg patch.
>>
>> This is (part of) the HCI init without issues:
>>
>> [  733.682492] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  733.682628] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  733.927142] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.928311] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  733.932892] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.933818] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  733.970404] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  733.975558] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  734.080290] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  734.081300] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  734.081492] Bluetooth: hci0: BCM: chip id 101
>>
>> Here is another *successful* log with slightly different timing.
>>
>> [  737.660532] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  737.660681] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  737.905238] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.906429] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  737.907782] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.908978] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  737.935929] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  737.940976] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  738.048476] 8250irq: iir=cc lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  738.049530] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  738.049678] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  738.050142] Bluetooth: hci0: BCM: chip id 101
>>
>> And here is the same part of the init sequence with an error:
>>
>> [  771.247538] hci_uart_bcm serial0-0: supply vbat not found, using dummy regulator
>> [  771.247656] hci_uart_bcm serial0-0: supply vddio not found, using dummy regulator
>> [  771.493259] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.494378] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.495637] 8250irq: iir=c2 lsr+saved=20 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.496833] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.523597] 8250irq: iir=c2 lsr+saved=60 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.528708] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/0/0 skiprx=0 cnt=-4
>> [  771.632340] 8250irq: iir=c2 lsr+saved=61 ier=0f dma_t/rx/err=0/0/0 skiprx=0 cnt=-5
>> [  771.632462] 8250irq: iir=cc lsr+saved=60 ier=0d dma_t/rx/err=0/1/0 skiprx=0 cnt=-4
>> [  771.632568] Bluetooth: hci0: Frame reassembly failed (-84)
>>
>> Here is a full dmesg from several rmmod + modprobes, for the errors look for
>> "Frame reassembly failed".
>>
>> Note this is from a different boot, I had to add log_buf_len=50M to be able
>> to grab a full dmesg:
>>
>> https://fedorapeople.org/~jwrdegoede/dmesg-8250-dma-issue
> 
> Thanks. I'll try to come up with another debug patch tomorrow to further 
> inspect the payload that will be read after that point.
> 
> In the meantime you could try if the following patch is enough 
> to solve the issue.
> 
> 
> From: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Subject: [PATCH] serial: 8250: Prevent starting up DMA Rx on THRI interrupt
> 
> Hans de Goede reported Bluetooth adapters (HCIs) connected over an UART
> connection failed due corrupted Rx payload. The problem was narrowed
> down to DMA Rx starting on UART_IIR_THRI interrupt. The problem occurs 
> despite LSR having DR bit set, which is precondition for attempting to 
> start DMA Rx.
> 
> This problem became apparent after e8ffbb71f783 ("serial: 8250: use
> THRE & __stop_tx also with DMA") changed Tx stopping to get triggered
> using THRI interrupt.
> 
> Don't setup DMA Rx on UART_IIR_THRI but leave it to a subsequent
> interrupt which has Rx related IIR value.
> 
> Reported-by: Hans de Goede <hdegoede@redhat.com>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

I can confirm that this fixes things for me:

Tested-by: Hans de Goede <hdegoede@redhat.com>

Regards,

Hans



> 
> ---
>  drivers/tty/serial/8250/8250_port.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/tty/serial/8250/8250_port.c b/drivers/tty/serial/8250/8250_port.c
> index fa43df05342b..a3ef2f42a06b 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1903,6 +1903,8 @@ EXPORT_SYMBOL_GPL(serial8250_modem_status);
>  static bool handle_rx_dma(struct uart_8250_port *up, unsigned int iir)
>  {
>  	switch (iir & 0x3f) {
> +	case UART_IIR_THRI:
> +		return false;
>  	case UART_IIR_RDI:
>  		if (!up->dma->rx_running)
>  			break;
> 


  reply	other threads:[~2023-03-14 19:03 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-14 10:20 [Regression] "serial: 8250: use THRE & __stop_tx also with DMA" breaks DMA rx Hans de Goede
2023-03-14 10:55 ` Ilpo Järvinen
2023-03-14 11:11   ` Hans de Goede
2023-03-14 11:48     ` Ilpo Järvinen
2023-03-14 16:04       ` Hans de Goede
2023-03-14 16:55         ` Ilpo Järvinen
2023-03-14 19:02           ` Hans de Goede [this message]
2023-03-15 14:47             ` Ilpo Järvinen
2023-03-16 20:42               ` Hans de Goede
2023-03-17 10:38                 ` Ilpo Järvinen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=00de13b2-0ed9-efe2-e8d8-c9370c04e80b@redhat.com \
    --to=hdegoede@redhat.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=jirislaby@kernel.org \
    --cc=linux-serial@vger.kernel.org \
    --cc=regressions@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).