Re: XHCI vs PCM2903B/PCM2904 part 2

From: Rik van Riel <riel@surriel.com>
To: Mathias Nyman <mathias.nyman@linux.intel.com>,
	linux-usb <linux-usb@vger.kernel.org>
Cc: alsa-devel@alsa-project.org,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Mathias Nyman <mathias.nyman@intel.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Jaroslav Kysela <perex@perex.cz>, Takashi Iwai <tiwai@suse.com>
Subject: Re: XHCI vs PCM2903B/PCM2904 part 2
Date: Wed, 20 May 2020 15:53:25 -0400	[thread overview]
Message-ID: <438394b419955a430d884a175a6eda0a99ce4d08.camel@surriel.com> (raw)
In-Reply-To: <296d1b03-fbc3-6107-950e-00d38633ea1f@linux.intel.com>

[-- Attachment #1: Type: text/plain, Size: 5293 bytes --]

On Wed, 2020-05-20 at 16:50 +0300, Mathias Nyman wrote:
> On 20.5.2020 14.26, Rik van Riel wrote:
> > After a few more weeks of digging, I have come to the tentative
> > conclusion that either the XHCI driver, or the USB sound driver,
> > or both, fail to handle USB errors correctly.
> > 
> > I have some questions at the bottom, after a (brief-ish)
> > explanation
> > of exactly what seems to go wrong.
> > 
> > TL;DR: arecord from a misbehaving device can hang forever
> > after a USB error, due to poll on /dev/snd/timer never returning.
> > 
> > The details: under some mysterious circumstances, the PCM290x
> > family sound chips can send more data than expected during an
> > isochronous transfer, leading to a babble error. Those
> > circumstances seem to in part depend on the USB host controller
> > and/or the electrical environment, since the chips work just
> > fine for most people.
> > 
> > Receiving data past the end of the isochronous transfer window
> > scheduled for a device results in the XHCI controller throwing
> > a babble error, which moves the endpoint into halted state.
> > 
> > This is followed by the host controller software sending a
> > reset endpoint command, and moving the endpoint into stopped
> > state, as specified on pages 164-165 of the XHCI specification.
> 
> Note that isoch endpoints should generate buffer overrun instead of
> babble detect error on TD babble conditions. 
> See xHCI spec 6.4.5 additional note 115.

Maybe it should, but I'm hitting this printk in handle_tx_event:

        case COMP_BABBLE_DETECTED_ERROR:
                xhci_dbg(xhci, "Babble error for slot %u ep %u on
endpoint\n",
                         slot_id, ep_index);
                status = -EOVERFLOW;
                break;

Tracing in the sound driver suggests that packet belongs to
the sound data device (not the sync device), because the
URB with -EOVERFLOW status for the isochronous packet in that
URB is the last one seen there.

> Or maybe a frame babble could halt an isoc endpoint, see xhci
> 4.10.2.4.1
> but then you should see a port error and port going to disabled
> state.
> 
> Any logs of this?
> 
> mount -t debugfs none /sys/kernel/debug
> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> < trigger the issue >
> Send output of dmesg
> Send content of /sys/kernel/debug/tracing/trace

dmesg: 
https://drive.google.com/open?id=1S2Qc8lroqA5-RMukuLBLWFGx10vEjG-i

usbtrace: 
https://drive.google.com/open?id=1cbLcOnAtQRW0Chgak6PNC0l4yJv__4uO

> > However, the USB sound driver seems to have no idea that this
> > error happened. The function retire_capture_urb looks at the
> > status of each isochronous frame, but seems to be under the
> > assumption that the sound device just keeps on running.
> > 
> > The function snd_complete_urb seems to only detect that the
> > device is not running if usb_submit_urb returns a failure.
> > 
> >         err = usb_submit_urb(urb, GFP_ATOMIC);
> >         if (err == 0)
> >                 return;
> > 
> >         usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n",
> > err);
> > 
> >         if (ep->data_subs && ep->data_subs->pcm_substream) {
> >                 substream = ep->data_subs->pcm_substream;
> >                 snd_pcm_stop_xrun(substream);
> >         }
> > 
> > However, the XHCI driver will happily submit an URB to a
> > stopped device. Looking at the call trace usb_submit_urb ->
> > xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring,
> > you can see this code:
> > 
> >         /* Make sure the endpoint has been added to xHC schedule */
> >         switch (ep_state) {
> > ...
> >         case EP_STATE_HALTED:
> >                 xhci_dbg(xhci, "WARN halted endpoint, queueing URB
> > anyway.\n");
> >         case EP_STATE_STOPPED:
> >         case EP_STATE_RUNNING:
> >                 break;
> > 
> > This leads me to a few questions:
> > - should retire_capture_urb call snd_pcm_stop_xrun,
> >   or another function like it, if it sees certain
> >   errors in the iso frame in the URB?
> > - should snd_complete_urb do something with these
> >   errors, too, in case they happen on the sync frames
> >   and not the data frames?
> > - does the XHCI code need to ring the doorbell when
> >   submitting an URB to a stopped device, or is it
> >   always up to the higher-level driver to fully reset
> >   the device before it can do anything useful?
> 
> xhci driver should ring the doorbell.
> 
> xhci_queue_isoc_tx()
>   giveback_first_trb()
>     xhci_ring_ep_doorbell()
> 
> when we are talking about babble or transaction errors the device
> might be
> completely unaware of the situation. Device side of the endpoint is
> not necessarily halted.
> xHC host will only halt its internal endpoint state, and it needs a
> reset
> endopoint command from the xhci driver to clear the internal halt
> state. 

Interesting. I had no idea the endpoint on each side
was controlled separately.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]