On Wed, 2020-05-20 at 16:50 +0300, Mathias Nyman wrote: > On 20.5.2020 14.26, Rik van Riel wrote: > > After a few more weeks of digging, I have come to the tentative > > conclusion that either the XHCI driver, or the USB sound driver, > > or both, fail to handle USB errors correctly. > > > > I have some questions at the bottom, after a (brief-ish) > > explanation > > of exactly what seems to go wrong. > > > > TL;DR: arecord from a misbehaving device can hang forever > > after a USB error, due to poll on /dev/snd/timer never returning. > > > > The details: under some mysterious circumstances, the PCM290x > > family sound chips can send more data than expected during an > > isochronous transfer, leading to a babble error. Those > > circumstances seem to in part depend on the USB host controller > > and/or the electrical environment, since the chips work just > > fine for most people. > > > > Receiving data past the end of the isochronous transfer window > > scheduled for a device results in the XHCI controller throwing > > a babble error, which moves the endpoint into halted state. > > > > This is followed by the host controller software sending a > > reset endpoint command, and moving the endpoint into stopped > > state, as specified on pages 164-165 of the XHCI specification. > > Note that isoch endpoints should generate buffer overrun instead of > babble detect error on TD babble conditions. > See xHCI spec 6.4.5 additional note 115. Maybe it should, but I'm hitting this printk in handle_tx_event: case COMP_BABBLE_DETECTED_ERROR: xhci_dbg(xhci, "Babble error for slot %u ep %u on endpoint\n", slot_id, ep_index); status = -EOVERFLOW; break; Tracing in the sound driver suggests that packet belongs to the sound data device (not the sync device), because the URB with -EOVERFLOW status for the isochronous packet in that URB is the last one seen there. > Or maybe a frame babble could halt an isoc endpoint, see xhci > 4.10.2.4.1 > but then you should see a port error and port going to disabled > state. > > Any logs of this? > > mount -t debugfs none /sys/kernel/debug > echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control > echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control > echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb > echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable > < trigger the issue > > Send output of dmesg > Send content of /sys/kernel/debug/tracing/trace dmesg: https://drive.google.com/open?id=1S2Qc8lroqA5-RMukuLBLWFGx10vEjG-i usbtrace: https://drive.google.com/open?id=1cbLcOnAtQRW0Chgak6PNC0l4yJv__4uO > > However, the USB sound driver seems to have no idea that this > > error happened. The function retire_capture_urb looks at the > > status of each isochronous frame, but seems to be under the > > assumption that the sound device just keeps on running. > > > > The function snd_complete_urb seems to only detect that the > > device is not running if usb_submit_urb returns a failure. > > > > err = usb_submit_urb(urb, GFP_ATOMIC); > > if (err == 0) > > return; > > > > usb_audio_err(ep->chip, "cannot submit urb (err = %d)\n", > > err); > > > > if (ep->data_subs && ep->data_subs->pcm_substream) { > > substream = ep->data_subs->pcm_substream; > > snd_pcm_stop_xrun(substream); > > } > > > > However, the XHCI driver will happily submit an URB to a > > stopped device. Looking at the call trace usb_submit_urb -> > > xhci_urb_enqueue -> xhci_queue_isoc_tx_prepare -> prepare_ring, > > you can see this code: > > > > /* Make sure the endpoint has been added to xHC schedule */ > > switch (ep_state) { > > ... > > case EP_STATE_HALTED: > > xhci_dbg(xhci, "WARN halted endpoint, queueing URB > > anyway.\n"); > > case EP_STATE_STOPPED: > > case EP_STATE_RUNNING: > > break; > > > > This leads me to a few questions: > > - should retire_capture_urb call snd_pcm_stop_xrun, > > or another function like it, if it sees certain > > errors in the iso frame in the URB? > > - should snd_complete_urb do something with these > > errors, too, in case they happen on the sync frames > > and not the data frames? > > - does the XHCI code need to ring the doorbell when > > submitting an URB to a stopped device, or is it > > always up to the higher-level driver to fully reset > > the device before it can do anything useful? > > xhci driver should ring the doorbell. > > xhci_queue_isoc_tx() > giveback_first_trb() > xhci_ring_ep_doorbell() > > when we are talking about babble or transaction errors the device > might be > completely unaware of the situation. Device side of the endpoint is > not necessarily halted. > xHC host will only halt its internal endpoint state, and it needs a > reset > endopoint command from the xhci driver to clear the internal halt > state. Interesting. I had no idea the endpoint on each side was controlled separately. -- All Rights Reversed.