All of lore.kernel.org
 help / color / mirror / Atom feed
* "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection
@ 2024-01-12 22:52 Michał Pecio
  2024-01-13 20:47 ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-12 22:52 UTC (permalink / raw)
  To: linux-usb

Hi,


This is a long standing issue seen in many kernel versions up to 6.7.

Disconnecting a high- or super-speed UVC camera from USB 3.0 port during
recording prints a lot of these errors (usually much more than here).


This appears to come from handle_tx_event() and it is my understanding
that the function tries to look up a transfer request previously issued
by the driver which is pointed to by a completion event received from
the hardware, and the pointer is found to be out of expected bounds.

It is a consistent pattern that problematic event-dma pointers are
within segment bounds, but they start one slot past trb-end and grow.

Comments in the code suggest that "the HC is hosed", but I suspect the
real problem might be the driver failing to retire the _previous_ TD
after it completed with similar errors and continuing to wait for its
completion events forever, or something else with similar outcome.


Or, at any rate, I *hope* that it's a harmless bug and not some memory
corruption... I only have very basic understanding of XHCI.


Any thoughts or suggestions how to fix or debug this further?


Thanks,
Michal


# camera connected
[ 2968.873423] usb 8-1: new high-speed USB device number 54 using xhci_hcd
[ 2969.155383] usb 8-1: New USB device found, idVendor=0bda, idProduct=58b0, bcdDevice= 7.01
[ 2969.155393] usb 8-1: New USB device strings: Mfr=3, Product=1, SerialNumber=2
[ 2969.155396] usb 8-1: Product: FULL HD 1080P Webcam
[ 2969.155398] usb 8-1: Manufacturer: Generic
[ 2969.155400] usb 8-1: SerialNumber: 200901010001
[ 2969.162076] usb 8-1: Found UVC 1.00 device FULL HD 1080P Webcam (0bda:58b0)
# recording started
[ 2978.289442] usb 8-1: Device requested 3072 B/frame bandwidth
[ 2978.289461] usb 8-1: Selecting alternate setting 7 (3072 B/frame bandwidth)
[ 2978.291230] usb 8-1: Allocated 5 URB buffers of 32x3072 bytes each
[ 2978.910333] usb 8-1: Device requested 3072 B/frame bandwidth
[ 2978.910352] usb 8-1: Selecting alternate setting 7 (3072 B/frame bandwidth)
[ 2978.913549] usb 8-1: Allocated 5 URB buffers of 32x3072 bytes each
# camera disconnected
[ 2988.728338] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.728348] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d313890 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.728449] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.728452] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d3138a0 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.728527] usb 8-1: USB disconnect, device number 54
[ 2988.728573] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.728576] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d3138b0 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.728698] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.728701] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d3138c0 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.728841] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.728844] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d3138d0 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.728948] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.728951] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d3138e0 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.729073] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.729075] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d3138f0 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.729197] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.729199] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d313900 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.729322] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.729324] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d313910 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.729447] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.729450] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d313920 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.729572] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.729575] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d313930 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0
[ 2988.729697] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4
[ 2988.729700] xhci_hcd 0000:02:00.0: Looking for event-dma 000000012d313940 trb-start 000000012d313870 trb-end 000000012d313880 seg-start 000000012d313000 seg-end 000000012d313ff0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection
  2024-01-12 22:52 "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection Michał Pecio
@ 2024-01-13 20:47 ` Michał Pecio
  2024-01-14 14:06   ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-13 20:47 UTC (permalink / raw)
  To: linux-usb; +Cc: Mathias Nyman

CC'ing XHCI driver maintainer.

> Disconnecting a high- or super-speed UVC camera from USB 3.0 port
> during recording prints a lot of these errors.

I made progress on this issue and I know what happens, I only don't
know whose fault it is and what the proper solution should be.


The host controller, in case it's a hardware bug:
NEC Corporation uPD720200 USB 3.0 Host Controller [1033:0194] (rev 03)


And this is what happens when errors appear on a running isoch EP.
I added debug code which traces execution and prints full contents of
the failing TRBs.


# errors in single TDs at offsets 3c0 and 3d0 are dealt with cleanly

[ 8350.611162] xhci_hcd 0000:02:00.0: handle_tx_event: ep_trb_dma 108b043c0 comp_code 4
[ 8350.611166] xhci_hcd 0000:02:00.0: dump first TRB: 108b043c0 -> 000000010986e6a0 0004099c 80001424
[ 8350.611168] xhci_hcd 0000:02:00.0: dump last TRB:  108b043c0 -> 000000010986e6a0 0004099c 80001424
[ 8350.611171] xhci_hcd 0000:02:00.0: process_isoc_td comp_code 4 last_in_td 1
[ 8350.611172] xhci_hcd 0000:02:00.0: finish_td: comp_code 4

[ 8350.611287] xhci_hcd 0000:02:00.0: handle_tx_event: ep_trb_dma 108b043d0 comp_code 4
[ 8350.611291] xhci_hcd 0000:02:00.0: dump first TRB: 108b043d0 -> 000000010986f03c 0004099c 80001424
[ 8350.611293] xhci_hcd 0000:02:00.0: dump last TRB:  108b043d0 -> 000000010986f03c 0004099c 80001424
[ 8350.611295] xhci_hcd 0000:02:00.0: process_isoc_td comp_code 4 last_in_td 1
[ 8350.611297] xhci_hcd 0000:02:00.0: finish_td: comp_code 4

# then a chained TD appears at 3e0 and 3f0 and signals an error at 3e0

[ 8350.611412] xhci_hcd 0000:02:00.0: handle_tx_event: ep_trb_dma 108b043e0 comp_code 4
[ 8350.611416] xhci_hcd 0000:02:00.0: dump first TRB: 108b043e0 -> 000000010986f9d8 00040628 80001414
[ 8350.611418] xhci_hcd 0000:02:00.0: dump last TRB:  108b043f0 -> 0000000109870000 00000374 00000424
[ 8350.611421] xhci_hcd 0000:02:00.0: process_isoc_td comp_code 4 last_in_td 0

# process_isoc_td found that it isn't the last TRB in this TD and didn't call finish_td
# the driver still waits for 3f0 completion, but 400 arrives instead

[ 8350.611536] xhci_hcd 0000:02:00.0: handle_tx_event: ep_trb_dma 108b04400 comp_code 4
[ 8350.611540] xhci_hcd 0000:02:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 4

# now everything is out of sync and the last message repeats ad nauseam


To my layman eye the chained TD looks OK (and has the right total size
of 0x99c). So the question is if the hardware is right to signal failure
on the first TRD only, and how the driver should handle it?


Thanks,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection
  2024-01-13 20:47 ` Michał Pecio
@ 2024-01-14 14:06   ` Michał Pecio
  2024-01-15 13:58     ` Mathias Nyman
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-14 14:06 UTC (permalink / raw)
  To: linux-usb; +Cc: Mathias Nyman

Hi Mathias,

I found that the code which causes my problems was specifically added
by your commit d104d0152a97f ("xhci: fix isoc endpoint dequeue from
advancing too far on transaction error").

Reverting this change removes the disconnection spam on my two NEC
hosts (different boards but same 1033:0194 rev 03 chip IDs). I have no
other hosts available to try at this time.

It also resolves similar spam and subsequent stream lockup on one
particular pair of host adapter and device, which suffers intermittent
transaction errors for reasons currently unknown. With d104d0152a97f
reverted this device loses frames as expected but keeps going.

So this surely looks like the right thing to do with my NEC hosts, but
of course d104d0152a97f was done for a reason, which apparently is that
some (unspecified) other hosts really work differently.

Thanks,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection
  2024-01-14 14:06   ` Michał Pecio
@ 2024-01-15 13:58     ` Mathias Nyman
  2024-01-15 16:27       ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-15 13:58 UTC (permalink / raw)
  To: Michał Pecio, linux-usb; +Cc: Mathias Nyman

On 14.1.2024 16.06, Michał Pecio wrote:
> Hi Mathias,
> 
> I found that the code which causes my problems was specifically added
> by your commit d104d0152a97f ("xhci: fix isoc endpoint dequeue from
> advancing too far on transaction error").
> 
> Reverting this change removes the disconnection spam on my two NEC
> hosts (different boards but same 1033:0194 rev 03 chip IDs). I have no
> other hosts available to try at this time.
> 
> It also resolves similar spam and subsequent stream lockup on one
> particular pair of host adapter and device, which suffers intermittent
> transaction errors for reasons currently unknown. With d104d0152a97f
> reverted this device loses frames as expected but keeps going.
> 
> So this surely looks like the right thing to do with my NEC hosts, but
> of course d104d0152a97f was done for a reason, which apparently is that
> some (unspecified) other hosts really work differently.

Thanks for debugging this issue.

Most hosts trigger transaction error events for either every TRB as
xhci spec says, (xhci section 4.10.2) or only on the last TRB in a TD
which has the "interrupt on completion" IOC flag set.
  
If we revert d104d0152a97f and finish the td on the first error event
then we will get a similar "Transfer event TRB DMA ptr not part of current TD"
message for each remaing TRB in that TD that issues a error event.

So this case where only the fist TRB triggers an error event isn't yet
handled by the driver.

I'll take a look at it.
Can you try out some testpatches if I post them?

Thanks
Mathias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection
  2024-01-15 13:58     ` Mathias Nyman
@ 2024-01-15 16:27       ` Michał Pecio
  2024-01-16 15:36         ` [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD Mathias Nyman
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-15 16:27 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb, Mathias Nyman

Thanks for looking at this.

I recognize that the situation is tricky and simply reverting
d104d0152a97f is not a permanent option if other hosts need it.

Yes, I have set up a test machine to debug this and I can use it to
try potential solutions. If you have your own NEC uPD720200 specimen
you could reproduce it yourself, all it takes is unplugging any UVC
isochronous camera while recording with any software. This driver
queues plenty of buffers, so chances of hitting a split TD are good.


Regarding the spec, section 4.10.2 only states that an error shall
generate _a_ Transfer Event on current TRB regargdless of its flags
and that an error may occur on any TRB of a TD.

I think more relevant and useful is the final note in section 4.9.1:

> If an error is detected while processing a multi-TRB TD, the xHC shall
> generate a Transfer Event for the TRB that the error was detected on
> with the appropriate error Condition Code, then may advance to the next
> TD. If in the process of advancing to the next TD, a Transfer TRB is
> encountered with its IOC flag set, then the Condition Code of the
> Transfer Event generated for that Transfer TRB should be Success,
> because there was no error actually associated with the TRB that
> generated the Event. However, an xHC implementation may redundantly 
> assert the original error Condition Code.

To me it looks like the host is expected to skip the remaining TRBs
(section 4.10.2 also mentions continuing to the next ESIT on isoch EPs),
but subsequent text seems to assume that IOC flags will nevertheless be
honored by the xHC (NEC fails here). It is noteworthy that the host is
encouraged to respond "success", so the driver must remember earlier
errors anyway while waiting till the end of the TD.


Regards,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-15 16:27       ` Michał Pecio
@ 2024-01-16 15:36         ` Mathias Nyman
  2024-01-16 22:20           ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-16 15:36 UTC (permalink / raw)
  To: michal.pecio; +Cc: linux-usb, Mathias Nyman

The last TRB of a isoc TD might not trigger an event if there was
an error event for TRB mid TD. This is seen on a  NEC Corporation
uPD720200 USB 3.0 Host

Driver and xHC host get out of sync as driver is still expecting a
transfer event for that first TD, while xHC host is already sending
events for the next TD in the list. This leads to
"Transfer event TRB DMA ptr not part of current TD" messages.

As a solution we tag the isoc TDs that get error events mid TD.
If an event doesn't match the first TD, then check if the tag is
set, and event points to the next TD.
In that case give back the fist TD and process the next TD normally

Reported-by: Michał Pecio <michal.pecio@gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 55 +++++++++++++++++++++++++++---------
 drivers/usb/host/xhci.h      |  1 +
 2 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 33806ae966f9..4869005379f5 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2401,8 +2401,11 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	case COMP_USB_TRANSACTION_ERROR:
 		frame->status = -EPROTO;
-		if (ep_trb != td->last_trb)
+		if (ep_trb != td->last_trb) {
+			td->error_mid_td = true;
+			/* FIXME update actual_length */
 			return 0;
+		}
 		break;
 	case COMP_STOPPED:
 		sum_trbs_for_length = true;
@@ -2808,17 +2811,44 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 		}
 
 		if (!ep_seg) {
-			if (!ep->skip ||
-			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
-				/* Some host controllers give a spurious
-				 * successful event after a short transfer.
-				 * Ignore it.
-				 */
-				if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
-						ep_ring->last_td_was_short) {
-					ep_ring->last_td_was_short = false;
-					goto cleanup;
+
+			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
+				skip_isoc_td(xhci, td, ep, status);
+				goto cleanup;
+			}
+
+			/*
+			 * Some hosts give a spurious success event after a short
+			 * transfer. Ignore it.
+			 */
+			if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
+			    ep_ring->last_td_was_short) {
+				ep_ring->last_td_was_short = false;
+				goto cleanup;
+			}
+
+			/*
+			 * if there was an error event mid TD then host may not
+			 * give an event for the last TRB on an isoc TD.
+			 * This event can be for the next TD, See xHCI 4.9.1.
+			 */
+			if (td->error_mid_td) {
+				struct xhci_td *td_next = list_next_entry(td, td_list);
+
+				ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb,
+						   td_next->last_trb, ep_trb_dma, false);
+				if (ep_seg) {
+					/* give back previous TD, start handling new */
+					ep_ring->dequeue = td->last_trb;
+					ep_ring->deq_seg = td->last_trb_seg;
+					inc_deq(xhci, ep_ring);
+					xhci_td_cleanup(xhci, td, ep_ring, td->status);
+					td = td_next;
 				}
+
+			}
+
+			if (!ep_seg) {
 				/* HC is busted, give up! */
 				xhci_err(xhci,
 					"ERROR Transfer event TRB DMA ptr not "
@@ -2830,9 +2860,6 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 					  ep_trb_dma, true);
 				return -ESHUTDOWN;
 			}
-
-			skip_isoc_td(xhci, td, ep, status);
-			goto cleanup;
 		}
 		if (trb_comp_code == COMP_SHORT_PACKET)
 			ep_ring->last_td_was_short = true;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index a5c72a634e6a..6f82d404883f 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1549,6 +1549,7 @@ struct xhci_td {
 	struct xhci_segment	*bounce_seg;
 	/* actual_length of the URB has already been set */
 	bool			urb_length_set;
+	bool			error_mid_td;
 	unsigned int		num_trbs;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-16 15:36         ` [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD Mathias Nyman
@ 2024-01-16 22:20           ` Michał Pecio
  2024-01-17 10:46             ` Mathias Nyman
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-16 22:20 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

I applied your patch on v6.7 and it appears to be working. It removes
the disconnection spam and also handles intermittent transmission errors
on UVC without obvious glitches or errors messages, except one xhci_dbg
added to confirm that I'm really hitting this edge case.

Anything else that might be worth testing?


I have a question, though. What happens if there is no next TD because
a mid TD error has occured on the last packet queued by the client? Is
there any mechanism to retire that stuck TD on a NEC host which submits
one mid TD error event and then goes silent?

Would it be possible to retire the TD right after the first failed TRB?
(I imagine difficulties in determining when exactly the host has moved
its internal pointer past the remaining TRBs so they can be reused).


> -			if (!ep->skip ||
> -			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
> +			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
I like this. I would suggest another cleanup: the if(!ep_seg && stuf)
right above your change could be pulled inside if(!ep_seg).


> +			 * if there was an error event mid TD then host may not
> +			 * give an event for the last TRB on an isoc TD.
> +			 * This event can be for the next TD, See xHCI 4.9.1.
This seems to suggest that 4.9.1 encourages such behavior, but the
opposite is the case as far as I understand.


> +			if (td->error_mid_td) {
> +				struct xhci_td *td_next = list_next_entry(td, td_list);
This if needs && !list_is_last(&td->td_list, &ep_ring->td_list).

Otherwise a serious bug in the host (maybe in the driver too) tricks
us into grabbing a pointer to ep_ring instead, filling the subsequent
"TRB not part of current TD" message with mystifying garbage numbers.


> +				if (ep_seg) {
> +					/* give back previous TD, start handling new */
Suggested:
> +					xhci_dbg(xhci, "Missing completion event after mid TD error\n");


Thanks,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-16 22:20           ` Michał Pecio
@ 2024-01-17 10:46             ` Mathias Nyman
  2024-01-17 17:49               ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-17 10:46 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb

On 17.1.2024 0.20, Michał Pecio wrote:
> I applied your patch on v6.7 and it appears to be working. It removes
> the disconnection spam and also handles intermittent transmission errors
> on UVC without obvious glitches or errors messages, except one xhci_dbg
> added to confirm that I'm really hitting this edge case.
> 
> Anything else that might be worth testing?
> 
> 
> I have a question, though. What happens if there is no next TD because
> a mid TD error has occured on the last packet queued by the client? Is
> there any mechanism to retire that stuck TD on a NEC host which submits
> one mid TD error event and then goes silent?

In disconnect cases usb core should flush the remaining URBs once
roothub code notices the disconnect.

But yes, if the last TD in a URB is a multi TRB isoc TD, and it has an error
MID TD then its stuck until timeout.

> 
> Would it be possible to retire the TD right after the first failed TRB?
> (I imagine difficulties in determining when exactly the host has moved
> its internal pointer past the remaining TRBs so they can be reused).

Probably not as a normal error handling routine.
We have the same "Transfer event TRB DMA ptr not part of current TD" issue
for hosts that do issue an event for the last TRB.

If the TD is given back immediately we also have a memory issue as the
DMA address pointed to by that last TRB might be accessed by the controller
_after_ driver gave back the TD, and possibly freed/unmapped it.

But for that special case where there are no more TDs queued it might
make sense

> 
> 
>> -			if (!ep->skip ||
>> -			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
>> +			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
> I like this. I would suggest another cleanup: the if(!ep_seg && stuf)
> right above your change could be pulled inside if(!ep_seg).

Noticed the same, but for stable kernel reasons it's probably better to limit
this patch to mostly fixing this bug.

> 
> 
>> +			 * if there was an error event mid TD then host may not
>> +			 * give an event for the last TRB on an isoc TD.
>> +			 * This event can be for the next TD, See xHCI 4.9.1.
> This seems to suggest that 4.9.1 encourages such behavior, but the
> opposite is the case as far as I understand.

I'll rephrase this.

> 
> 
>> +			if (td->error_mid_td) {
>> +				struct xhci_td *td_next = list_next_entry(td, td_list);
> This if needs && !list_is_last(&td->td_list, &ep_ring->td_list).

Thanks, nice catch, good point.

> 
> Otherwise a serious bug in the host (maybe in the driver too) tricks
> us into grabbing a pointer to ep_ring instead, filling the subsequent
> "TRB not part of current TD" message with mystifying garbage numbers.
> 
> 
>> +				if (ep_seg) {
>> +					/* give back previous TD, start handling new */
> Suggested:
>> +					xhci_dbg(xhci, "Missing completion event after mid TD error\n");

Makes sense.

Thanks
Mathias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-17 10:46             ` Mathias Nyman
@ 2024-01-17 17:49               ` Michał Pecio
  2024-01-18 11:00                 ` Isochronous error handling bug on VIA VL805 Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-17 17:49 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

> But yes, if the last TD in a URB is a multi TRB isoc TD, and it has
> an error MID TD then its stuck until timeout.

If there are timeouts to deal with hosts failing to respond then that's
good enough for me. It should be a rare case anyway.

I just don't like when stuff locks up forever and requires reconnection
or reboot to clean up.


> > Would it be possible to retire the TD right after the first failed
> > TRB?
> 
> Probably not as a normal error handling routine.
> We have the same "Transfer event TRB DMA ptr not part of current TD"
> issue for hosts that do issue an event for the last TRB.

Obviously it would require keeping some information about the retired
TD to detect subsequent completions and to prevent freeing of its
remaining TRBs. Probably much more effort than the current approach.
 
> But for that special case where there are no more TDs queued it might
> make sense

Wouldn't it still require remembering not to free the TRBs too fast?
(It seems to have worked in the early days, but feels dodgy).


> +					xhci_dbg(xhci, "Missing completion event after mid TD error\n");
On second thought, this could also print ep_trb_dma to be useful in
debugging "TRB not part of current TD" issues. Although anyone able to
compile the kernel with DYNAMIC_DEBUG could also edit as necessary...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Isochronous error handling bug on VIA VL805
  2024-01-17 17:49               ` Michał Pecio
@ 2024-01-18 11:00                 ` Michał Pecio
  2024-01-18 11:10                   ` Michał Pecio
  2024-01-18 13:54                   ` Mathias Nyman
  0 siblings, 2 replies; 21+ messages in thread
From: Michał Pecio @ 2024-01-18 11:00 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

I concurrently executed plan B for dealing with my NEC issues, which is
to simply order a host controller with other chip (VIA), and now I have
two controllers and three problems.

One is that it sometimes "dies" and requires a reboot, probably nothing
Linux can do about it.

The other is that it reports successful completion of the final TRB and
the driver overwrites frame->status set by prior errors. This seems to
only affect isochronous again, though I'm not 100% sure.


This change on top of yours takes care of it for transaction errors.

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index d2fe1f073e38..fce67493dfdf 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2375,6 +2375,12 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 	/* handle completion code */
 	switch (trb_comp_code) {
 	case COMP_SUCCESS:
+		/* Check for earlier errors; see xHCI 4.9.1 */
+		if (td->error_mid_td) {
+			xhci_info(xhci, "Got SUCCESS after mid TD error\n");
+			/* don't overwrite previously assigned status */
+			break;
+		}
 		if (remaining) {
 			frame->status = short_framestatus;
 			if (xhci->quirks & XHCI_TRUST_TX_LENGTH)


Interesting questions arise:

1. Should this be a separate patch?

2. Are there any errors that may need error_mid_td treatment on NEC?
Maybe BABBLE or others, unfortunately it isn't easy for me to produce
them and see what happens.

3. Are there any errors that may need "not success" treatment on VIA?
Any chance that these error sets aren't equal and I can't get away with
reusing your error_mid_td flag here?

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: Isochronous error handling bug on VIA VL805
  2024-01-18 11:00                 ` Isochronous error handling bug on VIA VL805 Michał Pecio
@ 2024-01-18 11:10                   ` Michał Pecio
  2024-01-18 13:54                   ` Mathias Nyman
  1 sibling, 0 replies; 21+ messages in thread
From: Michał Pecio @ 2024-01-18 11:10 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

> 2. Are there any errors that may need error_mid_td treatment on NEC?
> 3. Are there any errors that may need "not success" treatment on VIA?

Sorry, I meant to write "any other errors", of course.
For VIA, the likely answer is "all errors" or "almost all".


And I forgot to add that while NEC seems to violate the spec, VIA
follows 4.9.1 exactly and other hosts are likely to do the same.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Isochronous error handling bug on VIA VL805
  2024-01-18 11:00                 ` Isochronous error handling bug on VIA VL805 Michał Pecio
  2024-01-18 11:10                   ` Michał Pecio
@ 2024-01-18 13:54                   ` Mathias Nyman
  2024-01-18 13:56                     ` [RFT PATCH v2] xhci: process isoc TD properly when there was an error mid TD Mathias Nyman
  1 sibling, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-18 13:54 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb

On 18.1.2024 13.00, Michał Pecio wrote:
> I concurrently executed plan B for dealing with my NEC issues, which is
> to simply order a host controller with other chip (VIA), and now I have
> two controllers and three problems.
> 
> One is that it sometimes "dies" and requires a reboot, probably nothing
> Linux can do about it.
> 
> The other is that it reports successful completion of the final TRB and
> the driver overwrites frame->status set by prior errors. This seems to
> only affect isochronous again, though I'm not 100% sure.
> 
> 
> This change on top of yours takes care of it for transaction errors.
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index d2fe1f073e38..fce67493dfdf 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -2375,6 +2375,12 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
>   	/* handle completion code */
>   	switch (trb_comp_code) {
>   	case COMP_SUCCESS:
> +		/* Check for earlier errors; see xHCI 4.9.1 */
> +		if (td->error_mid_td) {
> +			xhci_info(xhci, "Got SUCCESS after mid TD error\n");
> +			/* don't overwrite previously assigned status */
> +			break;
> +		}
>   		if (remaining) {
>   			frame->status = short_framestatus;
>   			if (xhci->quirks & XHCI_TRUST_TX_LENGTH)
> 
> 
> Interesting questions arise:
> 
> 1. Should this be a separate patch?

Added this to the original patch.

> 
> 2. Are there any errors that may need error_mid_td treatment on NEC?
> Maybe BABBLE or others, unfortunately it isn't easy for me to produce
> them and see what happens.

Need to go through these, Isoc transfers should have as many error options as other types.

> 
> 3. Are there any errors that may need "not success" treatment on VIA?
> Any chance that these error sets aren't equal and I can't get away with
> reusing your error_mid_td flag here?
> 

Need to look at this as well after getting the first patch done.

I did play with the "error on last TD case",  patch for that is in a temp branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_missing_td_event
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_missing_td_event&id=681415c02f95f6615a4d497df4b202a4f1fb0cc1

But it might just add more code without any real world benefit so I
think I won't send it upstream.

Thanks
Mathias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFT PATCH v2] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-18 13:54                   ` Mathias Nyman
@ 2024-01-18 13:56                     ` Mathias Nyman
  2024-01-18 22:16                       ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-18 13:56 UTC (permalink / raw)
  To: michal.pecio; +Cc: linux-usb, Mathias Nyman

The last TRB of a isoc TD might not trigger an event if there was
an error event for TRB mid TD. This is seen on a NEC Corporation
uPD720200 USB 3.0 Host

After an error mid a multi-TRB TD the xHC should according to xhci 4.9.1
generate events for passed TRBs with IOC flag set if it proceeds to the
next TD. This event is either a copy of the original error, or a
"success" transfer event.

If that event is missing then the driver and xHC host get out of sync
as driver is still expecting a transfer event for that first TD, while
xHC host is already sending events for the next TD in the list.
This leads to
"Transfer event TRB DMA ptr not part of current TD" messages.

As a solution we tag the isoc TDs that get error events mid TD.
If an event doesn't match the first TD, then check if the tag is
set, and event points to the next TD.
In that case give back the fist TD and process the next TD normally

Make sure TD status and tranferred length stay valid in all cases.

Reported-by: Michał Pecio <michal.pecio@gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
  v1 -> v2:
  - Check for empty TD list, suggested by Michał
  - Don't overwrite error status, suggested by Michał
  - Add a debug print for missed TD completion event, suggested by Michał
  - Reword commit message and code comments
  - Set correct TD transfer length in isoc error case

drivers/usb/host/xhci-ring.c | 81 +++++++++++++++++++++++++++++-------
 drivers/usb/host/xhci.h      |  1 +
 2 files changed, 66 insertions(+), 16 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 33806ae966f9..9f0f694358cc 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2376,6 +2376,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 	/* handle completion code */
 	switch (trb_comp_code) {
 	case COMP_SUCCESS:
+		/* Don't overwrite status if TD had an error, see xHCI 4.9.1 */
+		if (td->error_mid_td)
+			break;
 		if (remaining) {
 			frame->status = short_framestatus;
 			if (xhci->quirks & XHCI_TRUST_TX_LENGTH)
@@ -2401,9 +2404,20 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	case COMP_USB_TRANSACTION_ERROR:
 		frame->status = -EPROTO;
-		if (ep_trb != td->last_trb)
-			return 0;
-		break;
+		sum_trbs_for_length = true;
+
+		if (ep_trb == td->last_trb)
+			break;
+
+		/* Error mid TD, don't give TD back yet */
+		td->error_mid_td = true;
+		td->urb_length_set = true;
+
+		frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) +
+			ep_trb_len - remaining;
+		td->urb->actual_length += frame->actual_length;
+		return 0;
+
 	case COMP_STOPPED:
 		sum_trbs_for_length = true;
 		break;
@@ -2422,6 +2436,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	}
 
+	if (td->urb_length_set)
+		goto finish_td;
+
 	if (sum_trbs_for_length)
 		frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) +
 			ep_trb_len - remaining;
@@ -2430,6 +2447,7 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 
 	td->urb->actual_length += frame->actual_length;
 
+finish_td:
 	return finish_td(xhci, ep, ep_ring, td, trb_comp_code);
 }
 
@@ -2808,17 +2826,51 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 		}
 
 		if (!ep_seg) {
-			if (!ep->skip ||
-			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
-				/* Some host controllers give a spurious
-				 * successful event after a short transfer.
-				 * Ignore it.
-				 */
-				if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
-						ep_ring->last_td_was_short) {
-					ep_ring->last_td_was_short = false;
-					goto cleanup;
+
+			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
+				skip_isoc_td(xhci, td, ep, status);
+				goto cleanup;
+			}
+
+			/*
+			 * Some hosts give a spurious success event after a short
+			 * transfer. Ignore it.
+			 */
+			if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
+			    ep_ring->last_td_was_short) {
+				ep_ring->last_td_was_short = false;
+				goto cleanup;
+			}
+
+			/*
+			 * xhci 4.10.2 states isoc endpoints should continue
+			 * processing the next TD if there was an error mid TD.
+			 * So host like NEC don't generate an event for the last
+			 * isoc TRB even if the IOC flag is set.
+			 * xhci 4.9.1 states that if there are errors in mult-TRB
+			 * TDs xHC should generate an error for that TRB, and if xHC
+			 * proceeds to the next TD it should genete an event for
+			 * any TRB with IOC flag on the way. Other host follow this.
+			 * So this event might be for the next TD.
+			 */
+			if (td->error_mid_td &&
+			    !list_is_last(&td->td_list, &ep_ring->td_list)) {
+				struct xhci_td *td_next = list_next_entry(td, td_list);
+
+				ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb,
+						   td_next->last_trb, ep_trb_dma, false);
+				if (ep_seg) {
+					/* give back previous TD, start handling new */
+					xhci_dbg(xhci, "Missing TD completion event after mid TD error\n");
+					ep_ring->dequeue = td->last_trb;
+					ep_ring->deq_seg = td->last_trb_seg;
+					inc_deq(xhci, ep_ring);
+					xhci_td_cleanup(xhci, td, ep_ring, td->status);
+					td = td_next;
 				}
+			}
+
+			if (!ep_seg) {
 				/* HC is busted, give up! */
 				xhci_err(xhci,
 					"ERROR Transfer event TRB DMA ptr not "
@@ -2830,9 +2882,6 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 					  ep_trb_dma, true);
 				return -ESHUTDOWN;
 			}
-
-			skip_isoc_td(xhci, td, ep, status);
-			goto cleanup;
 		}
 		if (trb_comp_code == COMP_SHORT_PACKET)
 			ep_ring->last_td_was_short = true;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index a5c72a634e6a..6f82d404883f 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1549,6 +1549,7 @@ struct xhci_td {
 	struct xhci_segment	*bounce_seg;
 	/* actual_length of the URB has already been set */
 	bool			urb_length_set;
+	bool			error_mid_td;
 	unsigned int		num_trbs;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFT PATCH v2] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-18 13:56                     ` [RFT PATCH v2] xhci: process isoc TD properly when there was an error mid TD Mathias Nyman
@ 2024-01-18 22:16                       ` Michał Pecio
  2024-01-19 10:49                         ` Mathias Nyman
  0 siblings, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-18 22:16 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

My usual set of tests passes:
- no spam on disconnection from NEC
- no stream lockup on random errors on NEC
- no spam on disconnection from VIA
- finish_td called with right frame->status on VIA
  (checked by means of extra printks)

> +		/* Error mid TD, don't give TD back yet */
> +		td->error_mid_td = true;
> +		td->urb_length_set = true;
> +
> +		frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) +
> +			ep_trb_len - remaining;
Not a problem with this patch, but I noticed that every single use of
this function ends up adding ep_trb_len, maybe it could be inclusive.

> +		td->urb->actual_length += frame->actual_length;
In your first email you mentioned hosts responding to every single TRB,
perhaps with the same error code repeated each time?

I imagine it could be problematic here if such hosts really exist and
if there are enough TRBs to execute this line twice. A check for the
error_mid_td bit previously set could help, if this is a real risk.


Thanks,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFT PATCH v2] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-18 22:16                       ` Michał Pecio
@ 2024-01-19 10:49                         ` Mathias Nyman
  2024-01-19 10:58                           ` [RTF PATCH v3] " Mathias Nyman
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-19 10:49 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb

On 19.1.2024 0.16, Michał Pecio wrote:
> My usual set of tests passes:
> - no spam on disconnection from NEC
> - no stream lockup on random errors on NEC
> - no spam on disconnection from VIA
> - finish_td called with right frame->status on VIA
>    (checked by means of extra printks)
> 
>> +		/* Error mid TD, don't give TD back yet */
>> +		td->error_mid_td = true;
>> +		td->urb_length_set = true;
>> +
>> +		frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) +
>> +			ep_trb_len - remaining;
> Not a problem with this patch, but I noticed that every single use of
> this function ends up adding ep_trb_len, maybe it could be inclusive.
> 
>> +		td->urb->actual_length += frame->actual_length;
> In your first email you mentioned hosts responding to every single TRB,
> perhaps with the same error code repeated each time?
> 
> I imagine it could be problematic here if such hosts really exist and
> if there are enough TRBs to execute this line twice. A check for the
> error_mid_td bit previously set could help, if this is a real risk.

Good point, refactored that code a bit and it now both looks nicer and should
solve this case.

One more patchround

Thanks
Mathias



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RTF PATCH v3] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-19 10:49                         ` Mathias Nyman
@ 2024-01-19 10:58                           ` Mathias Nyman
  2024-01-19 21:54                             ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-19 10:58 UTC (permalink / raw)
  To: michal.pecio; +Cc: linux-usb, Mathias Nyman

The last TRB of a isoc TD might not trigger an event if there was
an error event for a TRB mid TD. This is seen on a NEC Corporation
uPD720200 USB 3.0 Host

After an error mid a multi-TRB TD the xHC should according to xhci 4.9.1
generate events for passed TRBs with IOC flag set if it proceeds to the
next TD. This event is either a copy of the original error, or a
"success" transfer event.

If that event is missing then the driver and xHC host get out of sync as
the driver is still expecting a transfer event for that first TD, while
xHC host is already sending events for the next TD in the list.
This leads to
"Transfer event TRB DMA ptr not part of current TD" messages.

As a solution we tag the isoc TDs that get error events mid TD.
If an event doesn't match the first TD, then check if the tag is
set, and event points to the next TD.
In that case give back the fist TD and process the next TD normally

Make sure TD status and transferred length stay valid in both cases
with and without final TD completion event.

Reported-by: Michał Pecio <michal.pecio@gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
  v2 -> v3
  - Rework and fix setting TD transfer lengt issue in v2
  - Minor commit message tuning 

  v1 -> v2:
  - Check for empty TD list, suggested by Michał
  - Don't overwrite error status, suggested by Michał
  - Add a debug print for missed TD completion event, suggested by Michał
  - Reword commit message and code comments
  - Set correct TD transfer length in isoc error case


drivers/usb/host/xhci-ring.c | 74 +++++++++++++++++++++++++++++-------
 drivers/usb/host/xhci.h      |  1 +
 2 files changed, 61 insertions(+), 14 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 33806ae966f9..41be7d31a36e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2376,6 +2376,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 	/* handle completion code */
 	switch (trb_comp_code) {
 	case COMP_SUCCESS:
+		/* Don't overwrite status if TD had an error, see xHCI 4.9.1 */
+		if (td->error_mid_td)
+			break;
 		if (remaining) {
 			frame->status = short_framestatus;
 			if (xhci->quirks & XHCI_TRUST_TX_LENGTH)
@@ -2401,8 +2404,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	case COMP_USB_TRANSACTION_ERROR:
 		frame->status = -EPROTO;
+		sum_trbs_for_length = true;
 		if (ep_trb != td->last_trb)
-			return 0;
+			td->error_mid_td = true;
 		break;
 	case COMP_STOPPED:
 		sum_trbs_for_length = true;
@@ -2422,6 +2426,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	}
 
+	if (td->urb_length_set)
+		goto finish_td;
+
 	if (sum_trbs_for_length)
 		frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) +
 			ep_trb_len - remaining;
@@ -2430,6 +2437,14 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 
 	td->urb->actual_length += frame->actual_length;
 
+finish_td:
+	/* Don't give back TD yet if we encountered an error mid TD */
+	if (td->error_mid_td && ep_trb != td->last_trb) {
+		xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n");
+		td->urb_length_set = true;
+		return 0;
+	}
+
 	return finish_td(xhci, ep, ep_ring, td, trb_comp_code);
 }
 
@@ -2808,17 +2823,51 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 		}
 
 		if (!ep_seg) {
-			if (!ep->skip ||
-			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
-				/* Some host controllers give a spurious
-				 * successful event after a short transfer.
-				 * Ignore it.
-				 */
-				if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
-						ep_ring->last_td_was_short) {
-					ep_ring->last_td_was_short = false;
-					goto cleanup;
+
+			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
+				skip_isoc_td(xhci, td, ep, status);
+				goto cleanup;
+			}
+
+			/*
+			 * Some hosts give a spurious success event after a short
+			 * transfer. Ignore it.
+			 */
+			if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
+			    ep_ring->last_td_was_short) {
+				ep_ring->last_td_was_short = false;
+				goto cleanup;
+			}
+
+			/*
+			 * xhci 4.10.2 states isoc endpoints should continue
+			 * processing the next TD if there was an error mid TD.
+			 * So host like NEC don't generate an event for the last
+			 * isoc TRB even if the IOC flag is set.
+			 * xhci 4.9.1 states that if there are errors in mult-TRB
+			 * TDs xHC should generate an error for that TRB, and if xHC
+			 * proceeds to the next TD it should genete an event for
+			 * any TRB with IOC flag on the way. Other host follow this.
+			 * So this event might be for the next TD.
+			 */
+			if (td->error_mid_td &&
+			    !list_is_last(&td->td_list, &ep_ring->td_list)) {
+				struct xhci_td *td_next = list_next_entry(td, td_list);
+
+				ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb,
+						   td_next->last_trb, ep_trb_dma, false);
+				if (ep_seg) {
+					/* give back previous TD, start handling new */
+					xhci_dbg(xhci, "Missing TD completion event after mid TD error\n");
+					ep_ring->dequeue = td->last_trb;
+					ep_ring->deq_seg = td->last_trb_seg;
+					inc_deq(xhci, ep_ring);
+					xhci_td_cleanup(xhci, td, ep_ring, td->status);
+					td = td_next;
 				}
+			}
+
+			if (!ep_seg) {
 				/* HC is busted, give up! */
 				xhci_err(xhci,
 					"ERROR Transfer event TRB DMA ptr not "
@@ -2830,9 +2879,6 @@ static int handle_tx_event(struct xhci_hcd *xhci,
 					  ep_trb_dma, true);
 				return -ESHUTDOWN;
 			}
-
-			skip_isoc_td(xhci, td, ep, status);
-			goto cleanup;
 		}
 		if (trb_comp_code == COMP_SHORT_PACKET)
 			ep_ring->last_td_was_short = true;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index a5c72a634e6a..6f82d404883f 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1549,6 +1549,7 @@ struct xhci_td {
 	struct xhci_segment	*bounce_seg;
 	/* actual_length of the URB has already been set */
 	bool			urb_length_set;
+	bool			error_mid_td;
 	unsigned int		num_trbs;
 };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RTF PATCH v3] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-19 10:58                           ` [RTF PATCH v3] " Mathias Nyman
@ 2024-01-19 21:54                             ` Michał Pecio
  2024-01-22  9:03                               ` Michał Pecio
  2024-01-22 11:47                               ` Mathias Nyman
  0 siblings, 2 replies; 21+ messages in thread
From: Michał Pecio @ 2024-01-19 21:54 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

Usual tests passed (but only transaction errors are covered).


I noticed that with your new fix to the frame length bug, error_mid_td
always equals urb_length_set. So the new flag is perhaps not necessary
after all. The test in handle_tx_event() could be:

if (usb_pipeisoc(td->urb->pipe) && td->urb_length_set &&
    !list_is_last(...  ))

I tried it and it works just as well.


While disconnecting my test camera from the VIA, I got this:

[95511.647441] xhci_hcd 0000:03:00.0: finishing TD with status -75 (comp_code 3 error_mid_td 0)
[95511.647453] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[95511.647457] xhci_hcd 0000:03:00.0: Looking for event-dma 0000000135121710 trb-start 0000000135121720 trb-end 0000000135121720 seg-start 0000000135121000 seg-end 0000000135121ff0
... followed by some ordinary transaction errors 2 milliseconds later.

Apparently a babble error, and it seems to have generated a "success"
which the event handler tried to match with the next TD. So a mid TD
babble may need the same treatment, which is not surprising.

Unfortunately I didn't print enough debug info to confirm this with
certainty and it is absolutely unreproducible, I have no idea why it
happened.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RTF PATCH v3] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-19 21:54                             ` Michał Pecio
@ 2024-01-22  9:03                               ` Michał Pecio
  2024-01-22 13:37                                 ` Mathias Nyman
  2024-01-22 11:47                               ` Mathias Nyman
  1 sibling, 1 reply; 21+ messages in thread
From: Michał Pecio @ 2024-01-22  9:03 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

> Apparently a babble error, and it seems to have generated a "success"
> which the event handler tried to match with the next TD. So a mid TD
> babble may need the same treatment, which is not surprising.

This is now confirmed and fixed here. The change is obvious enough:
        case COMP_ISOCH_BUFFER_OVERRUN:
        case COMP_BABBLE_DETECTED_ERROR:
+               error_mid_td = true;
                frame->status = -EOVERFLOW;
                break;

I don't know yet what COMP_ISOCH_BUFFER_OVERRUN means, but I guess it's
the same story. BTW, error_mid_td is a local variable now and I use the
urb_length_set flag instead, as explained before.

I found that it can be reproduced on the VIA host, with enough tries it
can happen even on a chained TD. NEC doesn't signal these babble errors
but new mid TD event handling should cope with either host.

Debug trace ("interesting" is other than "success" or "short packet"):
[ 4113.376349] xhci_hcd 0000:03:00.0: handle_tx_event interesting ep_trb_dma 132961000 comp_code 3 slot 2 ep 2
[ 4113.376361] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
[ 4113.376364] xhci_hcd 0000:03:00.0: Error mid isoc TD, wait for final completion event
[ 4113.376366] xhci_hcd 0000:03:00.0: handle_tx_event uninteresting ep_trb_dma 132961010 comp_code 1 slot 2 ep 2
[ 4113.376369] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
[ 4113.376371] xhci_hcd 0000:03:00.0: Got SUCCESS after mid TD error
[ 4113.376373] xhci_hcd 0000:03:00.0: finish_td comp_code 1 status -75


Thanks,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RTF PATCH v3] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-19 21:54                             ` Michał Pecio
  2024-01-22  9:03                               ` Michał Pecio
@ 2024-01-22 11:47                               ` Mathias Nyman
  1 sibling, 0 replies; 21+ messages in thread
From: Mathias Nyman @ 2024-01-22 11:47 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb

On 19.1.2024 23.54, Michał Pecio wrote:
> Usual tests passed (but only transaction errors are covered).

Thanks for testing, I'l add your a tested-by tag to the patch.

> 
> 
> I noticed that with your new fix to the frame length bug, error_mid_td
> always equals urb_length_set. So the new flag is perhaps not necessary
> after all. The test in handle_tx_event() could be:
> 
> if (usb_pipeisoc(td->urb->pipe) && td->urb_length_set &&
>      !list_is_last(...  ))
> 
> I tried it and it works just as well.

This is probably true here, but urb_length_set and error_mid_td have a bit
different idea behind them

For example short control transfers may send an event mid TD (during data stage).
There is no error, we wait for the completion of the status stage but don't want
to change the transfer length set in the urb.

I suspect we will need to play with both flags for other error cases in the future.

Thanks
Mathias


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RTF PATCH v3] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-22  9:03                               ` Michał Pecio
@ 2024-01-22 13:37                                 ` Mathias Nyman
  2024-01-22 17:10                                   ` Michał Pecio
  0 siblings, 1 reply; 21+ messages in thread
From: Mathias Nyman @ 2024-01-22 13:37 UTC (permalink / raw)
  To: Michał Pecio; +Cc: linux-usb

On 22.1.2024 11.03, Michał Pecio wrote:
>> Apparently a babble error, and it seems to have generated a "success"
>> which the event handler tried to match with the next TD. So a mid TD
>> babble may need the same treatment, which is not surprising.

Makes sense.

> 
> This is now confirmed and fixed here. The change is obvious enough:
>          case COMP_ISOCH_BUFFER_OVERRUN:
>          case COMP_BABBLE_DETECTED_ERROR:
> +               error_mid_td = true;
>                  frame->status = -EOVERFLOW;
>                  break;
> 
> I don't know yet what COMP_ISOCH_BUFFER_OVERRUN means, but I guess it's
> the same story. BTW, error_mid_td is a local variable now and I use the
> urb_length_set flag instead, as explained before.

To me it looks like COMP_BABBLE_DETECTED_ERROR and COMP_ISOCH_BUFFER_OVERRUN
have the same cause, device is sending too much data.

Isoc endpoints should use COMP_ISOCH_BUFFER_OVERRUN to indicate endpoint
hasn't halted like it does in the COMP_BABBLE_DETECTED_ERROR case.

See xhci 6.4.5 "TRB completion codes" Footnote 115

> 
> I found that it can be reproduced on the VIA host, with enough tries it
> can happen even on a chained TD. NEC doesn't signal these babble errors
> but new mid TD event handling should cope with either host.

So looks like VIA host incorrectly sends babble for Isoc endpoints

> 
> Debug trace ("interesting" is other than "success" or "short packet"):
> [ 4113.376349] xhci_hcd 0000:03:00.0: handle_tx_event interesting ep_trb_dma 132961000 comp_code 3 slot 2 ep 2
> [ 4113.376361] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
> [ 4113.376364] xhci_hcd 0000:03:00.0: Error mid isoc TD, wait for final completion event
> [ 4113.376366] xhci_hcd 0000:03:00.0: handle_tx_event uninteresting ep_trb_dma 132961010 comp_code 1 slot 2 ep 2
> [ 4113.376369] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
> [ 4113.376371] xhci_hcd 0000:03:00.0: Got SUCCESS after mid TD error
> [ 4113.376373] xhci_hcd 0000:03:00.0: finish_td comp_code 1 status -75

I'm afraid we end up tuning that original patch forever with these new findings,
so lets make this into a new patch on top of the previous one.
That one is tested and known to work in the transaction error case.

Let me know if you want to write it, otherwise I will

Thanks
Mathias

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RTF PATCH v3] xhci: process isoc TD properly when there was an error mid TD.
  2024-01-22 13:37                                 ` Mathias Nyman
@ 2024-01-22 17:10                                   ` Michał Pecio
  0 siblings, 0 replies; 21+ messages in thread
From: Michał Pecio @ 2024-01-22 17:10 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: linux-usb

> I'm afraid we end up tuning that original patch forever
Point taken. This patch is more important, as it fixes total failure
to deal with trivial bit errors on uPD720200.


> Isoc endpoints should use COMP_ISOCH_BUFFER_OVERRUN to indicate
> endpoint hasn't halted like it does in the COMP_BABBLE_DETECTED_ERROR
> case.
> 
> See xhci 6.4.5 "TRB completion codes" Footnote 115
This applies to "TD Babble", which is:
submitted TD size < data from the device <= ESIT payload

Other babble conditions exist related to blatant device bugs (4.10.2.4)
and it looks like some may result in a halt or device disconnection even
on isoch endpoints(?)

Other babbles aren't included in "isoch buffer overrun" per 6.4.5.

I feel like one would need to systematically go through the whole spec
to avoid missing important details or special cases.


Thanks,
Michal

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-01-22 17:10 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-12 22:52 "Transfer event TRB DMA ptr not part of current TD" spam after USB disconnection Michał Pecio
2024-01-13 20:47 ` Michał Pecio
2024-01-14 14:06   ` Michał Pecio
2024-01-15 13:58     ` Mathias Nyman
2024-01-15 16:27       ` Michał Pecio
2024-01-16 15:36         ` [RFT PATCH] xhci: process isoc TD properly when there was an error mid TD Mathias Nyman
2024-01-16 22:20           ` Michał Pecio
2024-01-17 10:46             ` Mathias Nyman
2024-01-17 17:49               ` Michał Pecio
2024-01-18 11:00                 ` Isochronous error handling bug on VIA VL805 Michał Pecio
2024-01-18 11:10                   ` Michał Pecio
2024-01-18 13:54                   ` Mathias Nyman
2024-01-18 13:56                     ` [RFT PATCH v2] xhci: process isoc TD properly when there was an error mid TD Mathias Nyman
2024-01-18 22:16                       ` Michał Pecio
2024-01-19 10:49                         ` Mathias Nyman
2024-01-19 10:58                           ` [RTF PATCH v3] " Mathias Nyman
2024-01-19 21:54                             ` Michał Pecio
2024-01-22  9:03                               ` Michał Pecio
2024-01-22 13:37                                 ` Mathias Nyman
2024-01-22 17:10                                   ` Michał Pecio
2024-01-22 11:47                               ` Mathias Nyman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.