Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1

From: Mathias Nyman <mathias.nyman@linux.intel.com>
To: Fabian Melzow <fabian.melzow@gmail.com>, linux-usb@vger.kernel.org
Subject: Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
Date: Mon, 29 Jun 2020 20:47:24 +0300	[thread overview]
Message-ID: <264e8287-b538-0798-36a6-7eafc4387a8d@linux.intel.com> (raw)
In-Reply-To: <20200620211913.1535bac0@ping>

Hi

On 20.6.2020 22.19, Fabian Melzow wrote:
> Hi folks!
> 
> After running the first scanimage I get the following errors/warnings:
> 
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Successful Set TR Deq Ptr cmd, deq = @d7615080
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Set TR Deq Ptr cmd, new deq seg = 00000000de74ec7f (0xd7615000 dma), new deq ptr = 00000000ff17b5e1 (0xd7615080 dma), new cycle = 1
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: New dequeue pointer = 0xd7615080 (DMA)
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: New dequeue segment = 00000000de74ec7f (virtual)
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Cycle state = 0x1
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Finding endpoint context
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Removing canceled TD starting at 0xd7615070 (dma).
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Stopped on Transfer TRB for slot 1 ep 4
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Cancel URB 00000000fa372bb8, dev 3, ep 0x82, starting at offset 0xd7615070
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xdc707028 flags=0x0020]
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: Looking for event-dma 00000000d7615060 trb-start 00000000d7615070 trb-end 00000000d7615070 seg-start 00000000d7615000 seg-end 00000000d7615ff0
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc9b4c730 flags=0x0020]
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: ep 0x82 - asked for 1 bytes, 0 bytes untransferred
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Slot state = 3, EP state = 2
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Set TR Deq Ptr cmd, new deq seg = 00000000de74ec7f (0xd7615000 dma), new deq ptr = 00000000f035f701 (0xd7615060 dma), new cycle = 1
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: New dequeue pointer = 0xd7615060 (DMA)
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: New dequeue segment = 00000000de74ec7f (virtual)
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Cycle state = 0x1
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Finding endpoint context
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Removing canceled TD starting at 0xd7615050 (dma).
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Cancel URB 000000002fee127b, dev 3, ep 0x82, starting at offset 0xd7615050
> > 
> The complete output of the above script along lspci -vvv, lsusb -v
> and filtered journal (kernel messages and script markings) can be
> downloaded from
> 
> http://fabi.bplaced.net/debuglogs1.tar.xz

I had a look at both the traces and dmesg. 
First issue I see is that the attempt to recover from a transaction error with a soft retry
isn't working. We expect the hardware to retry the transfer but nothing seems to happen.
Soft retry is described in xhci specs 4.6.8.1 and is basically a reset endpoint command with TSP set,
followed by ringing the endpoint doorbell. Traces indicate driver does this correctly but hardware
isn't retrying. We get don't get any event, no error, success or stall.

This could be hardware flaw.
Any chance you could try this on a xHC from some other vendor?

Second issue is a driver flaw, when nothing happened for 20 seconds we see the URB is canceled.
xhci driver needs to stop then endpoint to cancel the URB, but there is a hw race and endpoint
ends up halted instead of stopped. The xhci driver can't handle a halted endpoint in its stop endpoint
handler properly, and the URB is never actually removed from the ring.

The reason you see the IO_PAGE_FAULT is probably because once the ring starts running the driver will
handle the cancelled URB, and touch already freed memory:  
AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xdc707028 flags=0x0020]

I have a patch for this second case, I haven't upstreamed it as it got some conflicting feedback earlier.
It won't solve the 20 second delay, but should solve the the IO_PAGE_FAULT and the
"WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state" message

Can you try it out?

From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue, 7 Jan 2020 16:12:17 +0200
Subject: [PATCH] xhci: fix halted endpoint at stop endpoint command completion

xhci 4.6.9: A Busy endpoint may asynchronously transition from the
Running to the Halted or Error state due to error conditions detected
while processing TRBs. A possible race condition may occur if software,
thinking an endpoint is in the Running state, issues a Stop Endpoint
Command however at the same time the xHC asynchronously transitions
the endpoint to the Halted or Error state. In this case, a Context
State Error may be generated for the command completion. Software
may verify that this case occurred by inspecting the EP State for
Halted or Error when a Stop Endpoint Command results in a Context
State Error.

Fix this case by resetting the halted endpoint after cleaning
up the canceled trbs from the ring.
If the TRB we halted on was canceled then queue a new set TR dequeue
pointer command as usually.
If it wasn't canceled then move past it with a set TR dequeue pointer
and give it back with -EPIPE status as in a normal halted endpoint case.

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 56 ++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 5c223e92b8db..ceb3fac3f1c9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -745,11 +745,13 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	struct xhci_ring *ep_ring;
 	struct xhci_virt_ep *ep;
 	struct xhci_td *cur_td = NULL;
+	struct xhci_td *halted_td = NULL;
 	struct xhci_td *last_unlinked_td;
 	struct xhci_ep_ctx *ep_ctx;
 	struct xhci_virt_device *vdev;
 	u64 hw_deq;
 	struct xhci_dequeue_state deq_state;
+	u32 comp_code;
 
 	if (unlikely(TRB_TO_SUSPEND_PORT(le32_to_cpu(trb->generic.field[3])))) {
 		if (!xhci->devs[slot_id])
@@ -764,9 +766,19 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 
 	vdev = xhci->devs[slot_id];
 	ep_ctx = xhci_get_ep_ctx(xhci, vdev->out_ctx, ep_index);
+
 	trace_xhci_handle_cmd_stop_ep(ep_ctx);
 
 	ep = &xhci->devs[slot_id]->eps[ep_index];
+	comp_code = GET_COMP_CODE(le32_to_cpu(event->status));
+
+	if (comp_code == COMP_CONTEXT_STATE_ERROR) {
+		/* endpoint is halted and needs to be reset */
+		if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED) {
+			ep->ep_state |= EP_HALTED;
+		}
+	}
+
 	last_unlinked_td = list_last_entry(&ep->cancelled_td_list,
 			struct xhci_td, cancelled_td_list);
 
@@ -833,16 +845,60 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 
 	xhci_stop_watchdog_timer_in_irq(xhci, ep);
 
+	/*
+	 * If stop endpoint command raced with a halting endpoint we need to
+	 * reset the endpoint first. If the TD we halted on isn't cancelled we
+	 * must give it back with -EPIPE status, and move ring dequeue past it.
+	 * If we can't find hw_deq, or the TD we halted on, do a soft reset
+	 */
+	   /* FIXME, is there a risk EP_HALTED is set from other cases */
+	if (ep->ep_state & EP_HALTED) {
+		enum xhci_ep_reset_type reset_type = EP_SOFT_RESET;
+		struct xhci_td *td;
+
+		if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
+			reset_type = EP_HARD_RESET;
+		} else if (ep->ep_state & EP_HAS_STREAMS) {
+			/* soft reset, nothing else */
+		} else if (!list_empty(&ep->ring->td_list)) {
+			hw_deq = xhci_get_hw_deq(xhci, vdev, ep_index, 0);
+			hw_deq &= ~0xf;
+			td = list_first_entry(&ep->ring->td_list,
+						     struct xhci_td, td_list);
+			if (trb_in_td(xhci, td->start_seg, td->first_trb,
+				      td->last_trb, hw_deq, false)) {
+				halted_td = td;
+				reset_type = EP_HARD_RESET;
+				xhci_find_new_dequeue_state(xhci, slot_id,
+							    ep_index, 0, td,
+							    &deq_state);
+			}
+		}
+		xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
+		/* FIXME xhci_clear_hub_tt_buffer(xhci, td, ep); */
+	}
+
 	/* If necessary, queue a Set Transfer Ring Dequeue Pointer command */
 	if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
 		xhci_queue_new_dequeue_state(xhci, slot_id, ep_index,
 					     &deq_state);
 		xhci_ring_cmd_db(xhci);
+	} else if (ep->ep_state & EP_HALTED) {
+		xhci_ring_cmd_db(xhci); /* for endpoint soft reset command */
 	} else {
 		/* Otherwise ring the doorbell(s) to restart queued transfers */
 		ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
 	}
 
+	/* If TD we halted on wasn't cancelled give it back with -EPIPE */
+	if (halted_td) {
+		xhci_unmap_td_bounce_buffer(xhci, ep->ring, halted_td);
+		list_del_init(&halted_td->td_list);
+		inc_td_cnt(halted_td->urb);
+		if (last_td_in_urb(halted_td))
+			xhci_giveback_urb_in_irq(xhci, halted_td, -EPIPE);
+	}
+
 	/*
 	 * Drop the lock and complete the URBs in the cancelled TD list.
 	 * New TDs to be cancelled might be added to the end of the list before
-- 
2.17.1