Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1

From: Mathias Nyman <mathias.nyman@linux.intel.com>
To: Fabian Melzow <fabian.melzow@gmail.com>
Cc: linux-usb@vger.kernel.org
Subject: Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
Date: Tue, 30 Jun 2020 22:03:29 +0300	[thread overview]
Message-ID: <f41aab00-ea04-bdd2-4174-33b2b19dc850@linux.intel.com> (raw)
In-Reply-To: <20200630185803.2a72c123@ping>

[-- Attachment #1: Type: text/plain, Size: 2482 bytes --]

On 30.6.2020 19.58, Fabian Melzow wrote:
> Hi!
> 
> Am Mon, 29 Jun 2020 20:47:24 +0300
> schrieb Mathias Nyman <mathias.nyman@linux.intel.com>:
> 
>> First issue I see is that the attempt to recover from a transaction
>> error with a soft retry isn't working. We expect the hardware to
>> retry the transfer but nothing seems to happen. Soft retry is
>> described in xhci specs 4.6.8.1 and is basically a reset endpoint
>> command with TSP set, followed by ringing the endpoint doorbell.
>> Traces indicate driver does this correctly but hardware isn't
>> retrying. We get don't get any event, no error, success or stall.
>>
>> This could be hardware flaw.
>> Any chance you could try this on a xHC from some other vendor?
> 
> There is no other xHC hardware available to me.
> 
>> Second issue is a driver flaw, when nothing happened for 20 seconds
>> we see the URB is canceled. xhci driver needs to stop then endpoint
>> to cancel the URB, but there is a hw race and endpoint ends up halted
>> instead of stopped. The xhci driver can't handle a halted endpoint in
>> its stop endpoint handler properly, and the URB is never actually
>> removed from the ring.
>>
>> The reason you see the IO_PAGE_FAULT is probably because once the
>> ring starts running the driver will handle the cancelled URB, and
>> touch already freed memory: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x000d address=0xdc707028 flags=0x0020]
>>
>> I have a patch for this second case, I haven't upstreamed it as it
>> got some conflicting feedback earlier. It won't solve the 20 second
>> delay, but should solve the the IO_PAGE_FAULT and the "WARN Set TR
>> Deq Ptr cmd failed due to incorrect slot or ep state" message
>>
>> Can you try it out?
> 
> I successful applied the patch against Linux 5.7.4, but get this error when
> compiling drivers/usb/host/xhci-ring.c:
> 
>   CC [M]  drivers/usb/host/xhci-ring.o
> drivers/usb/host/xhci-ring.c: In function ‘xhci_handle_cmd_stop_ep’:
> drivers/usb/host/xhci-ring.c:857:3: error: implicit declaration of function ‘xhci_reset_halted_ep’ [-Werror=implicit-function-declaration]
>   857 |   xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
>       |   ^~~~~~~~~~~~~~~~~~~~
> 

Right, forgot that you need another patch before this.

both patches attached, also applied to 5.8-rc1 in branch "fix_invalid_context_at_stop_endpoint"
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint

-Mathias


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-xhci-crete-xhci_reset_halted_ep-helper-function.patch --]
[-- Type: text/x-patch; name="0001-xhci-crete-xhci_reset_halted_ep-helper-function.patch", Size: 2587 bytes --]

From ddb2004cb51fcf5acd668590c56fbf571ca66071 Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue, 17 Dec 2019 14:03:34 +0200
Subject: [PATCH 1/2] xhci: crete xhci_reset_halted_ep() helper function

Create a separate helper function to only issue reset endpont
commands to clear halted endpoints.

This will be useful for cases where the a halted endpoint is discovered
while completing some other command, and needs to be cleared before
continuing.

No functional changes

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 2c255d0620b0..5c223e92b8db 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -708,6 +708,26 @@ static void xhci_unmap_td_bounce_buffer(struct xhci_hcd *xhci,
 	seg->bounce_offs = 0;
 }
 
+static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
+				unsigned int ep_index, enum xhci_ep_reset_type reset_type)
+{
+	struct xhci_command *command;
+	int ret = 0;
+
+	command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
+	if (!command) {
+		ret = -ENOMEM;
+		goto done;
+	}
+
+	ret = xhci_queue_reset_ep(xhci, command, slot_id, ep_index, reset_type);
+done:
+	if (ret)
+		xhci_err(xhci, "ERROR queuing reset endpoint for slot %d ep_index %d, %d\n",
+			 slot_id, ep_index, ret);
+	return ret;
+}
+
 /*
  * When we get a command completion for a Stop Endpoint Command, we need to
  * unlink any cancelled TDs from the ring.  There are two ways to do that:
@@ -1855,7 +1875,7 @@ static void xhci_cleanup_halted_endpoint(struct xhci_hcd *xhci,
 		enum xhci_ep_reset_type reset_type)
 {
 	struct xhci_virt_ep *ep = &xhci->devs[slot_id]->eps[ep_index];
-	struct xhci_command *command;
+	int err;
 
 	/*
 	 * Avoid resetting endpoint if link is inactive. Can cause host hang.
@@ -1864,13 +1884,11 @@ static void xhci_cleanup_halted_endpoint(struct xhci_hcd *xhci,
 	if (xhci->devs[slot_id]->flags & VDEV_PORT_ERROR)
 		return;
 
-	command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
-	if (!command)
-		return;
-
 	ep->ep_state |= EP_HALTED;
 
-	xhci_queue_reset_ep(xhci, command, slot_id, ep_index, reset_type);
+	err = xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
+	if (err)
+		return;
 
 	if (reset_type == EP_HARD_RESET) {
 		ep->ep_state |= EP_HARD_CLEAR_TOGGLE;
-- 
2.17.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-xhci-fix-halted-endpoint-at-stop-endpoint-command-co.patch --]
[-- Type: text/x-patch; name="0002-xhci-fix-halted-endpoint-at-stop-endpoint-command-co.patch", Size: 5167 bytes --]

From 9483b48fd7167358d76d86d48790df4301ec7b43 Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue, 7 Jan 2020 16:12:17 +0200
Subject: [PATCH 2/2] xhci: fix halted endpoint at stop endpoint command
 completion

xhci 4.6.9: A Busy endpoint may asynchronously transition from the
Running to the Halted or Error state due to error conditions detected
while processing TRBs. A possible race condition may occur if software,
thinking an endpoint is in the Running state, issues a Stop Endpoint
Command however at the same time the xHC asynchronously transitions
the endpoint to the Halted or Error state. In this case, a Context
State Error may be generated for the command completion. Software
may verify that this case occurred by inspecting the EP State for
Halted or Error when a Stop Endpoint Command results in a Context
State Error.

Fix this case by resetting the halted endpoint after cleaning
up the calcelled trbs from the ring.
If the TRB we halted on was cancelled then queue a new set TR dequeue
pointer command as usually.
If it wasn't cancelled then move past it with a set TR dequeue pointer
and give it back with -EPIPE status as in a normal halted endpoint case.

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 56 ++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 5c223e92b8db..ceb3fac3f1c9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -745,11 +745,13 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	struct xhci_ring *ep_ring;
 	struct xhci_virt_ep *ep;
 	struct xhci_td *cur_td = NULL;
+	struct xhci_td *halted_td = NULL;
 	struct xhci_td *last_unlinked_td;
 	struct xhci_ep_ctx *ep_ctx;
 	struct xhci_virt_device *vdev;
 	u64 hw_deq;
 	struct xhci_dequeue_state deq_state;
+	u32 comp_code;
 
 	if (unlikely(TRB_TO_SUSPEND_PORT(le32_to_cpu(trb->generic.field[3])))) {
 		if (!xhci->devs[slot_id])
@@ -764,9 +766,19 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 
 	vdev = xhci->devs[slot_id];
 	ep_ctx = xhci_get_ep_ctx(xhci, vdev->out_ctx, ep_index);
+
 	trace_xhci_handle_cmd_stop_ep(ep_ctx);
 
 	ep = &xhci->devs[slot_id]->eps[ep_index];
+	comp_code = GET_COMP_CODE(le32_to_cpu(event->status));
+
+	if (comp_code == COMP_CONTEXT_STATE_ERROR) {
+		/* endpoint is halted and needs to be reset */
+		if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED) {
+			ep->ep_state |= EP_HALTED;
+		}
+	}
+
 	last_unlinked_td = list_last_entry(&ep->cancelled_td_list,
 			struct xhci_td, cancelled_td_list);
 
@@ -833,16 +845,60 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 
 	xhci_stop_watchdog_timer_in_irq(xhci, ep);
 
+	/*
+	 * If stop endpoint command raced with a halting endpoint we need to
+	 * reset the endpoint first. If the TD we halted on isn't cancelled we
+	 * must give it back with -EPIPE status, and move ring dequeue past it.
+	 * If we can't find hw_deq, or the TD we halted on, do a soft reset
+	 */
+	   /* FIXME, is there a risk EP_HALTED is set from other cases */
+	if (ep->ep_state & EP_HALTED) {
+		enum xhci_ep_reset_type reset_type = EP_SOFT_RESET;
+		struct xhci_td *td;
+
+		if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
+			reset_type = EP_HARD_RESET;
+		} else if (ep->ep_state & EP_HAS_STREAMS) {
+			/* soft reset, nothing else */
+		} else if (!list_empty(&ep->ring->td_list)) {
+			hw_deq = xhci_get_hw_deq(xhci, vdev, ep_index, 0);
+			hw_deq &= ~0xf;
+			td = list_first_entry(&ep->ring->td_list,
+						     struct xhci_td, td_list);
+			if (trb_in_td(xhci, td->start_seg, td->first_trb,
+				      td->last_trb, hw_deq, false)) {
+				halted_td = td;
+				reset_type = EP_HARD_RESET;
+				xhci_find_new_dequeue_state(xhci, slot_id,
+							    ep_index, 0, td,
+							    &deq_state);
+			}
+		}
+		xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
+		/* FIXME xhci_clear_hub_tt_buffer(xhci, td, ep); */
+	}
+
 	/* If necessary, queue a Set Transfer Ring Dequeue Pointer command */
 	if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
 		xhci_queue_new_dequeue_state(xhci, slot_id, ep_index,
 					     &deq_state);
 		xhci_ring_cmd_db(xhci);
+	} else if (ep->ep_state & EP_HALTED) {
+		xhci_ring_cmd_db(xhci); /* for endpoint soft reset command */
 	} else {
 		/* Otherwise ring the doorbell(s) to restart queued transfers */
 		ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
 	}
 
+	/* If TD we halted on wasn't cancelled give it back with -EPIPE */
+	if (halted_td) {
+		xhci_unmap_td_bounce_buffer(xhci, ep->ring, halted_td);
+		list_del_init(&halted_td->td_list);
+		inc_td_cnt(halted_td->urb);
+		if (last_td_in_urb(halted_td))
+			xhci_giveback_urb_in_irq(xhci, halted_td, -EPIPE);
+	}
+
 	/*
 	 * Drop the lock and complete the URBs in the cancelled TD list.
 	 * New TDs to be cancelled might be added to the end of the list before
-- 
2.17.1