From: Mathias Nyman <mathias.nyman@linux.intel.com>
To: Fabian Melzow <fabian.melzow@gmail.com>
Cc: linux-usb@vger.kernel.org
Subject: Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
Date: Tue, 30 Jun 2020 22:03:29 +0300 [thread overview]
Message-ID: <f41aab00-ea04-bdd2-4174-33b2b19dc850@linux.intel.com> (raw)
In-Reply-To: <20200630185803.2a72c123@ping>
[-- Attachment #1: Type: text/plain, Size: 2482 bytes --]
On 30.6.2020 19.58, Fabian Melzow wrote:
> Hi!
>
> Am Mon, 29 Jun 2020 20:47:24 +0300
> schrieb Mathias Nyman <mathias.nyman@linux.intel.com>:
>
>> First issue I see is that the attempt to recover from a transaction
>> error with a soft retry isn't working. We expect the hardware to
>> retry the transfer but nothing seems to happen. Soft retry is
>> described in xhci specs 4.6.8.1 and is basically a reset endpoint
>> command with TSP set, followed by ringing the endpoint doorbell.
>> Traces indicate driver does this correctly but hardware isn't
>> retrying. We get don't get any event, no error, success or stall.
>>
>> This could be hardware flaw.
>> Any chance you could try this on a xHC from some other vendor?
>
> There is no other xHC hardware available to me.
>
>> Second issue is a driver flaw, when nothing happened for 20 seconds
>> we see the URB is canceled. xhci driver needs to stop then endpoint
>> to cancel the URB, but there is a hw race and endpoint ends up halted
>> instead of stopped. The xhci driver can't handle a halted endpoint in
>> its stop endpoint handler properly, and the URB is never actually
>> removed from the ring.
>>
>> The reason you see the IO_PAGE_FAULT is probably because once the
>> ring starts running the driver will handle the cancelled URB, and
>> touch already freed memory: AMD-Vi: Event logged [IO_PAGE_FAULT
>> domain=0x000d address=0xdc707028 flags=0x0020]
>>
>> I have a patch for this second case, I haven't upstreamed it as it
>> got some conflicting feedback earlier. It won't solve the 20 second
>> delay, but should solve the the IO_PAGE_FAULT and the "WARN Set TR
>> Deq Ptr cmd failed due to incorrect slot or ep state" message
>>
>> Can you try it out?
>
> I successful applied the patch against Linux 5.7.4, but get this error when
> compiling drivers/usb/host/xhci-ring.c:
>
> CC [M] drivers/usb/host/xhci-ring.o
> drivers/usb/host/xhci-ring.c: In function ‘xhci_handle_cmd_stop_ep’:
> drivers/usb/host/xhci-ring.c:857:3: error: implicit declaration of function ‘xhci_reset_halted_ep’ [-Werror=implicit-function-declaration]
> 857 | xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
> | ^~~~~~~~~~~~~~~~~~~~
>
Right, forgot that you need another patch before this.
both patches attached, also applied to 5.8-rc1 in branch "fix_invalid_context_at_stop_endpoint"
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
-Mathias
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-xhci-crete-xhci_reset_halted_ep-helper-function.patch --]
[-- Type: text/x-patch; name="0001-xhci-crete-xhci_reset_halted_ep-helper-function.patch", Size: 2587 bytes --]
From ddb2004cb51fcf5acd668590c56fbf571ca66071 Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue, 17 Dec 2019 14:03:34 +0200
Subject: [PATCH 1/2] xhci: crete xhci_reset_halted_ep() helper function
Create a separate helper function to only issue reset endpont
commands to clear halted endpoints.
This will be useful for cases where the a halted endpoint is discovered
while completing some other command, and needs to be cleared before
continuing.
No functional changes
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
drivers/usb/host/xhci-ring.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 2c255d0620b0..5c223e92b8db 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -708,6 +708,26 @@ static void xhci_unmap_td_bounce_buffer(struct xhci_hcd *xhci,
seg->bounce_offs = 0;
}
+static int xhci_reset_halted_ep(struct xhci_hcd *xhci, unsigned int slot_id,
+ unsigned int ep_index, enum xhci_ep_reset_type reset_type)
+{
+ struct xhci_command *command;
+ int ret = 0;
+
+ command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
+ if (!command) {
+ ret = -ENOMEM;
+ goto done;
+ }
+
+ ret = xhci_queue_reset_ep(xhci, command, slot_id, ep_index, reset_type);
+done:
+ if (ret)
+ xhci_err(xhci, "ERROR queuing reset endpoint for slot %d ep_index %d, %d\n",
+ slot_id, ep_index, ret);
+ return ret;
+}
+
/*
* When we get a command completion for a Stop Endpoint Command, we need to
* unlink any cancelled TDs from the ring. There are two ways to do that:
@@ -1855,7 +1875,7 @@ static void xhci_cleanup_halted_endpoint(struct xhci_hcd *xhci,
enum xhci_ep_reset_type reset_type)
{
struct xhci_virt_ep *ep = &xhci->devs[slot_id]->eps[ep_index];
- struct xhci_command *command;
+ int err;
/*
* Avoid resetting endpoint if link is inactive. Can cause host hang.
@@ -1864,13 +1884,11 @@ static void xhci_cleanup_halted_endpoint(struct xhci_hcd *xhci,
if (xhci->devs[slot_id]->flags & VDEV_PORT_ERROR)
return;
- command = xhci_alloc_command(xhci, false, GFP_ATOMIC);
- if (!command)
- return;
-
ep->ep_state |= EP_HALTED;
- xhci_queue_reset_ep(xhci, command, slot_id, ep_index, reset_type);
+ err = xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
+ if (err)
+ return;
if (reset_type == EP_HARD_RESET) {
ep->ep_state |= EP_HARD_CLEAR_TOGGLE;
--
2.17.1
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-xhci-fix-halted-endpoint-at-stop-endpoint-command-co.patch --]
[-- Type: text/x-patch; name="0002-xhci-fix-halted-endpoint-at-stop-endpoint-command-co.patch", Size: 5167 bytes --]
From 9483b48fd7167358d76d86d48790df4301ec7b43 Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue, 7 Jan 2020 16:12:17 +0200
Subject: [PATCH 2/2] xhci: fix halted endpoint at stop endpoint command
completion
xhci 4.6.9: A Busy endpoint may asynchronously transition from the
Running to the Halted or Error state due to error conditions detected
while processing TRBs. A possible race condition may occur if software,
thinking an endpoint is in the Running state, issues a Stop Endpoint
Command however at the same time the xHC asynchronously transitions
the endpoint to the Halted or Error state. In this case, a Context
State Error may be generated for the command completion. Software
may verify that this case occurred by inspecting the EP State for
Halted or Error when a Stop Endpoint Command results in a Context
State Error.
Fix this case by resetting the halted endpoint after cleaning
up the calcelled trbs from the ring.
If the TRB we halted on was cancelled then queue a new set TR dequeue
pointer command as usually.
If it wasn't cancelled then move past it with a set TR dequeue pointer
and give it back with -EPIPE status as in a normal halted endpoint case.
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
drivers/usb/host/xhci-ring.c | 56 ++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 5c223e92b8db..ceb3fac3f1c9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -745,11 +745,13 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
struct xhci_ring *ep_ring;
struct xhci_virt_ep *ep;
struct xhci_td *cur_td = NULL;
+ struct xhci_td *halted_td = NULL;
struct xhci_td *last_unlinked_td;
struct xhci_ep_ctx *ep_ctx;
struct xhci_virt_device *vdev;
u64 hw_deq;
struct xhci_dequeue_state deq_state;
+ u32 comp_code;
if (unlikely(TRB_TO_SUSPEND_PORT(le32_to_cpu(trb->generic.field[3])))) {
if (!xhci->devs[slot_id])
@@ -764,9 +766,19 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
vdev = xhci->devs[slot_id];
ep_ctx = xhci_get_ep_ctx(xhci, vdev->out_ctx, ep_index);
+
trace_xhci_handle_cmd_stop_ep(ep_ctx);
ep = &xhci->devs[slot_id]->eps[ep_index];
+ comp_code = GET_COMP_CODE(le32_to_cpu(event->status));
+
+ if (comp_code == COMP_CONTEXT_STATE_ERROR) {
+ /* endpoint is halted and needs to be reset */
+ if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED) {
+ ep->ep_state |= EP_HALTED;
+ }
+ }
+
last_unlinked_td = list_last_entry(&ep->cancelled_td_list,
struct xhci_td, cancelled_td_list);
@@ -833,16 +845,60 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
xhci_stop_watchdog_timer_in_irq(xhci, ep);
+ /*
+ * If stop endpoint command raced with a halting endpoint we need to
+ * reset the endpoint first. If the TD we halted on isn't cancelled we
+ * must give it back with -EPIPE status, and move ring dequeue past it.
+ * If we can't find hw_deq, or the TD we halted on, do a soft reset
+ */
+ /* FIXME, is there a risk EP_HALTED is set from other cases */
+ if (ep->ep_state & EP_HALTED) {
+ enum xhci_ep_reset_type reset_type = EP_SOFT_RESET;
+ struct xhci_td *td;
+
+ if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
+ reset_type = EP_HARD_RESET;
+ } else if (ep->ep_state & EP_HAS_STREAMS) {
+ /* soft reset, nothing else */
+ } else if (!list_empty(&ep->ring->td_list)) {
+ hw_deq = xhci_get_hw_deq(xhci, vdev, ep_index, 0);
+ hw_deq &= ~0xf;
+ td = list_first_entry(&ep->ring->td_list,
+ struct xhci_td, td_list);
+ if (trb_in_td(xhci, td->start_seg, td->first_trb,
+ td->last_trb, hw_deq, false)) {
+ halted_td = td;
+ reset_type = EP_HARD_RESET;
+ xhci_find_new_dequeue_state(xhci, slot_id,
+ ep_index, 0, td,
+ &deq_state);
+ }
+ }
+ xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
+ /* FIXME xhci_clear_hub_tt_buffer(xhci, td, ep); */
+ }
+
/* If necessary, queue a Set Transfer Ring Dequeue Pointer command */
if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
xhci_queue_new_dequeue_state(xhci, slot_id, ep_index,
&deq_state);
xhci_ring_cmd_db(xhci);
+ } else if (ep->ep_state & EP_HALTED) {
+ xhci_ring_cmd_db(xhci); /* for endpoint soft reset command */
} else {
/* Otherwise ring the doorbell(s) to restart queued transfers */
ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
}
+ /* If TD we halted on wasn't cancelled give it back with -EPIPE */
+ if (halted_td) {
+ xhci_unmap_td_bounce_buffer(xhci, ep->ring, halted_td);
+ list_del_init(&halted_td->td_list);
+ inc_td_cnt(halted_td->urb);
+ if (last_td_in_urb(halted_td))
+ xhci_giveback_urb_in_irq(xhci, halted_td, -EPIPE);
+ }
+
/*
* Drop the lock and complete the URBs in the cancelled TD list.
* New TDs to be cancelled might be added to the end of the list before
--
2.17.1
next prev parent reply other threads:[~2020-06-30 19:00 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-20 19:19 ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1 Fabian Melzow
2020-06-29 17:47 ` Mathias Nyman
2020-06-30 10:40 ` [PATCH] xhci: fix halted endpoint at stop endpoint command completion kernel test robot
2020-06-30 16:58 ` ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1 Fabian Melzow
2020-06-30 19:03 ` Mathias Nyman [this message]
2020-07-01 9:17 ` David Heinzelmann
2020-07-03 16:00 ` Mathias Nyman
2020-07-01 17:51 ` Fabian Melzow
2020-07-03 15:57 ` Mathias Nyman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f41aab00-ea04-bdd2-4174-33b2b19dc850@linux.intel.com \
--to=mathias.nyman@linux.intel.com \
--cc=fabian.melzow@gmail.com \
--cc=linux-usb@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).