linux-usb.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mathias Nyman <mathias.nyman@linux.intel.com>
To: Fabian Melzow <fabian.melzow@gmail.com>, linux-usb@vger.kernel.org
Subject: Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
Date: Mon, 29 Jun 2020 20:47:24 +0300	[thread overview]
Message-ID: <264e8287-b538-0798-36a6-7eafc4387a8d@linux.intel.com> (raw)
In-Reply-To: <20200620211913.1535bac0@ping>

Hi

On 20.6.2020 22.19, Fabian Melzow wrote:
> Hi folks!
> 
> After running the first scanimage I get the following errors/warnings:
> 
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Successful Set TR Deq Ptr cmd, deq = @d7615080
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Set TR Deq Ptr cmd, new deq seg = 00000000de74ec7f (0xd7615000 dma), new deq ptr = 00000000ff17b5e1 (0xd7615080 dma), new cycle = 1
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: New dequeue pointer = 0xd7615080 (DMA)
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: New dequeue segment = 00000000de74ec7f (virtual)
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Cycle state = 0x1
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Finding endpoint context
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Removing canceled TD starting at 0xd7615070 (dma).
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Stopped on Transfer TRB for slot 1 ep 4
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:42:15 ping kernel: xhci_hcd 0000:01:00.0: Cancel URB 00000000fa372bb8, dev 3, ep 0x82, starting at offset 0xd7615070
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xdc707028 flags=0x0020]
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: Looking for event-dma 00000000d7615060 trb-start 00000000d7615070 trb-end 00000000d7615070 seg-start 00000000d7615000 seg-end 00000000d7615ff0
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xc9b4c730 flags=0x0020]
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: ep 0x82 - asked for 1 bytes, 0 bytes untransferred
> Jun 20 19:41:55 ping kernel: xhci_hcd 0000:01:00.0: WARN Successful completion on short TX
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Slot state = 3, EP state = 2
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Set TR Deq Ptr cmd, new deq seg = 00000000de74ec7f (0xd7615000 dma), new deq ptr = 00000000f035f701 (0xd7615060 dma), new cycle = 1
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: New dequeue pointer = 0xd7615060 (DMA)
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: New dequeue segment = 00000000de74ec7f (virtual)
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Cycle state = 0x1
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Finding endpoint context
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Removing canceled TD starting at 0xd7615050 (dma).
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: // Ding dong!
> Jun 20 19:41:54 ping kernel: xhci_hcd 0000:01:00.0: Cancel URB 000000002fee127b, dev 3, ep 0x82, starting at offset 0xd7615050
> > 
> The complete output of the above script along lspci -vvv, lsusb -v
> and filtered journal (kernel messages and script markings) can be
> downloaded from
> 
> http://fabi.bplaced.net/debuglogs1.tar.xz

I had a look at both the traces and dmesg. 
First issue I see is that the attempt to recover from a transaction error with a soft retry
isn't working. We expect the hardware to retry the transfer but nothing seems to happen.
Soft retry is described in xhci specs 4.6.8.1 and is basically a reset endpoint command with TSP set,
followed by ringing the endpoint doorbell. Traces indicate driver does this correctly but hardware
isn't retrying. We get don't get any event, no error, success or stall.

This could be hardware flaw.
Any chance you could try this on a xHC from some other vendor?

Second issue is a driver flaw, when nothing happened for 20 seconds we see the URB is canceled.
xhci driver needs to stop then endpoint to cancel the URB, but there is a hw race and endpoint
ends up halted instead of stopped. The xhci driver can't handle a halted endpoint in its stop endpoint
handler properly, and the URB is never actually removed from the ring.

The reason you see the IO_PAGE_FAULT is probably because once the ring starts running the driver will
handle the cancelled URB, and touch already freed memory:  
AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0xdc707028 flags=0x0020]

I have a patch for this second case, I haven't upstreamed it as it got some conflicting feedback earlier.
It won't solve the 20 second delay, but should solve the the IO_PAGE_FAULT and the
"WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state" message

Can you try it out?

From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Tue, 7 Jan 2020 16:12:17 +0200
Subject: [PATCH] xhci: fix halted endpoint at stop endpoint command completion

xhci 4.6.9: A Busy endpoint may asynchronously transition from the
Running to the Halted or Error state due to error conditions detected
while processing TRBs. A possible race condition may occur if software,
thinking an endpoint is in the Running state, issues a Stop Endpoint
Command however at the same time the xHC asynchronously transitions
the endpoint to the Halted or Error state. In this case, a Context
State Error may be generated for the command completion. Software
may verify that this case occurred by inspecting the EP State for
Halted or Error when a Stop Endpoint Command results in a Context
State Error.

Fix this case by resetting the halted endpoint after cleaning
up the canceled trbs from the ring.
If the TRB we halted on was canceled then queue a new set TR dequeue
pointer command as usually.
If it wasn't canceled then move past it with a set TR dequeue pointer
and give it back with -EPIPE status as in a normal halted endpoint case.

Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 56 ++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 5c223e92b8db..ceb3fac3f1c9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -745,11 +745,13 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 	struct xhci_ring *ep_ring;
 	struct xhci_virt_ep *ep;
 	struct xhci_td *cur_td = NULL;
+	struct xhci_td *halted_td = NULL;
 	struct xhci_td *last_unlinked_td;
 	struct xhci_ep_ctx *ep_ctx;
 	struct xhci_virt_device *vdev;
 	u64 hw_deq;
 	struct xhci_dequeue_state deq_state;
+	u32 comp_code;
 
 	if (unlikely(TRB_TO_SUSPEND_PORT(le32_to_cpu(trb->generic.field[3])))) {
 		if (!xhci->devs[slot_id])
@@ -764,9 +766,19 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 
 	vdev = xhci->devs[slot_id];
 	ep_ctx = xhci_get_ep_ctx(xhci, vdev->out_ctx, ep_index);
+
 	trace_xhci_handle_cmd_stop_ep(ep_ctx);
 
 	ep = &xhci->devs[slot_id]->eps[ep_index];
+	comp_code = GET_COMP_CODE(le32_to_cpu(event->status));
+
+	if (comp_code == COMP_CONTEXT_STATE_ERROR) {
+		/* endpoint is halted and needs to be reset */
+		if (GET_EP_CTX_STATE(ep_ctx) == EP_STATE_HALTED) {
+			ep->ep_state |= EP_HALTED;
+		}
+	}
+
 	last_unlinked_td = list_last_entry(&ep->cancelled_td_list,
 			struct xhci_td, cancelled_td_list);
 
@@ -833,16 +845,60 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 
 	xhci_stop_watchdog_timer_in_irq(xhci, ep);
 
+	/*
+	 * If stop endpoint command raced with a halting endpoint we need to
+	 * reset the endpoint first. If the TD we halted on isn't cancelled we
+	 * must give it back with -EPIPE status, and move ring dequeue past it.
+	 * If we can't find hw_deq, or the TD we halted on, do a soft reset
+	 */
+	   /* FIXME, is there a risk EP_HALTED is set from other cases */
+	if (ep->ep_state & EP_HALTED) {
+		enum xhci_ep_reset_type reset_type = EP_SOFT_RESET;
+		struct xhci_td *td;
+
+		if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
+			reset_type = EP_HARD_RESET;
+		} else if (ep->ep_state & EP_HAS_STREAMS) {
+			/* soft reset, nothing else */
+		} else if (!list_empty(&ep->ring->td_list)) {
+			hw_deq = xhci_get_hw_deq(xhci, vdev, ep_index, 0);
+			hw_deq &= ~0xf;
+			td = list_first_entry(&ep->ring->td_list,
+						     struct xhci_td, td_list);
+			if (trb_in_td(xhci, td->start_seg, td->first_trb,
+				      td->last_trb, hw_deq, false)) {
+				halted_td = td;
+				reset_type = EP_HARD_RESET;
+				xhci_find_new_dequeue_state(xhci, slot_id,
+							    ep_index, 0, td,
+							    &deq_state);
+			}
+		}
+		xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
+		/* FIXME xhci_clear_hub_tt_buffer(xhci, td, ep); */
+	}
+
 	/* If necessary, queue a Set Transfer Ring Dequeue Pointer command */
 	if (deq_state.new_deq_ptr && deq_state.new_deq_seg) {
 		xhci_queue_new_dequeue_state(xhci, slot_id, ep_index,
 					     &deq_state);
 		xhci_ring_cmd_db(xhci);
+	} else if (ep->ep_state & EP_HALTED) {
+		xhci_ring_cmd_db(xhci); /* for endpoint soft reset command */
 	} else {
 		/* Otherwise ring the doorbell(s) to restart queued transfers */
 		ring_doorbell_for_active_rings(xhci, slot_id, ep_index);
 	}
 
+	/* If TD we halted on wasn't cancelled give it back with -EPIPE */
+	if (halted_td) {
+		xhci_unmap_td_bounce_buffer(xhci, ep->ring, halted_td);
+		list_del_init(&halted_td->td_list);
+		inc_td_cnt(halted_td->urb);
+		if (last_td_in_urb(halted_td))
+			xhci_giveback_urb_in_irq(xhci, halted_td, -EPIPE);
+	}
+
 	/*
 	 * Drop the lock and complete the URBs in the cancelled TD list.
 	 * New TDs to be cancelled might be added to the end of the list before
-- 
2.17.1




  reply	other threads:[~2020-06-29 19:10 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-20 19:19 ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1 Fabian Melzow
2020-06-29 17:47 ` Mathias Nyman [this message]
2020-06-30 10:40   ` [PATCH] xhci: fix halted endpoint at stop endpoint command completion kernel test robot
2020-06-30 16:58   ` ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1 Fabian Melzow
2020-06-30 19:03     ` Mathias Nyman
2020-07-01  9:17       ` David Heinzelmann
2020-07-03 16:00         ` Mathias Nyman
2020-07-01 17:51       ` Fabian Melzow
2020-07-03 15:57         ` Mathias Nyman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=264e8287-b538-0798-36a6-7eafc4387a8d@linux.intel.com \
    --to=mathias.nyman@linux.intel.com \
    --cc=fabian.melzow@gmail.com \
    --cc=linux-usb@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).