Linux-USB Archive on lore.kernel.org
 help / color / Atom feed
From: David Heinzelmann <heinzelmann.david@gmail.com>
To: Mathias Nyman <mathias.nyman@linux.intel.com>
Cc: Fabian Melzow <fabian.melzow@gmail.com>, linux-usb@vger.kernel.org
Subject: Re: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1
Date: Wed, 1 Jul 2020 11:17:55 +0200
Message-ID: <20200701091755.GA17742@dhe> (raw)
In-Reply-To: <f41aab00-ea04-bdd2-4174-33b2b19dc850@linux.intel.com>

On Tue, Jun 30, 2020 at 10:03:29PM +0300, Mathias Nyman wrote:
> On 30.6.2020 19.58, Fabian Melzow wrote:
> > Hi!
> > 
> > Am Mon, 29 Jun 2020 20:47:24 +0300
> > schrieb Mathias Nyman <mathias.nyman@linux.intel.com>:
> > 
> >> First issue I see is that the attempt to recover from a transaction
> >> error with a soft retry isn't working. We expect the hardware to
> >> retry the transfer but nothing seems to happen. Soft retry is
> >> described in xhci specs 4.6.8.1 and is basically a reset endpoint
> >> command with TSP set, followed by ringing the endpoint doorbell.
> >> Traces indicate driver does this correctly but hardware isn't
> >> retrying. We get don't get any event, no error, success or stall.
> >>
> >> This could be hardware flaw.
> >> Any chance you could try this on a xHC from some other vendor?
> > 
> > There is no other xHC hardware available to me.
> > 
> >> Second issue is a driver flaw, when nothing happened for 20 seconds
> >> we see the URB is canceled. xhci driver needs to stop then endpoint
> >> to cancel the URB, but there is a hw race and endpoint ends up halted
> >> instead of stopped. The xhci driver can't handle a halted endpoint in
> >> its stop endpoint handler properly, and the URB is never actually
> >> removed from the ring.
> >>
> >> The reason you see the IO_PAGE_FAULT is probably because once the
> >> ring starts running the driver will handle the cancelled URB, and
> >> touch already freed memory: AMD-Vi: Event logged [IO_PAGE_FAULT
> >> domain=0x000d address=0xdc707028 flags=0x0020]
> >>
> >> I have a patch for this second case, I haven't upstreamed it as it
> >> got some conflicting feedback earlier. It won't solve the 20 second
> >> delay, but should solve the the IO_PAGE_FAULT and the "WARN Set TR
> >> Deq Ptr cmd failed due to incorrect slot or ep state" message
> >>
> >> Can you try it out?
> > 
> > I successful applied the patch against Linux 5.7.4, but get this error when
> > compiling drivers/usb/host/xhci-ring.c:
> > 
> >   CC [M]  drivers/usb/host/xhci-ring.o
> > drivers/usb/host/xhci-ring.c: In function ‘xhci_handle_cmd_stop_ep’:
> > drivers/usb/host/xhci-ring.c:857:3: error: implicit declaration of function ‘xhci_reset_halted_ep’ [-Werror=implicit-function-declaration]
> >   857 |   xhci_reset_halted_ep(xhci, slot_id, ep_index, reset_type);
> >       |   ^~~~~~~~~~~~~~~~~~~~
> > 
> 
> Right, forgot that you need another patch before this.
> 
> both patches attached, also applied to 5.8-rc1 in branch "fix_invalid_context_at_stop_endpoint"
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_invalid_context_at_stop_endpoint
> 
> -Mathias
> 

Hi,

I've been investigating what appears to be an identical problem recently which I think is related
to the description from xhci 4.6.9. I have an fx2 device which stalls its endpoint while at the
same time the URBs are aborted from the software.

Kernel logs which occured when the xhci driver is stucked at the halted endpoint.

Jun 29 17:27:12 dhe-pc kernel: xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Jun 29 17:27:12 dhe-pc kernel: xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Jun 29 17:27:12 dhe-pc kernel: xhci_hcd 0000:00:14.0: WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
Jun 29 17:27:21 dhe-pc kernel: xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 6
Jun 29 17:27:21 dhe-pc kernel: xhci_hcd 0000:00:14.0: Looking for event-dma 00000001173dc630 trb-start 00000001173dcc30 trb-end 0000000000000000 seg-start 00000001173dc000 seg-end 00000001173dcff0
Jun 29 17:27:21 dhe-pc kernel: xhci_hcd 0000:00:14.0: Looking for event-dma 00000001173dc630 trb-start 0000000116e53000 trb-end 0000000116e53830 seg-start 0000000116e53000 seg-end 0000000116e53ff0

I have applied your patch to kernel version 5.8-rc3 and I can confirm that the patch solves
my problem for the halted endpoint.

I still get the following error messages from time to time.

Jul 01 09:56:56 dhe kernel: xhci_hcd 0000:00:14.0: WARN Event TRB for slot 11 ep 4 with no TDs queued?
Jul 01 09:56:56 dhe kernel: xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Jul 01 09:56:56 dhe kernel: xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.

or

Jul 01 10:09:27 dhe kernel: xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Jul 01 10:09:27 dhe kernel: xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Jul 01 10:09:29 dhe kernel: xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
Jul 01 10:09:29 dhe kernel: xhci_hcd 0000:00:14.0: Looking for event-dma 0000000473d66d90 trb-start 0000000473d66e70 trb-end 0000000000000000 seg-start 0000000473d66000 seg-end 0000000473d66ff0
Jul 01 10:09:29 dhe kernel: xhci_hcd 0000:00:14.0: Looking for event-dma 0000000473d66d90 trb-start 0000000473d65000 trb-end 0000000473d65a70 seg-start 0000000473d65000 seg-end 0000000473d65ff0
Jul 01 10:09:29 dhe kernel: xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 13
Jul 01 10:09:29 dhe kernel: xhci_hcd 0000:00:14.0: Looking for event-dma 0000000473d66e60 trb-start 0000000473d66e70 trb-end 0000000000000000 seg-start 0000000473d66000 seg-end 0000000473d66ff0
Jul 01 10:09:29 dhe kernel: xhci_hcd 0000:00:14.0: Looking for event-dma 0000000473d66e60 trb-start 0000000473d65000 trb-end 0000000473d65a70 seg-start 0000000473d65000 seg-end 0000000473d65ff0

but the core problem that the xhci is stucked at the halted endpoint state does not occur anymore.
If you are interested in more detailed debug logs I can provide them.

David


  reply index

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-20 19:19 Fabian Melzow
2020-06-29 17:47 ` Mathias Nyman
2020-06-30 10:40   ` [PATCH] xhci: fix halted endpoint at stop endpoint command completion kernel test robot
2020-06-30 16:58   ` ERROR Transfer event TRB DMA ptr not part of current TD ep_index 4 comp_code 1 Fabian Melzow
2020-06-30 19:03     ` Mathias Nyman
2020-07-01  9:17       ` David Heinzelmann [this message]
2020-07-03 16:00         ` Mathias Nyman
2020-07-01 17:51       ` Fabian Melzow
2020-07-03 15:57         ` Mathias Nyman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200701091755.GA17742@dhe \
    --to=heinzelmann.david@gmail.com \
    --cc=fabian.melzow@gmail.com \
    --cc=linux-usb@vger.kernel.org \
    --cc=mathias.nyman@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-USB Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-usb/0 linux-usb/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-usb linux-usb/ https://lore.kernel.org/linux-usb \
		linux-usb@vger.kernel.org
	public-inbox-index linux-usb

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-usb


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git