From: Mathias Nyman <mathias.nyman@linux.intel.com>
To: zwisler@google.com
Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>,
"linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
"kernel@collabora.com" <kernel@collabora.com>
Subject: Re: xhci problem -> general protection fault
Date: Mon, 12 Oct 2020 22:20:31 +0300
Message-ID: <69f8cbc3-0ae7-cfb2-2fdd-556ada77381f@linux.intel.com>
In-Reply-To: <20201001164352.GA13249@google.com>
On 1.10.2020 19.43, zwisler@google.com wrote:
> On Tue, Sep 29, 2020 at 01:35:31AM +0300, Mathias Nyman wrote:
> <>
>> The race I was referring to is if a driver issues a "Stop endpoint" command,
>> and it races with an endpoint error/halt initiated by the xHC controller.
>>
>> The additional note in xhci 4.6.9 - Stop Endpoint, explains it:
>> "Note: A Busy endpoint may asynchronously transition from the Running to the Halted
>> or Error state due to error conditions detected while processing TRBs. A possible
>> race condition may occur if software, thinking an endpoint is in the Running state,
>> issues a Stop Endpoint Command however at the same time the xHC
>> asynchronously transitions the endpoint to the Halted or Error state. In this case,
>> a Context State Error may be generated for the command completion. Software
>> may verify that this case occurred by inspecting the EP State for Halted or Error
>> when a Stop Endpoint Command results in a Context State Error."
>>
>> There are several context state errors in your trace.
>>
>> Thanks
>> -Mathias
>
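For reference, the check that spec note describes could be sketched roughly
like this. It is a standalone model, not the driver code: the completion
code and EP State values follow the xHCI spec, but the enum and function
names are made up for illustration.

```c
/* Completion codes and EP State values as defined by the xHCI spec. */
#define COMP_SUCCESS              1
#define COMP_CONTEXT_STATE_ERROR  19

enum model_ep_state {
	MODEL_EP_DISABLED = 0,
	MODEL_EP_RUNNING  = 1,
	MODEL_EP_HALTED   = 2,
	MODEL_EP_STOPPED  = 3,
	MODEL_EP_ERROR    = 4,
};

/* Hypothetical classification of a Stop Endpoint command completion. */
enum model_stop_result {
	MODEL_STOP_OK,               /* endpoint was stopped as requested */
	MODEL_STOP_RACED_WITH_HALT,  /* xHC halted the endpoint first */
	MODEL_STOP_UNEXPECTED,       /* anything else needs a closer look */
};

static enum model_stop_result
model_classify_stop(int comp_code, enum model_ep_state state)
{
	if (comp_code == COMP_SUCCESS)
		return MODEL_STOP_OK;

	/*
	 * A Context State Error is only "expected" if the endpoint context
	 * now reports Halted or Error, i.e. the xHC changed state while
	 * software still believed the endpoint was Running.
	 */
	if (comp_code == COMP_CONTEXT_STATE_ERROR &&
	    (state == MODEL_EP_HALTED || state == MODEL_EP_ERROR))
		return MODEL_STOP_RACED_WITH_HALT;

	return MODEL_STOP_UNEXPECTED;
}
```

A Context State Error with the endpoint still reported as Running would
fall into the last bucket, which is the case worth logging.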
> Interestingly it looks like it's the actions that we take at the end of
> xhci_handle_cmd_set_deq() for the broken command which break the HC.
> Specifically, this line:
>
> dev->eps[ep_index].ep_state &= ~SET_DEQ_PENDING;
>
> If I skip this line when I notice that ep_ctx->deq==0, the system will keep
> running happily.
Skipping this leaves SET_DEQ_PENDING set, which prevents this endpoint
from running again and thus prevents the issues seen if we continue.
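Roughly, the effect can be modelled as below. This is a simplified sketch,
assuming the doorbell is gated on the per-endpoint state flags; the flag
names are the driver's, but the bit values, the struct and the model_*
helpers are only illustrative.

```c
#include <stdbool.h>

/* Flag names as in the driver; bit positions here are illustrative. */
#define SET_DEQ_PENDING      (1u << 0)
#define EP_HALTED            (1u << 1)
#define EP_STOP_CMD_PENDING  (1u << 2)

/* Hypothetical stand-in for the per-endpoint software state. */
struct model_ep {
	unsigned int ep_state;
};

/* The endpoint is only restarted when no command is pending on it. */
static bool model_may_ring_doorbell(const struct model_ep *ep)
{
	return !(ep->ep_state &
		 (SET_DEQ_PENDING | EP_HALTED | EP_STOP_CMD_PENDING));
}

/*
 * Completion of Set TR Deq: clear the pending flag only if the output
 * context looked sane. Leaving it set parks the endpoint.
 */
static void model_finish_set_deq(struct model_ep *ep, bool ctx_looks_valid)
{
	if (ctx_looks_valid)
		ep->ep_state &= ~SET_DEQ_PENDING;
}
```

With ctx_looks_valid false the flag stays set, model_may_ring_doorbell()
keeps returning false, and no new transfers are started on that endpoint.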
>
> Here is a trace and dmesg for a run with the patch at the bottom of this mail.
> I trimmed the trace a bit since it was very large, but I think I've left the
> important bits intact:
>
> https://gist.github.com/rzwisler/422e55321d9d2db5fc258d6d5b93d018
>
> I've been able to run with this patch and survive through many "Mismatch"
> occurrences, both with ep_ctx->deq set to 0 and set to some other value which
> just seems to be wrong.
>
> It seems like there are a few other places where we notice that we're in a bad
> state, and we just bail, specifically these in xhci_queue_new_dequeue_state():
>
>     addr = xhci_trb_virt_to_dma(deq_state->new_deq_seg,
>                                 deq_state->new_deq_ptr);
>     if (addr == 0) {
>         xhci_warn(xhci, "WARN Cannot submit Set TR Deq Ptr\n");
>         xhci_warn(xhci, "WARN deq seg = %px, deq pt = %px\n",
>                   deq_state->new_deq_seg, deq_state->new_deq_ptr);
>         return;
>     }
>     ep = &xhci->devs[slot_id]->eps[ep_index];
>     if ((ep->ep_state & SET_DEQ_PENDING)) {
>         xhci_warn(xhci, "WARN Cannot submit Set TR Deq Ptr\n");
>         xhci_warn(xhci, "A Set TR Deq Ptr command is pending.\n");
>         return;
>     }
>
> Is noticing that the HC has given us bad data via the "Mismatch" check in
> xhci_handle_cmd_set_deq() and bailing out enough, or should we figure out
> exactly why the HC is getting into a bad state?
I'm rewriting how the xhci driver handles halted and canceled transfers.
While looking into it I found an older case where hardware gave bad data
in the output context. This was 10 years ago and on some specific hardware,
see commit:
ac9d8fe7c6a8 USB: xhci: Add quirk for Fresco Logic xHCI hardware.
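For illustration, a sanity check on the dequeue pointer read back from the
output context could look something like the sketch below. It is not the
patch from this thread: the function name and mask macro are hypothetical,
and it assumes the low four bits of the field carry flags (cycle state and
reserved/stream bits) rather than address bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed: low 4 bits of the TR Dequeue Pointer field are not address. */
#define MODEL_DEQ_FLAG_MASK 0xfULL

/*
 * Compare the dequeue pointer the controller wrote to the output context
 * with the one software queued in the Set TR Deq command. A zero address
 * or any difference is treated as bad data from the hardware.
 */
static bool model_deq_matches(uint64_t ctx_deq, uint64_t queued_deq)
{
	uint64_t hw = ctx_deq & ~MODEL_DEQ_FLAG_MASK;

	if (hw == 0)
		return false;

	return hw == (queued_deq & ~MODEL_DEQ_FLAG_MASK);
}
```

Both "Mismatch" cases reported earlier in the thread, deq == 0 and deq set
to some other wrong value, would make this return false.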
>
> I'm happy to gather logs with more debug or run other experiments, if that
> would be helpful. As it is I don't really know how to debug the internal
> state of the HC further, but hopefully the knowledge that the patch below
> makes a difference will help us move forward.
Great, thanks. It will take some time before the rewrite is ready.
-Mathias