* RE: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
@ 2021-05-19 16:14 Guido Kiener
2021-05-19 17:35 ` Alan Stern
2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
0 siblings, 2 replies; 11+ messages in thread
From: Guido Kiener @ 2021-05-19 16:14 UTC (permalink / raw)
To: Alan Stern, dave penkler
Cc: Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
x86
> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > >
> > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > When the host driver detects a protocol error while processing an
> > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > as halted.
> > >
> > > Not true. It does not mark the endpoint as halted, not unless it
> > > receives a STALL handshake from the device. A STALL is not a
> > > protocol error.
> > >
> > > > When the class driver resubmits the URB and the if the host driver
> > > > finds the endpoint still marked as halted it should return EPIPE
> > > > status on the resubmitted URB
> > >
> > > Irrelevant.
> > Not at all. The point is that when an application is talking to an
> > instrument over the usbtmc driver, the underlying host controller and
> > its driver will detect and silence a babbling endpoint.
>
> No, they won't. That is, they will detect a babble error and return an error status, but
> they won't silence the endpoint. What makes you think they will?
Maybe there is a misunderstanding. I guess that Dave wanted to propose:
"EPROTO is a link level issue and needs to be handled by the host driver.
When the host driver detects a protocol error while processing an
URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
as halted."
Is this a realistic fix for all host drivers?
-Guido
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener
@ 2021-05-19 17:35 ` Alan Stern
2021-05-19 19:38 ` Thinh Nguyen
2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
1 sibling, 1 reply; 11+ messages in thread
From: Alan Stern @ 2021-05-19 17:35 UTC (permalink / raw)
To: Guido Kiener
Cc: dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman,
lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo,
syzkaller-bugs, tglx, x86
On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
> > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > >
> > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > > When the host driver detects a protocol error while processing an
> > > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > > as halted.
> > > >
> > > > Not true. It does not mark the endpoint as halted, not unless it
> > > > receives a STALL handshake from the device. A STALL is not a
> > > > protocol error.
> > > >
> > > > > When the class driver resubmits the URB and the if the host driver
> > > > > finds the endpoint still marked as halted it should return EPIPE
> > > > > status on the resubmitted URB
> > > >
> > > > Irrelevant.
> > > Not at all. The point is that when an application is talking to an
> > > instrument over the usbtmc driver, the underlying host controller and
> > > its driver will detect and silence a babbling endpoint.
> >
> > No, they won't. That is, they will detect a babble error and return an error status, but
> > they won't silence the endpoint. What makes you think they will?
>
> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
> "EPROTO is a link level issue and needs to be handled by the host driver.
> When the host driver detects a protocol error while processing an
> URB it SHOULD complete the URB with EPROTO status
The host controller drivers _do_ complete URBs with -EPROTO (or similar)
status when a link-level error occurs...
> and SHOULD mark the endpoint
> as halted."
but they don't mark the endpoint as halted. Even if they did, it
wouldn't fix anything because the kernel allows URBs to be submitted to
halted endpoints. In fact, it doesn't even keep track of which
endpoints are or are not halted.
> Is this a realistic fix for all host drivers?
No, it isn't.
An endpoint shouldn't be marked as halted unless it really is halted.
Otherwise a driver might be tempted to clear the Halt feature, and
some devices do not like to receive a Clear-Halt request for an endpoint
that isn't halted.
What we could do is what you suggested earlier: Note the fact that the
endpoint is in some sort of fault condition and disallow further
communication with the endpoint until the fault condition has been
cleared. (It isn't entirely obvious exactly what actions should clear
such a fault... I guess resetting or re-enabling the endpoint, or
resetting the entire device.)
Alan Stern
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener
2021-05-19 17:35 ` Alan Stern
@ 2021-05-19 18:04 ` Lee Jones
1 sibling, 0 replies; 11+ messages in thread
From: Lee Jones @ 2021-05-19 18:04 UTC (permalink / raw)
To: Guido Kiener
Cc: Alan Stern, dave penkler, Dmitry Vyukov, syzbot,
Greg Kroah-Hartman, USB list, bp, dwmw, hpa, linux-kernel, luto,
mingo, syzkaller-bugs, tglx, x86
On Wed, 19 May 2021, Guido Kiener wrote:
> > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > >
> > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > > When the host driver detects a protocol error while processing an
> > > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > > as halted.
> > > >
> > > > Not true. It does not mark the endpoint as halted, not unless it
> > > > receives a STALL handshake from the device. A STALL is not a
> > > > protocol error.
> > > >
> > > > > When the class driver resubmits the URB and the if the host driver
> > > > > finds the endpoint still marked as halted it should return EPIPE
> > > > > status on the resubmitted URB
> > > >
> > > > Irrelevant.
> > > Not at all. The point is that when an application is talking to an
> > > instrument over the usbtmc driver, the underlying host controller and
> > > its driver will detect and silence a babbling endpoint.
> >
> > No, they won't. That is, they will detect a babble error and return an error status, but
> > they won't silence the endpoint. What makes you think they will?
>
> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
> "EPROTO is a link level issue and needs to be handled by the host driver.
> When the host driver detects a protocol error while processing an
> URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
> as halted."
> Is this a realistic fix for all host drivers?
>
> -Guido
Guido, would you mind taking a look at your mailer settings please? I
now have >=7 threads running through my inbox with the same subject.
For some reason your mailer is insisting on creating a new one for
each of your replies.
It's also adding odd "re: re: re: ..." prefixes.
TIA
--
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-19 17:35 ` Alan Stern
@ 2021-05-19 19:38 ` Thinh Nguyen
2021-05-20 2:01 ` Alan Stern
0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-19 19:38 UTC (permalink / raw)
To: Alan Stern, Guido Kiener
Cc: dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman,
lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo,
syzkaller-bugs, tglx, x86
Alan Stern wrote:
> On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
>>> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
>>>> On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>
>>>>> On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
>>>>>> When the host driver detects a protocol error while processing an
>>>>>> URB it completes the URB with EPROTO status and marks the endpoint
>>>>>> as halted.
>>>>>
>>>>> Not true. It does not mark the endpoint as halted, not unless it
>>>>> receives a STALL handshake from the device. A STALL is not a
>>>>> protocol error.
>>>>>
>>>>>> When the class driver resubmits the URB and the if the host driver
>>>>>> finds the endpoint still marked as halted it should return EPIPE
>>>>>> status on the resubmitted URB
>>>>>
>>>>> Irrelevant.
>>>> Not at all. The point is that when an application is talking to an
>>>> instrument over the usbtmc driver, the underlying host controller and
>>>> its driver will detect and silence a babbling endpoint.
>>>
>>> No, they won't. That is, they will detect a babble error and return an error status, but
>>> they won't silence the endpoint. What makes you think they will?
>>
>> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
>> "EPROTO is a link level issue and needs to be handled by the host driver.
>> When the host driver detects a protocol error while processing an
>> URB it SHOULD complete the URB with EPROTO status
>
> The host controller drivers _do_ complete URBs with -EPROTO (or similar)
> status when a link-level error occurs...
>
>> and SHOULD mark the endpoint
>> as halted."
>
> but they don't mark the endpoint as halted. Even if they did, it
> wouldn't fix anything because the kernel allows URBs to be submitted to
> halted endpoints. In fact, it doesn't even keep track of which
> endpoints are or are not halted.
>
>> Is this a realistic fix for all host drivers?
>
> No, it isn't.
>
> An endpoint shouldn't be marked as halted unless it really is halted.
> Otherwise a driver might be tempted to clear the Halt feature, and
> some devices do not like to receive a Clear-Halt request for an endpoint
> that isn't halted.
>
> What we could do is what you suggested earlier: Note the fact that the
> endpoint is in some sort of fault condition and disallow further
> communication with the endpoint until the fault condition has been
> cleared. (It isn't entirely obvious exactly what actions should clear
> such a fault... I guess resetting or re-enabling the endpoint, or
> resetting the entire device.)
>
> Alan Stern
>
Hi Alan,
Sorry if this diverges from the thread, but I've been wondering whether
to add a change for this also.
For xHCI hosts, after transactions errors, the endpoint will enter
halted state. The driver will attempt a few soft-retries before giving
up. According to the xHCI spec (section 4.6.8), a host may send a
ClearFeature(endpoint_halt) to recover and restart the transfer (see
"reset a pipe" in xhci spec), and the class driver can handle this after
receiving something like -EPROTO from xhci.
However, as you've pointed out, some devices don't like
ClearFeature(ep_halt) and may not properly synchronize with the host on
where it should restart.
Some OS (such as Windows) do this. Not sure if we also want this?
Currently the recovery is just a timeout and a port reset from the class
driver, but the timeout is usually defaulted to a long time (e.g. 30
seconds for storage class driver).
Thanks,
Thinh
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-19 19:38 ` Thinh Nguyen
@ 2021-05-20 2:01 ` Alan Stern
2021-05-20 20:30 ` Thinh Nguyen
0 siblings, 1 reply; 11+ messages in thread
From: Alan Stern @ 2021-05-20 2:01 UTC (permalink / raw)
To: Thinh Nguyen
Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86
On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
> Hi Alan,
>
> Sorry if this diverges from the thread, but I've been wondering whether
> to add a change for this also.
>
> For xHCI hosts, after transactions errors, the endpoint will enter
> halted state.
No. You are misreading the xHCI spec. Section 4.6.8 says:
... the state of the associated Endpoint Context is set to
Halted...
Note this carefully. It says "Endpoint Context", not "endpoint".
The endpoint is part of the device, whereas the endpoint context is part
of the host controller. The device doesn't know when a transaction
error has occurred; consequently such errors do not affect the endpoint.
The host controller does know, and consequently such errors do affect
the endpoint context.
> The driver will attempt a few soft-retries before giving
> up. According to the xHCI spec (section 4.6.8), a host may send a
> ClearFeature(endpoint_halt) to recover and restart the transfer (see
Not quite. The section of the spec you're talking about says:
Software shall execute the following sequence to “reset a
pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to
device.
It does not say the host controller will do this; it says that software
will do it.
> "reset a pipe" in xhci spec), and the class driver can handle this after
> receiving something like -EPROTO from xhci.
>
> However, as you've pointed out, some devices don't like
> ClearFeature(ep_halt) and may not properly synchronize with the host on
> where it should restart.
>
> Some OS (such as Windows) do this. Not sure if we also want this?
In general we should do the same thing as Windows does, because most
hardware designers test their equipment on Windows systems but
relatively few test on Linux systems.
> Currently the recovery is just a timeout and a port reset from the class
This depends on the driver. Some perform no recovery at all.
> driver, but the timeout is usually defaulted to a long time (e.g. 30
> seconds for storage class driver).
That 30-second timeout in the mass-storage driver applies in situations
where a command fails to complete, not in situations where it completes
quickly but with a -EPROTO or -EPIPE error.
The fact is that only a small percentage of -EPROTO errors are
recoverable. Some of them can be handled by a port reset, which can be
pretty awkward to perform but does occasionally work. A lot of them
occur because the USB cable has been unplugged; obviously there's no way
to recover from that. With only a few exceptions, the best and simplest
approach is not to try to recover at all.
For the case in question (the syzbot bug report that started this
thread), the class driver doesn't try to perform any recovery. It just
resubmits the URB, getting into a tight retry loop which consumes too
much CPU time. Simply giving up would be preferable.
Alan Stern
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-20 2:01 ` Alan Stern
@ 2021-05-20 20:30 ` Thinh Nguyen
2021-05-24 15:18 ` Mathias Nyman
0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-20 20:30 UTC (permalink / raw)
To: Alan Stern, Thinh Nguyen, Mathias Nyman
Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86
+Mathias
Alan Stern wrote:
> On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
>> Hi Alan,
>>
>> Sorry if this diverges from the thread, but I've been wondering whether
>> to add a change for this also.
>>
>> For xHCI hosts, after transactions errors, the endpoint will enter
>> halted state.
>
> No. You are misreading the xHCI spec. Section 4.6.8 says:
>
> ... the state of the associated Endpoint Context is set to
> Halted...
>
> Note this carefully. It says "Endpoint Context", not "endpoint".
>
> The endpoint is part of the device, whereas the endpoint context is part
> of the host controller. The device doesn't know when a transaction
> error has occurred; consequently such errors do not affect the endpoint.
> The host controller does know, and consequently such errors do affect
> the endpoint context.
>
You're right, my mistake here.
>> The driver will attempt a few soft-retries before giving
>> up. According to the xHCI spec (section 4.6.8), a host may send a
>> ClearFeature(endpoint_halt) to recover and restart the transfer (see
>
> Not quite. The section of the spec you're talking about says:
>
> Software shall execute the following sequence to “reset a
> pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to
> device.
>
> It does not say the host controller will do this; it says that software
> will do it.
Sorry for being unclear. I meant from the class driver, see my next
sentence.
>
>> "reset a pipe" in xhci spec), and the class driver can handle this after
>> receiving something like -EPROTO from xhci.
>>
>> However, as you've pointed out, some devices don't like
>> ClearFeature(ep_halt) and may not properly synchronize with the host on
>> where it should restart.
>>
>> Some OS (such as Windows) do this. Not sure if we also want this?
>
> In general we should do the same thing as Windows does, because most
> hardware designers test their equipment on Windows systems but
> relatively few test on Linux systems.
>
>> Currently the recovery is just a timeout and a port reset from the class
>
> This depends on the driver. Some perform no recovery at all.
>
>> driver, but the timeout is usually defaulted to a long time (e.g. 30
>> seconds for storage class driver).
>
> That 30-second timeout in the mass-storage driver applies in situations
> where a command fails to complete, not in situations where it completes
> quickly but with a -EPROTO or -EPIPE error.
Hm... looks like we have a couple of issues in the uas storage class
driver and the xhci driver.
We may need to fix that in the uas storage driver because it doesn't
seem to handle it. (check uas_data_cmplt() in uas.c).
As for the xhci driver, there maybe a case where the stream URB never
gets to complete because the transaction err_count is not properly
updated. The err_count for transaction error is stored in ep_ring, but
the xhci driver may not be able to lookup the correct ep_ring based on
TRB address for streams. There are cases for streams where the event
TRBs have their TRB pointer field cleared to '0' (xhci spec section
4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
it automatically does a soft-retry. This is seen from one of our
testings that the driver was repeatedly doing soft-retry until the class
driver timed out.
Hi Mathias, maybe you have some comment on this? Thanks.
>
> The fact is that only a small percentage of -EPROTO errors are
> recoverable. Some of them can be handled by a port reset, which can be
> pretty awkward to perform but does occasionally work. A lot of them
> occur because the USB cable has been unplugged; obviously there's no way
> to recover from that. With only a few exceptions, the best and simplest
> approach is not to try to recover at all.
If the cable is unplugged, then we should get a connection change event
and the driver can handle it properly.
Yes, it's probably simplest to do a port reset and let the transfer be
incomplete/corrupted. However, I think we should give
ClearFeature(ep_halt) some more thoughts as I think it can be a recovery
mechanism for storage class driver, even though that it may not be
foolproof.
>
> For the case in question (the syzbot bug report that started this
> thread), the class driver doesn't try to perform any recovery. It just
> resubmits the URB, getting into a tight retry loop which consumes too
> much CPU time. Simply giving up would be preferable.
>
> Alan Stern
>
I see. By giving up, you mean doing port reset right? Otherwise it needs
some other mechanism to synchronize with the device side.
Thanks,
Thinh
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-20 20:30 ` Thinh Nguyen
@ 2021-05-24 15:18 ` Mathias Nyman
2021-05-24 18:55 ` Alan Stern
0 siblings, 1 reply; 11+ messages in thread
From: Mathias Nyman @ 2021-05-24 15:18 UTC (permalink / raw)
To: Thinh Nguyen, Alan Stern, Mathias Nyman
Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86
On 20.5.2021 23.30, Thinh Nguyen wrote:
> +Mathias
>
...
> Hm... looks like we have a couple of issues in the uas storage class
> driver and the xhci driver.
>
> We may need to fix that in the uas storage driver because it doesn't
> seem to handle it. (check uas_data_cmplt() in uas.c).
>
> As for the xhci driver, there maybe a case where the stream URB never
> gets to complete because the transaction err_count is not properly
> updated. The err_count for transaction error is stored in ep_ring, but
> the xhci driver may not be able to lookup the correct ep_ring based on
> TRB address for streams. There are cases for streams where the event
> TRBs have their TRB pointer field cleared to '0' (xhci spec section
> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> it automatically does a soft-retry. This is seen from one of our
> testings that the driver was repeatedly doing soft-retry until the class
> driver timed out.
>
> Hi Mathias, maybe you have some comment on this? Thanks.
This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
We should add one and prevent a loop. after e few soft resets we can end with a
hard reset to clear the host side endpoint halt.
We don't know the URB that was being tansferred during the error, and can't
give it back with a proper error code.
In that sense we still end up waiting for a timeout and someone to cancel
the urb.
-Mathias
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-24 15:18 ` Mathias Nyman
@ 2021-05-24 18:55 ` Alan Stern
2021-05-24 19:23 ` Thinh Nguyen
0 siblings, 1 reply; 11+ messages in thread
From: Alan Stern @ 2021-05-24 18:55 UTC (permalink / raw)
To: Mathias Nyman
Cc: Thinh Nguyen, Mathias Nyman, Guido Kiener, dave penkler,
Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
x86
On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
> On 20.5.2021 23.30, Thinh Nguyen wrote:
> > As for the xhci driver, there maybe a case where the stream URB never
> > gets to complete because the transaction err_count is not properly
> > updated. The err_count for transaction error is stored in ep_ring, but
> > the xhci driver may not be able to lookup the correct ep_ring based on
> > TRB address for streams. There are cases for streams where the event
> > TRBs have their TRB pointer field cleared to '0' (xhci spec section
> > 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> > it automatically does a soft-retry. This is seen from one of our
> > testings that the driver was repeatedly doing soft-retry until the class
> > driver timed out.
> >
> > Hi Mathias, maybe you have some comment on this? Thanks.
>
> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
> We should add one and prevent a loop. after e few soft resets we can end with a
> hard reset to clear the host side endpoint halt.
>
> We don't know the URB that was being tansferred during the error, and can't
> give it back with a proper error code.
> In that sense we still end up waiting for a timeout and someone to cancel
> the urb.
That's not good. There may not be a timeout; drivers expect transfers
to complete with a failure, not to be retried indefinitely.
However, if you do know which endpoint/stream the error is connected to,
you should be able to get the URB. It will be the first one queued for
that endpoint/stream.
Alan Stern
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-24 18:55 ` Alan Stern
@ 2021-05-24 19:23 ` Thinh Nguyen
2021-05-24 22:16 ` Mathias Nyman
0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-24 19:23 UTC (permalink / raw)
To: Alan Stern, Mathias Nyman
Cc: Thinh Nguyen, Mathias Nyman, Guido Kiener, dave penkler,
Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
x86
Alan Stern wrote:
> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>> As for the xhci driver, there maybe a case where the stream URB never
>>> gets to complete because the transaction err_count is not properly
>>> updated. The err_count for transaction error is stored in ep_ring, but
>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>> TRB address for streams. There are cases for streams where the event
>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>> it automatically does a soft-retry. This is seen from one of our
>>> testings that the driver was repeatedly doing soft-retry until the class
>>> driver timed out.
>>>
>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>
>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>> We should add one and prevent a loop. after e few soft resets we can end with a
>> hard reset to clear the host side endpoint halt.
>>
>> We don't know the URB that was being tansferred during the error, and can't
>> give it back with a proper error code.
>> In that sense we still end up waiting for a timeout and someone to cancel
>> the urb.
>
> That's not good. There may not be a timeout; drivers expect transfers
> to complete with a failure, not to be retried indefinitely.
>
> However, if you do know which endpoint/stream the error is connected to,
> you should be able to get the URB. It will be the first one queued for
> that endpoint/stream.
>
When the xhci can't recover a transfer with soft-retry, no outstanding
transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
we just don't know which stream or endpoint ring it's for, but we know
all the outstanding URBs of an endpoint. Let's may as well return an
error status for all of them after a limited number of soft-retries.
BR,
Thinh
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-24 19:23 ` Thinh Nguyen
@ 2021-05-24 22:16 ` Mathias Nyman
2021-05-24 22:48 ` Thinh Nguyen
0 siblings, 1 reply; 11+ messages in thread
From: Mathias Nyman @ 2021-05-24 22:16 UTC (permalink / raw)
To: Thinh Nguyen, Alan Stern
Cc: Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86
On 24.5.2021 22.23, Thinh Nguyen wrote:
> Alan Stern wrote:
>> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>>> As for the xhci driver, there maybe a case where the stream URB never
>>>> gets to complete because the transaction err_count is not properly
>>>> updated. The err_count for transaction error is stored in ep_ring, but
>>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>>> TRB address for streams. There are cases for streams where the event
>>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>>> it automatically does a soft-retry. This is seen from one of our
>>>> testings that the driver was repeatedly doing soft-retry until the class
>>>> driver timed out.
>>>>
>>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>>
>>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>>> We should add one and prevent a loop. after e few soft resets we can end with a
>>> hard reset to clear the host side endpoint halt.
>>>
>>> We don't know the URB that was being tansferred during the error, and can't
>>> give it back with a proper error code.
>>> In that sense we still end up waiting for a timeout and someone to cancel
>>> the urb.
>>
>> That's not good. There may not be a timeout; drivers expect transfers
>> to complete with a failure, not to be retried indefinitely.
>>
>> However, if you do know which endpoint/stream the error is connected to,
>> you should be able to get the URB. It will be the first one queued for
>> that endpoint/stream.
>>
>
> When the xhci can't recover a transfer with soft-retry, no outstanding
> transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
> we just don't know which stream or endpoint ring it's for, but we know
> all the outstanding URBs of an endpoint. Let's may as well return an
> error status for all of them after a limited number of soft-retries.
We get the endpoint, but not the stream.
I guess we could walk through each stream of this endpoint, and return the
first URB of every stream that has a pending URB.
xHCI spec claims to supports 65533 streams per endpoint, but in real life
UAS probably only uses a few per endpoint?
-Mathias
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx
2021-05-24 22:16 ` Mathias Nyman
@ 2021-05-24 22:48 ` Thinh Nguyen
0 siblings, 0 replies; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-24 22:48 UTC (permalink / raw)
To: Mathias Nyman, Thinh Nguyen, Alan Stern
Cc: Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86
Mathias Nyman wrote:
> On 24.5.2021 22.23, Thinh Nguyen wrote:
>> Alan Stern wrote:
>>> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>>>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>>>> As for the xhci driver, there maybe a case where the stream URB never
>>>>> gets to complete because the transaction err_count is not properly
>>>>> updated. The err_count for transaction error is stored in ep_ring, but
>>>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>>>> TRB address for streams. There are cases for streams where the event
>>>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>>>> it automatically does a soft-retry. This is seen from one of our
>>>>> testings that the driver was repeatedly doing soft-retry until the class
>>>>> driver timed out.
>>>>>
>>>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>>>
>>>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>>>> We should add one and prevent a loop. after e few soft resets we can end with a
>>>> hard reset to clear the host side endpoint halt.
>>>>
>>>> We don't know the URB that was being tansferred during the error, and can't
>>>> give it back with a proper error code.
>>>> In that sense we still end up waiting for a timeout and someone to cancel
>>>> the urb.
>>>
>>> That's not good. There may not be a timeout; drivers expect transfers
>>> to complete with a failure, not to be retried indefinitely.
>>>
>>> However, if you do know which endpoint/stream the error is connected to,
>>> you should be able to get the URB. It will be the first one queued for
>>> that endpoint/stream.
>>>
>>
>> When the xhci can't recover a transfer with soft-retry, no outstanding
>> transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
>> we just don't know which stream or endpoint ring it's for, but we know
>> all the outstanding URBs of an endpoint. Let's may as well return an
>> error status for all of them after a limited number of soft-retries.
>
> We get the endpoint, but not the stream.
Right.
>
> I guess we could walk through each stream of this endpoint, and return the
> first URB of every stream that has a pending URB.
> xHCI spec claims to supports 65533 streams per endpoint, but in real life
> UAS probably only uses a few per endpoint?
>
> -Mathias
>
Typically UASP devices advertise to support up to 32 streams. We notice
that some newer builds of Windows OS has a bug (or intentional?) that it
rejects any device that uses more or less than 32 streams (probably a
bug) in the descriptor.
I think we only need to do this if we don't know which stream the event
belongs to. Otherwise, we can keep the old logic.
BR,
Thinh
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2021-05-24 22:49 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener
2021-05-19 17:35 ` Alan Stern
2021-05-19 19:38 ` Thinh Nguyen
2021-05-20 2:01 ` Alan Stern
2021-05-20 20:30 ` Thinh Nguyen
2021-05-24 15:18 ` Mathias Nyman
2021-05-24 18:55 ` Alan Stern
2021-05-24 19:23 ` Thinh Nguyen
2021-05-24 22:16 ` Mathias Nyman
2021-05-24 22:48 ` Thinh Nguyen
2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).