RE: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* RE: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
@ 2021-05-19 16:14 Guido Kiener
  2021-05-19 17:35 ` Alan Stern
  2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
  0 siblings, 2 replies; 11+ messages in thread
From: Guido Kiener @ 2021-05-19 16:14 UTC (permalink / raw)
  To: Alan Stern, dave penkler
  Cc: Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
	bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
	x86

> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > >
> > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > When the host driver detects a protocol error while processing an
> > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > as halted.
> > >
> > > Not true.  It does not mark the endpoint as halted, not unless it
> > > receives a STALL handshake from the device.  A STALL is not a
> > > protocol error.
> > >
> > > > When the class driver resubmits the URB and the if the host driver
> > > > finds the endpoint still marked as halted it should return EPIPE
> > > > status on the resubmitted URB
> > >
> > > Irrelevant.
> > Not at all. The point is that when an application is talking to an
> > instrument over the usbtmc driver, the underlying host controller and
> > its driver will detect and silence a babbling endpoint.
> 
> No, they won't.  That is, they will detect a babble error and return an error status, but
> they won't silence the endpoint.  What makes you think they will?

Maybe there is a misunderstanding. I guess that Dave wanted to propose:
"EPROTO is a link level issue and needs to be handled by the host driver.
When the host driver detects a protocol error while processing an
URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
as halted."
Is this a realistic fix for all host drivers?

-Guido

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener
@ 2021-05-19 17:35 ` Alan Stern
  2021-05-19 19:38   ` Thinh Nguyen
  2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
  1 sibling, 1 reply; 11+ messages in thread
From: Alan Stern @ 2021-05-19 17:35 UTC (permalink / raw)
  To: Guido Kiener
  Cc: dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman,
	lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo,
	syzkaller-bugs, tglx, x86

On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
> > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > >
> > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > > When the host driver detects a protocol error while processing an
> > > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > > as halted.
> > > >
> > > > Not true.  It does not mark the endpoint as halted, not unless it
> > > > receives a STALL handshake from the device.  A STALL is not a
> > > > protocol error.
> > > >
> > > > > When the class driver resubmits the URB and the if the host driver
> > > > > finds the endpoint still marked as halted it should return EPIPE
> > > > > status on the resubmitted URB
> > > >
> > > > Irrelevant.
> > > Not at all. The point is that when an application is talking to an
> > > instrument over the usbtmc driver, the underlying host controller and
> > > its driver will detect and silence a babbling endpoint.
> > 
> > No, they won't.  That is, they will detect a babble error and return an error status, but
> > they won't silence the endpoint.  What makes you think they will?
> 
> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
> "EPROTO is a link level issue and needs to be handled by the host driver.
> When the host driver detects a protocol error while processing an
> URB it SHOULD complete the URB with EPROTO status

The host controller drivers _do_ complete URBs with -EPROTO (or similar) 
status when a link-level error occurs...

> and SHOULD mark the endpoint
> as halted."

but they don't mark the endpoint as halted.  Even if they did, it 
wouldn't fix anything because the kernel allows URBs to be submitted to 
halted endpoints.  In fact, it doesn't even keep track of which 
endpoints are or are not halted.

> Is this a realistic fix for all host drivers?

No, it isn't.

An endpoint shouldn't be marked as halted unless it really is halted.  
Otherwise a driver might be tempted to clear the Halt feature, and 
some devices do not like to receive a Clear-Halt request for an endpoint 
that isn't halted.

What we could do is what you suggested earlier: Note the fact that the 
endpoint is in some sort of fault condition and disallow further 
communication with the endpoint until the fault condition has been 
cleared.  (It isn't entirely obvious exactly what actions should clear 
such a fault...  I guess resetting or re-enabling the endpoint, or 
resetting the entire device.)

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener
  2021-05-19 17:35 ` Alan Stern
@ 2021-05-19 18:04 ` Lee Jones
  1 sibling, 0 replies; 11+ messages in thread
From: Lee Jones @ 2021-05-19 18:04 UTC (permalink / raw)
  To: Guido Kiener
  Cc: Alan Stern, dave penkler, Dmitry Vyukov, syzbot,
	Greg Kroah-Hartman, USB list, bp, dwmw, hpa, linux-kernel, luto,
	mingo, syzkaller-bugs, tglx, x86

On Wed, 19 May 2021, Guido Kiener wrote:

> > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > > >
> > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > > When the host driver detects a protocol error while processing an
> > > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > > as halted.
> > > >
> > > > Not true.  It does not mark the endpoint as halted, not unless it
> > > > receives a STALL handshake from the device.  A STALL is not a
> > > > protocol error.
> > > >
> > > > > When the class driver resubmits the URB and the if the host driver
> > > > > finds the endpoint still marked as halted it should return EPIPE
> > > > > status on the resubmitted URB
> > > >
> > > > Irrelevant.
> > > Not at all. The point is that when an application is talking to an
> > > instrument over the usbtmc driver, the underlying host controller and
> > > its driver will detect and silence a babbling endpoint.
> > 
> > No, they won't.  That is, they will detect a babble error and return an error status, but
> > they won't silence the endpoint.  What makes you think they will?
> 
> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
> "EPROTO is a link level issue and needs to be handled by the host driver.
> When the host driver detects a protocol error while processing an
> URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
> as halted."
> Is this a realistic fix for all host drivers?
> 
> -Guido

Guido, would you mind taking a look at your mailer settings please?  I
now have >=7 threads running through my inbox with the same subject.
For some reason your mailer is insisting on creating a new one for
each of your replies.

It's also adding odd "re: re: re: ..." prefixes.

TIA

-- 
Lee Jones [李琼斯]
Senior Technical Lead - Developer Services
Linaro.org │ Open source software for Arm SoCs
Follow Linaro: Facebook | Twitter | Blog

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-19 17:35 ` Alan Stern
@ 2021-05-19 19:38   ` Thinh Nguyen
  2021-05-20  2:01     ` Alan Stern
  0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-19 19:38 UTC (permalink / raw)
  To: Alan Stern, Guido Kiener
  Cc: dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman,
	lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo,
	syzkaller-bugs, tglx, x86

Alan Stern wrote:
> On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote:
>>> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
>>>> On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
>>>>>
>>>>> On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
>>>>>> When the host driver detects a protocol error while processing an
>>>>>> URB it completes the URB with EPROTO status and marks the endpoint
>>>>>> as halted.
>>>>>
>>>>> Not true.  It does not mark the endpoint as halted, not unless it
>>>>> receives a STALL handshake from the device.  A STALL is not a
>>>>> protocol error.
>>>>>
>>>>>> When the class driver resubmits the URB and the if the host driver
>>>>>> finds the endpoint still marked as halted it should return EPIPE
>>>>>> status on the resubmitted URB
>>>>>
>>>>> Irrelevant.
>>>> Not at all. The point is that when an application is talking to an
>>>> instrument over the usbtmc driver, the underlying host controller and
>>>> its driver will detect and silence a babbling endpoint.
>>>
>>> No, they won't.  That is, they will detect a babble error and return an error status, but
>>> they won't silence the endpoint.  What makes you think they will?
>>
>> Maybe there is a misunderstanding. I guess that Dave wanted to propose:
>> "EPROTO is a link level issue and needs to be handled by the host driver.
>> When the host driver detects a protocol error while processing an
>> URB it SHOULD complete the URB with EPROTO status
> 
> The host controller drivers _do_ complete URBs with -EPROTO (or similar) 
> status when a link-level error occurs...
> 
>> and SHOULD mark the endpoint
>> as halted."
> 
> but they don't mark the endpoint as halted.  Even if they did, it 
> wouldn't fix anything because the kernel allows URBs to be submitted to 
> halted endpoints.  In fact, it doesn't even keep track of which 
> endpoints are or are not halted.
> 
>> Is this a realistic fix for all host drivers?
> 
> No, it isn't.
> 
> An endpoint shouldn't be marked as halted unless it really is halted.  
> Otherwise a driver might be tempted to clear the Halt feature, and 
> some devices do not like to receive a Clear-Halt request for an endpoint 
> that isn't halted.
> 
> What we could do is what you suggested earlier: Note the fact that the 
> endpoint is in some sort of fault condition and disallow further 
> communication with the endpoint until the fault condition has been 
> cleared.  (It isn't entirely obvious exactly what actions should clear 
> such a fault...  I guess resetting or re-enabling the endpoint, or 
> resetting the entire device.)
> 
> Alan Stern
> 

Hi Alan,

Sorry if this diverges from the thread, but I've been wondering whether
to add a change for this also.

For xHCI hosts, after transactions errors, the endpoint will enter
halted state. The driver will attempt a few soft-retries before giving
up. According to the xHCI spec (section 4.6.8), a host may send a
ClearFeature(endpoint_halt) to recover and restart the transfer (see
"reset a pipe" in xhci spec), and the class driver can handle this after
receiving something like -EPROTO from xhci.

However, as you've pointed out, some devices don't like
ClearFeature(ep_halt) and may not properly synchronize with the host on
where it should restart.

Some OS (such as Windows) do this. Not sure if we also want this?
Currently the recovery is just a timeout and a port reset from the class
driver, but the timeout is usually defaulted to a long time (e.g. 30
seconds for storage class driver).

Thanks,
Thinh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-19 19:38   ` Thinh Nguyen
@ 2021-05-20  2:01     ` Alan Stern
  2021-05-20 20:30       ` Thinh Nguyen
  0 siblings, 1 reply; 11+ messages in thread
From: Alan Stern @ 2021-05-20  2:01 UTC (permalink / raw)
  To: Thinh Nguyen
  Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
	Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
	linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86

On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
> Hi Alan,
> 
> Sorry if this diverges from the thread, but I've been wondering whether
> to add a change for this also.
> 
> For xHCI hosts, after transactions errors, the endpoint will enter
> halted state.

No.  You are misreading the xHCI spec.  Section 4.6.8 says:

	... the state of the associated Endpoint Context is set to 
	Halted...

Note this carefully.  It says "Endpoint Context", not "endpoint".

The endpoint is part of the device, whereas the endpoint context is part 
of the host controller.  The device doesn't know when a transaction 
error has occurred; consequently such errors do not affect the endpoint.  
The host controller does know, and consequently such errors do affect 
the endpoint context.

> The driver will attempt a few soft-retries before giving
> up. According to the xHCI spec (section 4.6.8), a host may send a
> ClearFeature(endpoint_halt) to recover and restart the transfer (see

Not quite.  The section of the spec you're talking about says:

	Software shall execute the following sequence to “reset a 
	pipe”....  Issue a ClearFeature(ENDPOINT_HALT) request to 
	device.

It does not say the host controller will do this; it says that software 
will do it.

> "reset a pipe" in xhci spec), and the class driver can handle this after
> receiving something like -EPROTO from xhci.
> 
> However, as you've pointed out, some devices don't like
> ClearFeature(ep_halt) and may not properly synchronize with the host on
> where it should restart.
> 
> Some OS (such as Windows) do this. Not sure if we also want this?

In general we should do the same thing as Windows does, because most 
hardware designers test their equipment on Windows systems but 
relatively few test on Linux systems.

> Currently the recovery is just a timeout and a port reset from the class

This depends on the driver.  Some perform no recovery at all.

> driver, but the timeout is usually defaulted to a long time (e.g. 30
> seconds for storage class driver).

That 30-second timeout in the mass-storage driver applies in situations 
where a command fails to complete, not in situations where it completes 
quickly but with a -EPROTO or -EPIPE error.

The fact is that only a small percentage of -EPROTO errors are 
recoverable.  Some of them can be handled by a port reset, which can be 
pretty awkward to perform but does occasionally work.  A lot of them 
occur because the USB cable has been unplugged; obviously there's no way 
to recover from that.  With only a few exceptions, the best and simplest 
approach is not to try to recover at all.

For the case in question (the syzbot bug report that started this 
thread), the class driver doesn't try to perform any recovery.  It just 
resubmits the URB, getting into a tight retry loop which consumes too 
much CPU time.  Simply giving up would be preferable.

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-20  2:01     ` Alan Stern
@ 2021-05-20 20:30       ` Thinh Nguyen
  2021-05-24 15:18         ` Mathias Nyman
  0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-20 20:30 UTC (permalink / raw)
  To: Alan Stern, Thinh Nguyen, Mathias Nyman
  Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
	Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
	linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86

+Mathias

Alan Stern wrote:
> On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote:
>> Hi Alan,
>>
>> Sorry if this diverges from the thread, but I've been wondering whether
>> to add a change for this also.
>>
>> For xHCI hosts, after transactions errors, the endpoint will enter
>> halted state.
> 
> No.  You are misreading the xHCI spec.  Section 4.6.8 says:
> 
> 	... the state of the associated Endpoint Context is set to 
> 	Halted...
> 
> Note this carefully.  It says "Endpoint Context", not "endpoint".
> 
> The endpoint is part of the device, whereas the endpoint context is part 
> of the host controller.  The device doesn't know when a transaction 
> error has occurred; consequently such errors do not affect the endpoint.  
> The host controller does know, and consequently such errors do affect 
> the endpoint context.
> 

You're right, my mistake here.

>> The driver will attempt a few soft-retries before giving
>> up. According to the xHCI spec (section 4.6.8), a host may send a
>> ClearFeature(endpoint_halt) to recover and restart the transfer (see
> 
> Not quite.  The section of the spec you're talking about says:
> 
> 	Software shall execute the following sequence to “reset a 
> 	pipe”....  Issue a ClearFeature(ENDPOINT_HALT) request to 
> 	device.
> 
> It does not say the host controller will do this; it says that software 
> will do it.

Sorry for being unclear. I meant from the class driver, see my next
sentence.

> 
>> "reset a pipe" in xhci spec), and the class driver can handle this after
>> receiving something like -EPROTO from xhci.
>>
>> However, as you've pointed out, some devices don't like
>> ClearFeature(ep_halt) and may not properly synchronize with the host on
>> where it should restart.
>>
>> Some OS (such as Windows) do this. Not sure if we also want this?
> 
> In general we should do the same thing as Windows does, because most 
> hardware designers test their equipment on Windows systems but 
> relatively few test on Linux systems.
> 
>> Currently the recovery is just a timeout and a port reset from the class
> 
> This depends on the driver.  Some perform no recovery at all.
> 
>> driver, but the timeout is usually defaulted to a long time (e.g. 30
>> seconds for storage class driver).
> 
> That 30-second timeout in the mass-storage driver applies in situations 
> where a command fails to complete, not in situations where it completes 
> quickly but with a -EPROTO or -EPIPE error.

Hm... looks like we have a couple of issues in the uas storage class
driver and the xhci driver.

We may need to fix that in the uas storage driver because it doesn't
seem to handle it. (check uas_data_cmplt() in uas.c).

As for the xhci driver, there maybe a case where the stream URB never
gets to complete because the transaction err_count is not properly
updated. The err_count for transaction error is stored in ep_ring, but
the xhci driver may not be able to lookup the correct ep_ring based on
TRB address for streams. There are cases for streams where the event
TRBs have their TRB pointer field cleared to '0' (xhci spec section
4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
it automatically does a soft-retry. This is seen from one of our
testings that the driver was repeatedly doing soft-retry until the class
driver timed out.

Hi Mathias, maybe you have some comment on this? Thanks.

> 
> The fact is that only a small percentage of -EPROTO errors are 
> recoverable.  Some of them can be handled by a port reset, which can be 
> pretty awkward to perform but does occasionally work.  A lot of them 
> occur because the USB cable has been unplugged; obviously there's no way 
> to recover from that.  With only a few exceptions, the best and simplest 
> approach is not to try to recover at all.

If the cable is unplugged, then we should get a connection change event
and the driver can handle it properly.

Yes, it's probably simplest to do a port reset and let the transfer be
incomplete/corrupted. However, I think we should give
ClearFeature(ep_halt) some more thoughts as I think it can be a recovery
mechanism for storage class driver, even though that it may not be
foolproof.

> 
> For the case in question (the syzbot bug report that started this 
> thread), the class driver doesn't try to perform any recovery.  It just 
> resubmits the URB, getting into a tight retry loop which consumes too 
> much CPU time.  Simply giving up would be preferable.
> 
> Alan Stern
> 

I see. By giving up, you mean doing port reset right? Otherwise it needs
some other mechanism to synchronize with the device side.

Thanks,
Thinh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-20 20:30       ` Thinh Nguyen
@ 2021-05-24 15:18         ` Mathias Nyman
  2021-05-24 18:55           ` Alan Stern
  0 siblings, 1 reply; 11+ messages in thread
From: Mathias Nyman @ 2021-05-24 15:18 UTC (permalink / raw)
  To: Thinh Nguyen, Alan Stern, Mathias Nyman
  Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
	Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
	linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86

On 20.5.2021 23.30, Thinh Nguyen wrote:
> +Mathias
> 
...

> Hm... looks like we have a couple of issues in the uas storage class
> driver and the xhci driver.
> 
> We may need to fix that in the uas storage driver because it doesn't
> seem to handle it. (check uas_data_cmplt() in uas.c).
> 
> As for the xhci driver, there maybe a case where the stream URB never
> gets to complete because the transaction err_count is not properly
> updated. The err_count for transaction error is stored in ep_ring, but
> the xhci driver may not be able to lookup the correct ep_ring based on
> TRB address for streams. There are cases for streams where the event
> TRBs have their TRB pointer field cleared to '0' (xhci spec section
> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> it automatically does a soft-retry. This is seen from one of our
> testings that the driver was repeatedly doing soft-retry until the class
> driver timed out.
> 
> Hi Mathias, maybe you have some comment on this? Thanks.

This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
We should add one and prevent a loop. after e few soft resets we can end with a
hard reset to clear the host side endpoint halt.

We don't know the URB that was being tansferred during the error, and can't 
give it back with a proper error code.
In that sense we still end up waiting for a timeout and someone to cancel
the urb.

-Mathias

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-24 15:18         ` Mathias Nyman
@ 2021-05-24 18:55           ` Alan Stern
  2021-05-24 19:23             ` Thinh Nguyen
  0 siblings, 1 reply; 11+ messages in thread
From: Alan Stern @ 2021-05-24 18:55 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Thinh Nguyen, Mathias Nyman, Guido Kiener, dave penkler,
	Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
	bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
	x86

On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
> On 20.5.2021 23.30, Thinh Nguyen wrote:
> > As for the xhci driver, there maybe a case where the stream URB never
> > gets to complete because the transaction err_count is not properly
> > updated. The err_count for transaction error is stored in ep_ring, but
> > the xhci driver may not be able to lookup the correct ep_ring based on
> > TRB address for streams. There are cases for streams where the event
> > TRBs have their TRB pointer field cleared to '0' (xhci spec section
> > 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
> > it automatically does a soft-retry. This is seen from one of our
> > testings that the driver was repeatedly doing soft-retry until the class
> > driver timed out.
> > 
> > Hi Mathias, maybe you have some comment on this? Thanks.
> 
> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
> We should add one and prevent a loop. after e few soft resets we can end with a
> hard reset to clear the host side endpoint halt.
> 
> We don't know the URB that was being tansferred during the error, and can't 
> give it back with a proper error code.
> In that sense we still end up waiting for a timeout and someone to cancel
> the urb.

That's not good.  There may not be a timeout; drivers expect transfers 
to complete with a failure, not to be retried indefinitely.

However, if you do know which endpoint/stream the error is connected to, 
you should be able to get the URB.  It will be the first one queued for 
that endpoint/stream.

Alan Stern

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-24 18:55           ` Alan Stern
@ 2021-05-24 19:23             ` Thinh Nguyen
  2021-05-24 22:16               ` Mathias Nyman
  0 siblings, 1 reply; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-24 19:23 UTC (permalink / raw)
  To: Alan Stern, Mathias Nyman
  Cc: Thinh Nguyen, Mathias Nyman, Guido Kiener, dave penkler,
	Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
	bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
	x86

Alan Stern wrote:
> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>> As for the xhci driver, there maybe a case where the stream URB never
>>> gets to complete because the transaction err_count is not properly
>>> updated. The err_count for transaction error is stored in ep_ring, but
>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>> TRB address for streams. There are cases for streams where the event
>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>> it automatically does a soft-retry. This is seen from one of our
>>> testings that the driver was repeatedly doing soft-retry until the class
>>> driver timed out.
>>>
>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>
>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>> We should add one and prevent a loop. after e few soft resets we can end with a
>> hard reset to clear the host side endpoint halt.
>>
>> We don't know the URB that was being tansferred during the error, and can't 
>> give it back with a proper error code.
>> In that sense we still end up waiting for a timeout and someone to cancel
>> the urb.
> 
> That's not good.  There may not be a timeout; drivers expect transfers 
> to complete with a failure, not to be retried indefinitely.
> 
> However, if you do know which endpoint/stream the error is connected to, 
> you should be able to get the URB.  It will be the first one queued for 
> that endpoint/stream.
> 

When the xhci can't recover a transfer with soft-retry, no outstanding
transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
we just don't know which stream or endpoint ring it's for, but we know
all the outstanding URBs of an endpoint. Let's may as well return an
error status for all of them after a limited number of soft-retries.

BR,
Thinh

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-24 19:23             ` Thinh Nguyen
@ 2021-05-24 22:16               ` Mathias Nyman
  2021-05-24 22:48                 ` Thinh Nguyen
  0 siblings, 1 reply; 11+ messages in thread
From: Mathias Nyman @ 2021-05-24 22:16 UTC (permalink / raw)
  To: Thinh Nguyen, Alan Stern
  Cc: Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
	Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
	linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86

On 24.5.2021 22.23, Thinh Nguyen wrote:
> Alan Stern wrote:
>> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>>> As for the xhci driver, there maybe a case where the stream URB never
>>>> gets to complete because the transaction err_count is not properly
>>>> updated. The err_count for transaction error is stored in ep_ring, but
>>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>>> TRB address for streams. There are cases for streams where the event
>>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>>> it automatically does a soft-retry. This is seen from one of our
>>>> testings that the driver was repeatedly doing soft-retry until the class
>>>> driver timed out.
>>>>
>>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>>
>>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>>> We should add one and prevent a loop. after e few soft resets we can end with a
>>> hard reset to clear the host side endpoint halt.
>>>
>>> We don't know the URB that was being tansferred during the error, and can't 
>>> give it back with a proper error code.
>>> In that sense we still end up waiting for a timeout and someone to cancel
>>> the urb.
>>
>> That's not good.  There may not be a timeout; drivers expect transfers 
>> to complete with a failure, not to be retried indefinitely.
>>
>> However, if you do know which endpoint/stream the error is connected to, 
>> you should be able to get the URB.  It will be the first one queued for 
>> that endpoint/stream.
>>
> 
> When the xhci can't recover a transfer with soft-retry, no outstanding
> transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
> we just don't know which stream or endpoint ring it's for, but we know
> all the outstanding URBs of an endpoint. Let's may as well return an
> error status for all of them after a limited number of soft-retries.

We get the endpoint, but not the stream.

I guess we could walk through each stream of this endpoint, and return the 
first URB of every stream that has a pending URB.
xHCI spec claims to supports 65533 streams per endpoint, but in real life 
UAS probably only uses a few per endpoint?

-Mathias 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [syzbot] INFO: rcu detected stall in tx
  2021-05-24 22:16               ` Mathias Nyman
@ 2021-05-24 22:48                 ` Thinh Nguyen
  0 siblings, 0 replies; 11+ messages in thread
From: Thinh Nguyen @ 2021-05-24 22:48 UTC (permalink / raw)
  To: Mathias Nyman, Thinh Nguyen, Alan Stern
  Cc: Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot,
	Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa,
	linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86

Mathias Nyman wrote:
> On 24.5.2021 22.23, Thinh Nguyen wrote:
>> Alan Stern wrote:
>>> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote:
>>>> On 20.5.2021 23.30, Thinh Nguyen wrote:
>>>>> As for the xhci driver, there maybe a case where the stream URB never
>>>>> gets to complete because the transaction err_count is not properly
>>>>> updated. The err_count for transaction error is stored in ep_ring, but
>>>>> the xhci driver may not be able to lookup the correct ep_ring based on
>>>>> TRB address for streams. There are cases for streams where the event
>>>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section
>>>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error,
>>>>> it automatically does a soft-retry. This is seen from one of our
>>>>> testings that the driver was repeatedly doing soft-retry until the class
>>>>> driver timed out.
>>>>>
>>>>> Hi Mathias, maybe you have some comment on this? Thanks.
>>>>
>>>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry.
>>>> We should add one and prevent a loop. after e few soft resets we can end with a
>>>> hard reset to clear the host side endpoint halt.
>>>>
>>>> We don't know the URB that was being tansferred during the error, and can't 
>>>> give it back with a proper error code.
>>>> In that sense we still end up waiting for a timeout and someone to cancel
>>>> the urb.
>>>
>>> That's not good.  There may not be a timeout; drivers expect transfers 
>>> to complete with a failure, not to be retried indefinitely.
>>>
>>> However, if you do know which endpoint/stream the error is connected to, 
>>> you should be able to get the URB.  It will be the first one queued for 
>>> that endpoint/stream.
>>>
>>
>> When the xhci can't recover a transfer with soft-retry, no outstanding
>> transfer can proceed/complete for the endpoint. If the TRB pointer is 0,
>> we just don't know which stream or endpoint ring it's for, but we know
>> all the outstanding URBs of an endpoint. Let's may as well return an
>> error status for all of them after a limited number of soft-retries.
> 
> We get the endpoint, but not the stream.

Right.

> 
> I guess we could walk through each stream of this endpoint, and return the 
> first URB of every stream that has a pending URB.
> xHCI spec claims to supports 65533 streams per endpoint, but in real life 
> UAS probably only uses a few per endpoint?
> 
> -Mathias 
> 

Typically UASP devices advertise to support up to 32 streams. We notice
that some newer builds of Windows OS has a bug (or intentional?) that it
rejects any device that uses more or less than 32 streams (probably a
bug) in the descriptor.

I think we only need to do this if we don't know which stream the event
belongs to. Otherwise, we can keep the old logic.

BR,
Thinh


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-05-24 22:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener
2021-05-19 17:35 ` Alan Stern
2021-05-19 19:38   ` Thinh Nguyen
2021-05-20  2:01     ` Alan Stern
2021-05-20 20:30       ` Thinh Nguyen
2021-05-24 15:18         ` Mathias Nyman
2021-05-24 18:55           ` Alan Stern
2021-05-24 19:23             ` Thinh Nguyen
2021-05-24 22:16               ` Mathias Nyman
2021-05-24 22:48                 ` Thinh Nguyen
2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).