* RE: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx
@ 2021-05-19 16:14 Guido Kiener
2021-05-19 17:35 ` Alan Stern
2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
0 siblings, 2 replies; 11+ messages in thread
From: Guido Kiener @ 2021-05-19 16:14 UTC (permalink / raw)
To: Alan Stern, dave penkler
Cc: Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list,
bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx,
x86
> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote:
> > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote:
> > >
> > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote:
> > > > When the host driver detects a protocol error while processing an
> > > > URB it completes the URB with EPROTO status and marks the endpoint
> > > > as halted.
> > >
> > > Not true. It does not mark the endpoint as halted, not unless it
> > > receives a STALL handshake from the device. A STALL is not a
> > > protocol error.
> > >
> > > > When the class driver resubmits the URB and the if the host driver
> > > > finds the endpoint still marked as halted it should return EPIPE
> > > > status on the resubmitted URB
> > >
> > > Irrelevant.
> > Not at all. The point is that when an application is talking to an
> > instrument over the usbtmc driver, the underlying host controller and
> > its driver will detect and silence a babbling endpoint.
>
> No, they won't. That is, they will detect a babble error and return an error status, but
> they won't silence the endpoint. What makes you think they will?
Maybe there is a misunderstanding. I guess that Dave wanted to propose:
"EPROTO is a link level issue and needs to be handled by the host driver.
When the host driver detects a protocol error while processing an
URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint
as halted."
Is this a realistic fix for all host drivers?
-Guido
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx 2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener @ 2021-05-19 17:35 ` Alan Stern 2021-05-19 19:38 ` Thinh Nguyen 2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones 1 sibling, 1 reply; 11+ messages in thread From: Alan Stern @ 2021-05-19 17:35 UTC (permalink / raw) To: Guido Kiener Cc: dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote: > > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote: > > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote: > > > > > > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote: > > > > > When the host driver detects a protocol error while processing an > > > > > URB it completes the URB with EPROTO status and marks the endpoint > > > > > as halted. > > > > > > > > Not true. It does not mark the endpoint as halted, not unless it > > > > receives a STALL handshake from the device. A STALL is not a > > > > protocol error. > > > > > > > > > When the class driver resubmits the URB and the if the host driver > > > > > finds the endpoint still marked as halted it should return EPIPE > > > > > status on the resubmitted URB > > > > > > > > Irrelevant. > > > Not at all. The point is that when an application is talking to an > > > instrument over the usbtmc driver, the underlying host controller and > > > its driver will detect and silence a babbling endpoint. > > > > No, they won't. That is, they will detect a babble error and return an error status, but > > they won't silence the endpoint. What makes you think they will? > > Maybe there is a misunderstanding. I guess that Dave wanted to propose: > "EPROTO is a link level issue and needs to be handled by the host driver. > When the host driver detects a protocol error while processing an > URB it SHOULD complete the URB with EPROTO status The host controller drivers _do_ complete URBs with -EPROTO (or similar) status when a link-level error occurs... > and SHOULD mark the endpoint > as halted." but they don't mark the endpoint as halted. Even if they did, it wouldn't fix anything because the kernel allows URBs to be submitted to halted endpoints. In fact, it doesn't even keep track of which endpoints are or are not halted. > Is this a realistic fix for all host drivers? No, it isn't. An endpoint shouldn't be marked as halted unless it really is halted. Otherwise a driver might be tempted to clear the Halt feature, and some devices do not like to receive a Clear-Halt request for an endpoint that isn't halted. What we could do is what you suggested earlier: Note the fact that the endpoint is in some sort of fault condition and disallow further communication with the endpoint until the fault condition has been cleared. (It isn't entirely obvious exactly what actions should clear such a fault... I guess resetting or re-enabling the endpoint, or resetting the entire device.) Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-19 17:35 ` Alan Stern @ 2021-05-19 19:38 ` Thinh Nguyen 2021-05-20 2:01 ` Alan Stern 0 siblings, 1 reply; 11+ messages in thread From: Thinh Nguyen @ 2021-05-19 19:38 UTC (permalink / raw) To: Alan Stern, Guido Kiener Cc: dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 Alan Stern wrote: > On Wed, May 19, 2021 at 04:14:20PM +0000, Guido Kiener wrote: >>> On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote: >>>> On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote: >>>>> >>>>> On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote: >>>>>> When the host driver detects a protocol error while processing an >>>>>> URB it completes the URB with EPROTO status and marks the endpoint >>>>>> as halted. >>>>> >>>>> Not true. It does not mark the endpoint as halted, not unless it >>>>> receives a STALL handshake from the device. A STALL is not a >>>>> protocol error. >>>>> >>>>>> When the class driver resubmits the URB and the if the host driver >>>>>> finds the endpoint still marked as halted it should return EPIPE >>>>>> status on the resubmitted URB >>>>> >>>>> Irrelevant. >>>> Not at all. The point is that when an application is talking to an >>>> instrument over the usbtmc driver, the underlying host controller and >>>> its driver will detect and silence a babbling endpoint. >>> >>> No, they won't. That is, they will detect a babble error and return an error status, but >>> they won't silence the endpoint. What makes you think they will? >> >> Maybe there is a misunderstanding. I guess that Dave wanted to propose: >> "EPROTO is a link level issue and needs to be handled by the host driver. >> When the host driver detects a protocol error while processing an >> URB it SHOULD complete the URB with EPROTO status > > The host controller drivers _do_ complete URBs with -EPROTO (or similar) > status when a link-level error occurs... > >> and SHOULD mark the endpoint >> as halted." > > but they don't mark the endpoint as halted. Even if they did, it > wouldn't fix anything because the kernel allows URBs to be submitted to > halted endpoints. In fact, it doesn't even keep track of which > endpoints are or are not halted. > >> Is this a realistic fix for all host drivers? > > No, it isn't. > > An endpoint shouldn't be marked as halted unless it really is halted. > Otherwise a driver might be tempted to clear the Halt feature, and > some devices do not like to receive a Clear-Halt request for an endpoint > that isn't halted. > > What we could do is what you suggested earlier: Note the fact that the > endpoint is in some sort of fault condition and disallow further > communication with the endpoint until the fault condition has been > cleared. (It isn't entirely obvious exactly what actions should clear > such a fault... I guess resetting or re-enabling the endpoint, or > resetting the entire device.) > > Alan Stern > Hi Alan, Sorry if this diverges from the thread, but I've been wondering whether to add a change for this also. For xHCI hosts, after transactions errors, the endpoint will enter halted state. The driver will attempt a few soft-retries before giving up. According to the xHCI spec (section 4.6.8), a host may send a ClearFeature(endpoint_halt) to recover and restart the transfer (see "reset a pipe" in xhci spec), and the class driver can handle this after receiving something like -EPROTO from xhci. However, as you've pointed out, some devices don't like ClearFeature(ep_halt) and may not properly synchronize with the host on where it should restart. Some OS (such as Windows) do this. Not sure if we also want this? Currently the recovery is just a timeout and a port reset from the class driver, but the timeout is usually defaulted to a long time (e.g. 30 seconds for storage class driver). Thanks, Thinh ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-19 19:38 ` Thinh Nguyen @ 2021-05-20 2:01 ` Alan Stern 2021-05-20 20:30 ` Thinh Nguyen 0 siblings, 1 reply; 11+ messages in thread From: Alan Stern @ 2021-05-20 2:01 UTC (permalink / raw) To: Thinh Nguyen Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote: > Hi Alan, > > Sorry if this diverges from the thread, but I've been wondering whether > to add a change for this also. > > For xHCI hosts, after transactions errors, the endpoint will enter > halted state. No. You are misreading the xHCI spec. Section 4.6.8 says: ... the state of the associated Endpoint Context is set to Halted... Note this carefully. It says "Endpoint Context", not "endpoint". The endpoint is part of the device, whereas the endpoint context is part of the host controller. The device doesn't know when a transaction error has occurred; consequently such errors do not affect the endpoint. The host controller does know, and consequently such errors do affect the endpoint context. > The driver will attempt a few soft-retries before giving > up. According to the xHCI spec (section 4.6.8), a host may send a > ClearFeature(endpoint_halt) to recover and restart the transfer (see Not quite. The section of the spec you're talking about says: Software shall execute the following sequence to “reset a pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to device. It does not say the host controller will do this; it says that software will do it. > "reset a pipe" in xhci spec), and the class driver can handle this after > receiving something like -EPROTO from xhci. > > However, as you've pointed out, some devices don't like > ClearFeature(ep_halt) and may not properly synchronize with the host on > where it should restart. > > Some OS (such as Windows) do this. Not sure if we also want this? In general we should do the same thing as Windows does, because most hardware designers test their equipment on Windows systems but relatively few test on Linux systems. > Currently the recovery is just a timeout and a port reset from the class This depends on the driver. Some perform no recovery at all. > driver, but the timeout is usually defaulted to a long time (e.g. 30 > seconds for storage class driver). That 30-second timeout in the mass-storage driver applies in situations where a command fails to complete, not in situations where it completes quickly but with a -EPROTO or -EPIPE error. The fact is that only a small percentage of -EPROTO errors are recoverable. Some of them can be handled by a port reset, which can be pretty awkward to perform but does occasionally work. A lot of them occur because the USB cable has been unplugged; obviously there's no way to recover from that. With only a few exceptions, the best and simplest approach is not to try to recover at all. For the case in question (the syzbot bug report that started this thread), the class driver doesn't try to perform any recovery. It just resubmits the URB, getting into a tight retry loop which consumes too much CPU time. Simply giving up would be preferable. Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-20 2:01 ` Alan Stern @ 2021-05-20 20:30 ` Thinh Nguyen 2021-05-24 15:18 ` Mathias Nyman 0 siblings, 1 reply; 11+ messages in thread From: Thinh Nguyen @ 2021-05-20 20:30 UTC (permalink / raw) To: Alan Stern, Thinh Nguyen, Mathias Nyman Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 +Mathias Alan Stern wrote: > On Wed, May 19, 2021 at 07:38:52PM +0000, Thinh Nguyen wrote: >> Hi Alan, >> >> Sorry if this diverges from the thread, but I've been wondering whether >> to add a change for this also. >> >> For xHCI hosts, after transactions errors, the endpoint will enter >> halted state. > > No. You are misreading the xHCI spec. Section 4.6.8 says: > > ... the state of the associated Endpoint Context is set to > Halted... > > Note this carefully. It says "Endpoint Context", not "endpoint". > > The endpoint is part of the device, whereas the endpoint context is part > of the host controller. The device doesn't know when a transaction > error has occurred; consequently such errors do not affect the endpoint. > The host controller does know, and consequently such errors do affect > the endpoint context. > You're right, my mistake here. >> The driver will attempt a few soft-retries before giving >> up. According to the xHCI spec (section 4.6.8), a host may send a >> ClearFeature(endpoint_halt) to recover and restart the transfer (see > > Not quite. The section of the spec you're talking about says: > > Software shall execute the following sequence to “reset a > pipe”.... Issue a ClearFeature(ENDPOINT_HALT) request to > device. > > It does not say the host controller will do this; it says that software > will do it. Sorry for being unclear. I meant from the class driver, see my next sentence. > >> "reset a pipe" in xhci spec), and the class driver can handle this after >> receiving something like -EPROTO from xhci. >> >> However, as you've pointed out, some devices don't like >> ClearFeature(ep_halt) and may not properly synchronize with the host on >> where it should restart. >> >> Some OS (such as Windows) do this. Not sure if we also want this? > > In general we should do the same thing as Windows does, because most > hardware designers test their equipment on Windows systems but > relatively few test on Linux systems. > >> Currently the recovery is just a timeout and a port reset from the class > > This depends on the driver. Some perform no recovery at all. > >> driver, but the timeout is usually defaulted to a long time (e.g. 30 >> seconds for storage class driver). > > That 30-second timeout in the mass-storage driver applies in situations > where a command fails to complete, not in situations where it completes > quickly but with a -EPROTO or -EPIPE error. Hm... looks like we have a couple of issues in the uas storage class driver and the xhci driver. We may need to fix that in the uas storage driver because it doesn't seem to handle it. (check uas_data_cmplt() in uas.c). As for the xhci driver, there maybe a case where the stream URB never gets to complete because the transaction err_count is not properly updated. The err_count for transaction error is stored in ep_ring, but the xhci driver may not be able to lookup the correct ep_ring based on TRB address for streams. There are cases for streams where the event TRBs have their TRB pointer field cleared to '0' (xhci spec section 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, it automatically does a soft-retry. This is seen from one of our testings that the driver was repeatedly doing soft-retry until the class driver timed out. Hi Mathias, maybe you have some comment on this? Thanks. > > The fact is that only a small percentage of -EPROTO errors are > recoverable. Some of them can be handled by a port reset, which can be > pretty awkward to perform but does occasionally work. A lot of them > occur because the USB cable has been unplugged; obviously there's no way > to recover from that. With only a few exceptions, the best and simplest > approach is not to try to recover at all. If the cable is unplugged, then we should get a connection change event and the driver can handle it properly. Yes, it's probably simplest to do a port reset and let the transfer be incomplete/corrupted. However, I think we should give ClearFeature(ep_halt) some more thoughts as I think it can be a recovery mechanism for storage class driver, even though that it may not be foolproof. > > For the case in question (the syzbot bug report that started this > thread), the class driver doesn't try to perform any recovery. It just > resubmits the URB, getting into a tight retry loop which consumes too > much CPU time. Simply giving up would be preferable. > > Alan Stern > I see. By giving up, you mean doing port reset right? Otherwise it needs some other mechanism to synchronize with the device side. Thanks, Thinh ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-20 20:30 ` Thinh Nguyen @ 2021-05-24 15:18 ` Mathias Nyman 2021-05-24 18:55 ` Alan Stern 0 siblings, 1 reply; 11+ messages in thread From: Mathias Nyman @ 2021-05-24 15:18 UTC (permalink / raw) To: Thinh Nguyen, Alan Stern, Mathias Nyman Cc: Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 On 20.5.2021 23.30, Thinh Nguyen wrote: > +Mathias > ... > Hm... looks like we have a couple of issues in the uas storage class > driver and the xhci driver. > > We may need to fix that in the uas storage driver because it doesn't > seem to handle it. (check uas_data_cmplt() in uas.c). > > As for the xhci driver, there maybe a case where the stream URB never > gets to complete because the transaction err_count is not properly > updated. The err_count for transaction error is stored in ep_ring, but > the xhci driver may not be able to lookup the correct ep_ring based on > TRB address for streams. There are cases for streams where the event > TRBs have their TRB pointer field cleared to '0' (xhci spec section > 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, > it automatically does a soft-retry. This is seen from one of our > testings that the driver was repeatedly doing soft-retry until the class > driver timed out. > > Hi Mathias, maybe you have some comment on this? Thanks. This is true, if TRB pointer is 0 then there is no retry limit for soft retry. We should add one and prevent a loop. after e few soft resets we can end with a hard reset to clear the host side endpoint halt. We don't know the URB that was being tansferred during the error, and can't give it back with a proper error code. In that sense we still end up waiting for a timeout and someone to cancel the urb. -Mathias ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-24 15:18 ` Mathias Nyman @ 2021-05-24 18:55 ` Alan Stern 2021-05-24 19:23 ` Thinh Nguyen 0 siblings, 1 reply; 11+ messages in thread From: Alan Stern @ 2021-05-24 18:55 UTC (permalink / raw) To: Mathias Nyman Cc: Thinh Nguyen, Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote: > On 20.5.2021 23.30, Thinh Nguyen wrote: > > As for the xhci driver, there maybe a case where the stream URB never > > gets to complete because the transaction err_count is not properly > > updated. The err_count for transaction error is stored in ep_ring, but > > the xhci driver may not be able to lookup the correct ep_ring based on > > TRB address for streams. There are cases for streams where the event > > TRBs have their TRB pointer field cleared to '0' (xhci spec section > > 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, > > it automatically does a soft-retry. This is seen from one of our > > testings that the driver was repeatedly doing soft-retry until the class > > driver timed out. > > > > Hi Mathias, maybe you have some comment on this? Thanks. > > This is true, if TRB pointer is 0 then there is no retry limit for soft retry. > We should add one and prevent a loop. after e few soft resets we can end with a > hard reset to clear the host side endpoint halt. > > We don't know the URB that was being tansferred during the error, and can't > give it back with a proper error code. > In that sense we still end up waiting for a timeout and someone to cancel > the urb. That's not good. There may not be a timeout; drivers expect transfers to complete with a failure, not to be retried indefinitely. However, if you do know which endpoint/stream the error is connected to, you should be able to get the URB. It will be the first one queued for that endpoint/stream. Alan Stern ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-24 18:55 ` Alan Stern @ 2021-05-24 19:23 ` Thinh Nguyen 2021-05-24 22:16 ` Mathias Nyman 0 siblings, 1 reply; 11+ messages in thread From: Thinh Nguyen @ 2021-05-24 19:23 UTC (permalink / raw) To: Alan Stern, Mathias Nyman Cc: Thinh Nguyen, Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 Alan Stern wrote: > On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote: >> On 20.5.2021 23.30, Thinh Nguyen wrote: >>> As for the xhci driver, there maybe a case where the stream URB never >>> gets to complete because the transaction err_count is not properly >>> updated. The err_count for transaction error is stored in ep_ring, but >>> the xhci driver may not be able to lookup the correct ep_ring based on >>> TRB address for streams. There are cases for streams where the event >>> TRBs have their TRB pointer field cleared to '0' (xhci spec section >>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, >>> it automatically does a soft-retry. This is seen from one of our >>> testings that the driver was repeatedly doing soft-retry until the class >>> driver timed out. >>> >>> Hi Mathias, maybe you have some comment on this? Thanks. >> >> This is true, if TRB pointer is 0 then there is no retry limit for soft retry. >> We should add one and prevent a loop. after e few soft resets we can end with a >> hard reset to clear the host side endpoint halt. >> >> We don't know the URB that was being tansferred during the error, and can't >> give it back with a proper error code. >> In that sense we still end up waiting for a timeout and someone to cancel >> the urb. > > That's not good. There may not be a timeout; drivers expect transfers > to complete with a failure, not to be retried indefinitely. > > However, if you do know which endpoint/stream the error is connected to, > you should be able to get the URB. It will be the first one queued for > that endpoint/stream. > When the xhci can't recover a transfer with soft-retry, no outstanding transfer can proceed/complete for the endpoint. If the TRB pointer is 0, we just don't know which stream or endpoint ring it's for, but we know all the outstanding URBs of an endpoint. Let's may as well return an error status for all of them after a limited number of soft-retries. BR, Thinh ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-24 19:23 ` Thinh Nguyen @ 2021-05-24 22:16 ` Mathias Nyman 2021-05-24 22:48 ` Thinh Nguyen 0 siblings, 1 reply; 11+ messages in thread From: Mathias Nyman @ 2021-05-24 22:16 UTC (permalink / raw) To: Thinh Nguyen, Alan Stern Cc: Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 On 24.5.2021 22.23, Thinh Nguyen wrote: > Alan Stern wrote: >> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote: >>> On 20.5.2021 23.30, Thinh Nguyen wrote: >>>> As for the xhci driver, there maybe a case where the stream URB never >>>> gets to complete because the transaction err_count is not properly >>>> updated. The err_count for transaction error is stored in ep_ring, but >>>> the xhci driver may not be able to lookup the correct ep_ring based on >>>> TRB address for streams. There are cases for streams where the event >>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section >>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, >>>> it automatically does a soft-retry. This is seen from one of our >>>> testings that the driver was repeatedly doing soft-retry until the class >>>> driver timed out. >>>> >>>> Hi Mathias, maybe you have some comment on this? Thanks. >>> >>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry. >>> We should add one and prevent a loop. after e few soft resets we can end with a >>> hard reset to clear the host side endpoint halt. >>> >>> We don't know the URB that was being tansferred during the error, and can't >>> give it back with a proper error code. >>> In that sense we still end up waiting for a timeout and someone to cancel >>> the urb. >> >> That's not good. There may not be a timeout; drivers expect transfers >> to complete with a failure, not to be retried indefinitely. >> >> However, if you do know which endpoint/stream the error is connected to, >> you should be able to get the URB. It will be the first one queued for >> that endpoint/stream. >> > > When the xhci can't recover a transfer with soft-retry, no outstanding > transfer can proceed/complete for the endpoint. If the TRB pointer is 0, > we just don't know which stream or endpoint ring it's for, but we know > all the outstanding URBs of an endpoint. Let's may as well return an > error status for all of them after a limited number of soft-retries. We get the endpoint, but not the stream. I guess we could walk through each stream of this endpoint, and return the first URB of every stream that has a pending URB. xHCI spec claims to supports 65533 streams per endpoint, but in real life UAS probably only uses a few per endpoint? -Mathias ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [syzbot] INFO: rcu detected stall in tx 2021-05-24 22:16 ` Mathias Nyman @ 2021-05-24 22:48 ` Thinh Nguyen 0 siblings, 0 replies; 11+ messages in thread From: Thinh Nguyen @ 2021-05-24 22:48 UTC (permalink / raw) To: Mathias Nyman, Thinh Nguyen, Alan Stern Cc: Mathias Nyman, Guido Kiener, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, lee.jones, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 Mathias Nyman wrote: > On 24.5.2021 22.23, Thinh Nguyen wrote: >> Alan Stern wrote: >>> On Mon, May 24, 2021 at 06:18:59PM +0300, Mathias Nyman wrote: >>>> On 20.5.2021 23.30, Thinh Nguyen wrote: >>>>> As for the xhci driver, there maybe a case where the stream URB never >>>>> gets to complete because the transaction err_count is not properly >>>>> updated. The err_count for transaction error is stored in ep_ring, but >>>>> the xhci driver may not be able to lookup the correct ep_ring based on >>>>> TRB address for streams. There are cases for streams where the event >>>>> TRBs have their TRB pointer field cleared to '0' (xhci spec section >>>>> 4.12.2). If the xhci driver doesn't see ep_ring for transaction error, >>>>> it automatically does a soft-retry. This is seen from one of our >>>>> testings that the driver was repeatedly doing soft-retry until the class >>>>> driver timed out. >>>>> >>>>> Hi Mathias, maybe you have some comment on this? Thanks. >>>> >>>> This is true, if TRB pointer is 0 then there is no retry limit for soft retry. >>>> We should add one and prevent a loop. after e few soft resets we can end with a >>>> hard reset to clear the host side endpoint halt. >>>> >>>> We don't know the URB that was being tansferred during the error, and can't >>>> give it back with a proper error code. >>>> In that sense we still end up waiting for a timeout and someone to cancel >>>> the urb. >>> >>> That's not good. There may not be a timeout; drivers expect transfers >>> to complete with a failure, not to be retried indefinitely. >>> >>> However, if you do know which endpoint/stream the error is connected to, >>> you should be able to get the URB. It will be the first one queued for >>> that endpoint/stream. >>> >> >> When the xhci can't recover a transfer with soft-retry, no outstanding >> transfer can proceed/complete for the endpoint. If the TRB pointer is 0, >> we just don't know which stream or endpoint ring it's for, but we know >> all the outstanding URBs of an endpoint. Let's may as well return an >> error status for all of them after a limited number of soft-retries. > > We get the endpoint, but not the stream. Right. > > I guess we could walk through each stream of this endpoint, and return the > first URB of every stream that has a pending URB. > xHCI spec claims to supports 65533 streams per endpoint, but in real life > UAS probably only uses a few per endpoint? > > -Mathias > Typically UASP devices advertise to support up to 32 streams. We notice that some newer builds of Windows OS has a bug (or intentional?) that it rejects any device that uses more or less than 32 streams (probably a bug) in the descriptor. I think we only need to do this if we don't know which stream the event belongs to. Otherwise, we can keep the old logic. BR, Thinh ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx 2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener 2021-05-19 17:35 ` Alan Stern @ 2021-05-19 18:04 ` Lee Jones 1 sibling, 0 replies; 11+ messages in thread From: Lee Jones @ 2021-05-19 18:04 UTC (permalink / raw) To: Guido Kiener Cc: Alan Stern, dave penkler, Dmitry Vyukov, syzbot, Greg Kroah-Hartman, USB list, bp, dwmw, hpa, linux-kernel, luto, mingo, syzkaller-bugs, tglx, x86 On Wed, 19 May 2021, Guido Kiener wrote: > > On Wed, May 19, 2021 at 10:48:29AM +0200, dave penkler wrote: > > > On Sat, 8 May 2021 at 16:29, Alan Stern <stern@rowland.harvard.edu> wrote: > > > > > > > > On Sat, May 08, 2021 at 10:14:41AM +0200, dave penkler wrote: > > > > > When the host driver detects a protocol error while processing an > > > > > URB it completes the URB with EPROTO status and marks the endpoint > > > > > as halted. > > > > > > > > Not true. It does not mark the endpoint as halted, not unless it > > > > receives a STALL handshake from the device. A STALL is not a > > > > protocol error. > > > > > > > > > When the class driver resubmits the URB and the if the host driver > > > > > finds the endpoint still marked as halted it should return EPIPE > > > > > status on the resubmitted URB > > > > > > > > Irrelevant. > > > Not at all. The point is that when an application is talking to an > > > instrument over the usbtmc driver, the underlying host controller and > > > its driver will detect and silence a babbling endpoint. > > > > No, they won't. That is, they will detect a babble error and return an error status, but > > they won't silence the endpoint. What makes you think they will? > > Maybe there is a misunderstanding. I guess that Dave wanted to propose: > "EPROTO is a link level issue and needs to be handled by the host driver. > When the host driver detects a protocol error while processing an > URB it SHOULD complete the URB with EPROTO status and SHOULD mark the endpoint > as halted." > Is this a realistic fix for all host drivers? > > -Guido Guido, would you mind taking a look at your mailer settings please? I now have >=7 threads running through my inbox with the same subject. For some reason your mailer is insisting on creating a new one for each of your replies. It's also adding odd "re: re: re: ..." prefixes. TIA -- Lee Jones [李琼斯] Senior Technical Lead - Developer Services Linaro.org │ Open source software for Arm SoCs Follow Linaro: Facebook | Twitter | Blog ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2021-05-24 22:49 UTC | newest] Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-05-19 16:14 Re: Re: Re: Re: Re: [syzbot] INFO: rcu detected stall in tx Guido Kiener 2021-05-19 17:35 ` Alan Stern 2021-05-19 19:38 ` Thinh Nguyen 2021-05-20 2:01 ` Alan Stern 2021-05-20 20:30 ` Thinh Nguyen 2021-05-24 15:18 ` Mathias Nyman 2021-05-24 18:55 ` Alan Stern 2021-05-24 19:23 ` Thinh Nguyen 2021-05-24 22:16 ` Mathias Nyman 2021-05-24 22:48 ` Thinh Nguyen 2021-05-19 18:04 ` Re: Re: Re: Re: " Lee Jones
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).