All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
@ 2016-03-18  7:01 Rajesh Bhagat
  2016-03-18 11:20 ` Mathias Nyman
  2016-03-18 14:21 ` Alan Stern
  0 siblings, 2 replies; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-18  7:01 UTC (permalink / raw)
  To: linux-usb, linux-kernel; +Cc: gregkh, mathias.nyman, sriram.dash, Rajesh Bhagat

We are facing issue while performing the system resume operation from STR
where XHCI is going to indefinite hang/sleep state due to
wait_for_completion API called in function xhci_alloc_dev for command
TRB_ENABLE_SLOT which never completes.

Now, xhci_handle_command_timeout function is called and prints
"Command timeout" message but never calls complete API for above
TRB_ENABLE_SLOT command as xhci_abort_cmd_ring is successful.

Solution to above problem is:
1. calling xhci_cleanup_command_queue API even if xhci_abort_cmd_ring
   is successful or not.
2. checking the status of reset_device in usb core code.

Before Fix:
root@phoenix:~# echo mem > /sys/power/state
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
PM: suspend of devices complete after 103.144 msecs
PM: late suspend of devices complete after 1.503 msecs
PM: noirq suspend of devices complete after 1.220 msecs
Disabling non-boot CPUs ...
CPU1: shutdown
Retrying again to check for CPU kill
CPU1 killed.
Enabling non-boot CPUs ...
CPU1 is up
PM: noirq resume of devices complete after 1.996 msecs
PM: early resume of devices complete after 1.152 msecs
usb usb1: root hub lost power or was reset
usb usb2: root hub lost power or was reset
----- <<hangs indefinitely>> --------------

After Fix:
root@phoenix:~#  echo mem > /sys/power/state
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
PM: suspend of devices complete after 103.086 msecs
PM: late suspend of devices complete after 1.517 msecs
PM: noirq suspend of devices complete after 1.217 msecs
Disabling non-boot CPUs ...
CPU1: shutdown
Retrying again to check for CPU kill
CPU1 killed.
Enabling non-boot CPUs ...
CPU1 is up
PM: noirq resume of devices complete after 1.991 msecs
PM: early resume of devices complete after 1.239 msecs
usb usb1: root hub lost power or was reset
usb usb2: root hub lost power or was reset
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
PM: resume of devices complete after 75567.769 msecs
Restarting tasks ...
usb 1-1: USB disconnect, device number 2
usb 2-1: USB disconnect, device number 2
usb 2-1.1: USB disconnect, device number 3
done.
root@phoenix:~#

Signed-off-by: Sriram Dash <sriram.dash@nxp.com>
Signed-off-by: Rajesh Bhagat <rajesh.bhagat@nxp.com>
---
 drivers/usb/core/hub.c       |   12 ++++++++----
 drivers/usb/host/xhci-ring.c |    2 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 38cc4ba..c906018 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2897,10 +2897,14 @@ done:
 			/* The xHC may think the device is already reset,
 			 * so ignore the status.
 			 */
-			if (hcd->driver->reset_device)
-				hcd->driver->reset_device(hcd, udev);
-
-			usb_set_device_state(udev, USB_STATE_DEFAULT);
+			if (hcd->driver->reset_device) {
+				status = hcd->driver->reset_device(hcd, udev);
+				if (status == 0)
+					usb_set_device_state(udev, USB_STATE_DEFAULT);
+				else
+					usb_set_device_state(udev, USB_STATE_NOTATTACHED);
+			} else
+				usb_set_device_state(udev, USB_STATE_DEFAULT);
 		}
 	} else {
 		if (udev)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 7cf6621..be8fd61 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1272,9 +1272,9 @@ void xhci_handle_command_timeout(unsigned long data)
 		spin_unlock_irqrestore(&xhci->lock, flags);
 		xhci_dbg(xhci, "Command timeout\n");
 		ret = xhci_abort_cmd_ring(xhci);
+		xhci_cleanup_command_queue(xhci);
 		if (unlikely(ret == -ESHUTDOWN)) {
 			xhci_err(xhci, "Abort command ring failed\n");
-			xhci_cleanup_command_queue(xhci);
 			usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
 			xhci_dbg(xhci, "xHCI host controller is dead.\n");
 		}
-- 
1.7.7.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-18  7:01 [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout Rajesh Bhagat
@ 2016-03-18 11:20 ` Mathias Nyman
  2016-03-21  4:18   ` Rajesh Bhagat
  2016-03-18 14:21 ` Alan Stern
  1 sibling, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-03-18 11:20 UTC (permalink / raw)
  To: Rajesh Bhagat, linux-usb, linux-kernel; +Cc: gregkh, mathias.nyman, sriram.dash

On 18.03.2016 09:01, Rajesh Bhagat wrote:
> We are facing issue while performing the system resume operation from STR
> where XHCI is going to indefinite hang/sleep state due to
> wait_for_completion API called in function xhci_alloc_dev for command
> TRB_ENABLE_SLOT which never completes.
>
> Now, xhci_handle_command_timeout function is called and prints
> "Command timeout" message but never calls complete API for above
> TRB_ENABLE_SLOT command as xhci_abort_cmd_ring is successful.
>
> Solution to above problem is:
> 1. calling xhci_cleanup_command_queue API even if xhci_abort_cmd_ring
>     is successful or not.
> 2. checking the status of reset_device in usb core code.


Hi

I think clearing the whole command ring is a bit too much in this case.
It may cause issues for all attached devices when one command times out.

We need to look in more detail why we fail to call completion for that one aborted command.

The bigger question is why the timeout happens in the first place?

What kernel version, and what xhci vendor was this triggered on?

It's possible that the timeout is related either to the locking issue found by Chris Bainbridge:
http://marc.info/?l=linux-usb&m=145493945408601&w=2

or the resume issues in this thread, (see full thread)
http://marc.info/?l=linux-usb&m=145477850706552&w=2

Does any of those proposed solutions fix the command timeout for you?

-Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-18  7:01 [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout Rajesh Bhagat
  2016-03-18 11:20 ` Mathias Nyman
@ 2016-03-18 14:21 ` Alan Stern
  2016-03-21  4:22   ` Rajesh Bhagat
  1 sibling, 1 reply; 18+ messages in thread
From: Alan Stern @ 2016-03-18 14:21 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: linux-usb, linux-kernel, gregkh, mathias.nyman, sriram.dash

On Fri, 18 Mar 2016, Rajesh Bhagat wrote:

> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -2897,10 +2897,14 @@ done:
>  			/* The xHC may think the device is already reset,
>  			 * so ignore the status.
>  			 */
> -			if (hcd->driver->reset_device)
> -				hcd->driver->reset_device(hcd, udev);
> -
> -			usb_set_device_state(udev, USB_STATE_DEFAULT);
> +			if (hcd->driver->reset_device) {
> +				status = hcd->driver->reset_device(hcd, udev);
> +				if (status == 0)
> +					usb_set_device_state(udev, USB_STATE_DEFAULT);
> +				else
> +					usb_set_device_state(udev, USB_STATE_NOTATTACHED);
> +			} else
> +				usb_set_device_state(udev, USB_STATE_DEFAULT);

This is a really bad patch:

You left in the comment about ignoring the status, but then you changed 
the code so that it doesn't ignore the status!

You also called usb_set_device_state() more times than necessary.  You 
could have done it like this:

			if (hcd->driver->reset_device)
				status = hcd->driver->reset_device(hcd, udev);
			if (status == 0)
				usb_set_device_state(udev, USB_STATE_DEFAULT);
			else
				usb_set_device_state(udev, USB_STATE_NOTATTACHED);

(Even that could be simplified further, by combining it with the code
that follows.)

Finally, you violated the 80-column limit.

Alan Stern

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-18 11:20 ` Mathias Nyman
@ 2016-03-21  4:18   ` Rajesh Bhagat
  2016-03-21  9:16     ` Mathias Nyman
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-21  4:18 UTC (permalink / raw)
  To: Mathias Nyman, linux-usb, linux-kernel; +Cc: gregkh, mathias.nyman, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Friday, March 18, 2016 4:51 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Cc: gregkh@linuxfoundation.org; mathias.nyman@intel.com; Sriram Dash
> <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 18.03.2016 09:01, Rajesh Bhagat wrote:
> > We are facing issue while performing the system resume operation from
> > STR where XHCI is going to indefinite hang/sleep state due to
> > wait_for_completion API called in function xhci_alloc_dev for command
> > TRB_ENABLE_SLOT which never completes.
> >
> > Now, xhci_handle_command_timeout function is called and prints
> > "Command timeout" message but never calls complete API for above
> > TRB_ENABLE_SLOT command as xhci_abort_cmd_ring is successful.
> >
> > Solution to above problem is:
> > 1. calling xhci_cleanup_command_queue API even if xhci_abort_cmd_ring
> >     is successful or not.
> > 2. checking the status of reset_device in usb core code.
> 
> 
> Hi
> 
> I think clearing the whole command ring is a bit too much in this case.
> It may cause issues for all attached devices when one command times out.
> 


Hi Mathias, 

I understand your point, But I want to understand how would completion handler be called 
if a command is timed out and xhci_abort_cmd_ring is successful. In this case all the code 
would be waiting on completion handler forever. 
	

> We need to look in more detail why we fail to call completion for that one aborted
> command.
> 

I checked the below code, Please correct me if I am wrong

code waiting on wait_for_completion: 
int xhci_alloc_dev(struct usb_hcd *hcd, struct usb_device *udev)
{
...
        ret = xhci_queue_slot_control(xhci, command, TRB_ENABLE_SLOT, 0);
...

        wait_for_completion(command->completion); <=== waiting for command to complete 


code calling completion handler:
1. handle_cmd_completion -> xhci_complete_del_and_free_cmd
2. xhci_handle_command_timeout -> xhci_abort_cmd_ring(failure) -> xhci_cleanup_command_queue -> xhci_complete_del_and_free_cmd

In our case command is timed out, Hence we hit the case #2 but xhci_abort_cmd_ring is success which 
does not calls complete. 


> The bigger question is why the timeout happens in the first place?
> 

We are doing suspend resume operation, It might be controller issue :(, IMO software should not 
hang/stop if hardware is not behaving correct. 

> What kernel version, and what xhci vendor was this triggered on?
> 

We are using 4.1.8 kernel

> It's possible that the timeout is related either to the locking issue found by Chris
> Bainbridge:
> http://marc.info/?l=linux-usb&m=145493945408601&w=2
> 
> or the resume issues in this thread, (see full thread)
> http://marc.info/?l=linux-usb&m=145477850706552&w=2
> 
> Does any of those proposed solutions fix the command timeout for you?
> 

I will check the above patches and share status.

> -Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-18 14:21 ` Alan Stern
@ 2016-03-21  4:22   ` Rajesh Bhagat
  2016-03-21 15:02     ` Alan Stern
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-21  4:22 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-usb, linux-kernel, gregkh, mathias.nyman, Sriram Dash



> -----Original Message-----
> From: Alan Stern [mailto:stern@rowland.harvard.edu]
> Sent: Friday, March 18, 2016 7:51 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: linux-usb@vger.kernel.org; linux-kernel@vger.kernel.org;
> gregkh@linuxfoundation.org; mathias.nyman@intel.com; Sriram Dash
> <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On Fri, 18 Mar 2016, Rajesh Bhagat wrote:
> 
> > --- a/drivers/usb/core/hub.c
> > +++ b/drivers/usb/core/hub.c
> > @@ -2897,10 +2897,14 @@ done:
> >  			/* The xHC may think the device is already reset,
> >  			 * so ignore the status.
> >  			 */
> > -			if (hcd->driver->reset_device)
> > -				hcd->driver->reset_device(hcd, udev);
> > -
> > -			usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +			if (hcd->driver->reset_device) {
> > +				status = hcd->driver->reset_device(hcd, udev);
> > +				if (status == 0)
> > +					usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +				else
> > +					usb_set_device_state(udev,
> USB_STATE_NOTATTACHED);
> > +			} else
> > +				usb_set_device_state(udev, USB_STATE_DEFAULT);
> 
> This is a really bad patch:
> 
> You left in the comment about ignoring the status, but then you changed the code so
> that it doesn't ignore the status!
> 

My Apologies, I completely missed the above comment which was added before. 

> You also called usb_set_device_state() more times than necessary.  You could have
> done it like this:
> 
> 			if (hcd->driver->reset_device)
> 				status = hcd->driver->reset_device(hcd, udev);
> 			if (status == 0)
> 				usb_set_device_state(udev, USB_STATE_DEFAULT);
> 			else
> 				usb_set_device_state(udev, USB_STATE_NOTATTACHED);
> 
> (Even that could be simplified further, by combining it with the code that follows.)
> 
> Finally, you violated the 80-column limit.
> 

I agree to your point. Actually the intent was to check the return status of reset_device which 
is currently being ignored. I encountered the reset_device failure case in resume operation (STR) 
which is increasing the time of resume and causing unexpected crashes if return value is not checked. 

Do you agree it should be checked here? If yes, I can rework this patch. 

> Alan Stern

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-21  4:18   ` Rajesh Bhagat
@ 2016-03-21  9:16     ` Mathias Nyman
  2016-03-22  5:19       ` Rajesh Bhagat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-03-21  9:16 UTC (permalink / raw)
  To: Rajesh Bhagat, Mathias Nyman, linux-usb, linux-kernel; +Cc: gregkh, Sriram Dash

On 21.03.2016 06:18, Rajesh Bhagat wrote:
>
>
>>
>> Hi
>>
>> I think clearing the whole command ring is a bit too much in this case.
>> It may cause issues for all attached devices when one command times out.
>>
>
> Hi Mathias,
>
> I understand your point, But I want to understand how would completion handler be called
> if a command is timed out and xhci_abort_cmd_ring is successful. In this case all the code
> would be waiting on completion handler forever.
> 	
>
> 2. xhci_handle_command_timeout -> xhci_abort_cmd_ring(failure) -> xhci_cleanup_command_queue -> xhci_complete_del_and_free_cmd
>
> In our case command is timed out, Hence we hit the case #2 but xhci_abort_cmd_ring is success which
> does not calls complete.

xhci_abort_cmd_ring() will write CA bit (CMD_RING_ABORT) to CRCR register.
This will generate a command completion event with status "command aborted" for the pending command.
This event is then followed by a "command ring stopped" command completion event.

See xHCI specs 5.4.5 and 4.6.1.2

handle_cmd_completion() will check if cmd_comp_code == COMP_CMD_ABORT, goto event_handled, and call
xhci_complete_del_and_free_cmd(cmd, cmd_comp_code) for the aborted command.

If xHCI already processed the aborted command, we might only get a command ring stopped event, in this
case handle_cmd_completion() will call xhci_handle_stopped_cmd_ring(xhci, cmd), which will turn the
commands that were tagged for "abort" that still remain on the command ring to NO-OP commands.

The completion callback will be called for these NO-OP command later when we get a command completion event
for them.
  
>> What kernel version, and what xhci vendor was this triggered on?
>>
>
> We are using 4.1.8 kernel
>

Are you able to try a more recent version?

-Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-21  4:22   ` Rajesh Bhagat
@ 2016-03-21 15:02     ` Alan Stern
  0 siblings, 0 replies; 18+ messages in thread
From: Alan Stern @ 2016-03-21 15:02 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: linux-usb, linux-kernel, gregkh, mathias.nyman, Sriram Dash

On Mon, 21 Mar 2016, Rajesh Bhagat wrote:

> I agree to your point. Actually the intent was to check the return status of reset_device which 
> is currently being ignored. I encountered the reset_device failure case in resume operation (STR) 
> which is increasing the time of resume and causing unexpected crashes if return value is not checked. 
> 
> Do you agree it should be checked here? If yes, I can rework this patch. 

I don't know.  You should ask the xHCI maintainter, Mathias.

Alan Stern

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-21  9:16     ` Mathias Nyman
@ 2016-03-22  5:19       ` Rajesh Bhagat
  2016-03-22 12:05         ` Mathias Nyman
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-22  5:19 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@intel.com]
> Sent: Monday, March 21, 2016 2:46 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>; Mathias Nyman
> <mathias.nyman@linux.intel.com>; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Cc: gregkh@linuxfoundation.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 21.03.2016 06:18, Rajesh Bhagat wrote:
> >
> >
> >>
> >> Hi
> >>
> >> I think clearing the whole command ring is a bit too much in this case.
> >> It may cause issues for all attached devices when one command times out.
> >>
> >
> > Hi Mathias,
> >
> > I understand your point, But I want to understand how would completion
> > handler be called if a command is timed out and xhci_abort_cmd_ring is
> > successful. In this case all the code would be waiting on completion handler forever.
> >
> >
> > 2. xhci_handle_command_timeout -> xhci_abort_cmd_ring(failure) ->
> > xhci_cleanup_command_queue -> xhci_complete_del_and_free_cmd
> >
> > In our case command is timed out, Hence we hit the case #2 but
> > xhci_abort_cmd_ring is success which does not calls complete.
> 
> xhci_abort_cmd_ring() will write CA bit (CMD_RING_ABORT) to CRCR register.
> This will generate a command completion event with status "command aborted" for
> the pending command.
> This event is then followed by a "command ring stopped" command completion event.
> 
> See xHCI specs 5.4.5 and 4.6.1.2
> 
> handle_cmd_completion() will check if cmd_comp_code == COMP_CMD_ABORT, goto
> event_handled, and call xhci_complete_del_and_free_cmd(cmd, cmd_comp_code) for
> the aborted command.
> 
> If xHCI already processed the aborted command, we might only get a command ring
> stopped event, in this case handle_cmd_completion() will call
> xhci_handle_stopped_cmd_ring(xhci, cmd), which will turn the commands that were
> tagged for "abort" that still remain on the command ring to NO-OP commands.
> 
> The completion callback will be called for these NO-OP command later when we get a
> command completion event for them.
> 

Thanks Mathias for detailed explanation. Now I understand how completion handler is 
supposed to be called in this scenario. 

But in our case, somehow we are not getting any event and handle_cmd_completion function 
is not getting called even after successful xhci_abort_cmd_ring when command timed out. 

Now, my point here is code prior to this patch xhci: rework command timeout and cancellation,
Code would have returned in case command timed out in xhci_alloc_dev itself.

-       /* XXX: how much time for xHC slot assignment? */
-       timeleft = wait_for_completion_interruptible_timeout(
-                       command->completion,
-                       XHCI_CMD_DEFAULT_TIMEOUT);
-       if (timeleft <= 0) {
-               xhci_warn(xhci, "%s while waiting for a slot\n",
-                               timeleft == 0 ? "Timeout" : "Signal");
-               /* cancel the enable slot request */
-               ret = xhci_cancel_cmd(xhci, NULL, command->command_trb);
-               return ret;
-       }
+       wait_for_completion(command->completion);

But after this patch, we are waiting for hardware event, which is somehow not generated 
and causing a hang scenario. 

IMO, The assumption that "xhci_abort_cmd_ring would always generate an event 
and handle_cmd_completion would be called" will not be always be true if HW is in bad state.

Please share your opinion.

> >> What kernel version, and what xhci vendor was this triggered on?
> >>
> >
> > We are using 4.1.8 kernel
> >
> 
> Are you able to try a more recent version?
> 

Using a newer kernel version would be bit difficult, but I would surely try it.

> -Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-22  5:19       ` Rajesh Bhagat
@ 2016-03-22 12:05         ` Mathias Nyman
  2016-03-23  3:53           ` Rajesh Bhagat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-03-22 12:05 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash

On 22.03.2016 07:19, Rajesh Bhagat wrote:
>
>
>> -----Original Message-----
>> From: Mathias Nyman [mailto:mathias.nyman@intel.com]
>> Sent: Monday, March 21, 2016 2:46 PM
>> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>; Mathias Nyman
>> <mathias.nyman@linux.intel.com>; linux-usb@vger.kernel.org; linux-
>> kernel@vger.kernel.org
>> Cc: gregkh@linuxfoundation.org; Sriram Dash <sriram.dash@nxp.com>
>> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
>> commmand timeout
>>
>> On 21.03.2016 06:18, Rajesh Bhagat wrote:
>>>
>>>
>>>>
>>>> Hi
>>>>
>>>> I think clearing the whole command ring is a bit too much in this case.
>>>> It may cause issues for all attached devices when one command times out.
>>>>
>>>
>>> Hi Mathias,
>>>
>>> I understand your point, But I want to understand how would completion
>>> handler be called if a command is timed out and xhci_abort_cmd_ring is
>>> successful. In this case all the code would be waiting on completion handler forever.
>>>
>>>
>>> 2. xhci_handle_command_timeout -> xhci_abort_cmd_ring(failure) ->
>>> xhci_cleanup_command_queue -> xhci_complete_del_and_free_cmd
>>>
>>> In our case command is timed out, Hence we hit the case #2 but
>>> xhci_abort_cmd_ring is success which does not calls complete.
>>
>> xhci_abort_cmd_ring() will write CA bit (CMD_RING_ABORT) to CRCR register.
>> This will generate a command completion event with status "command aborted" for
>> the pending command.
>> This event is then followed by a "command ring stopped" command completion event.
>>
>> See xHCI specs 5.4.5 and 4.6.1.2
>>
>> handle_cmd_completion() will check if cmd_comp_code == COMP_CMD_ABORT, goto
>> event_handled, and call xhci_complete_del_and_free_cmd(cmd, cmd_comp_code) for
>> the aborted command.
>>
>> If xHCI already processed the aborted command, we might only get a command ring
>> stopped event, in this case handle_cmd_completion() will call
>> xhci_handle_stopped_cmd_ring(xhci, cmd), which will turn the commands that were
>> tagged for "abort" that still remain on the command ring to NO-OP commands.
>>
>> The completion callback will be called for these NO-OP command later when we get a
>> command completion event for them.
>>
>
> Thanks Mathias for detailed explanation. Now I understand how completion handler is
> supposed to be called in this scenario.
>
> But in our case, somehow we are not getting any event and handle_cmd_completion function
> is not getting called even after successful xhci_abort_cmd_ring when command timed out.
>
> Now, my point here is code prior to this patch xhci: rework command timeout and cancellation,
> Code would have returned in case command timed out in xhci_alloc_dev itself.
>
> -       /* XXX: how much time for xHC slot assignment? */
> -       timeleft = wait_for_completion_interruptible_timeout(
> -                       command->completion,
> -                       XHCI_CMD_DEFAULT_TIMEOUT);
> -       if (timeleft <= 0) {
> -               xhci_warn(xhci, "%s while waiting for a slot\n",
> -                               timeleft == 0 ? "Timeout" : "Signal");
> -               /* cancel the enable slot request */
> -               ret = xhci_cancel_cmd(xhci, NULL, command->command_trb);
> -               return ret;
> -       }
> +       wait_for_completion(command->completion);
>
> But after this patch, we are waiting for hardware event, which is somehow not generated
> and causing a hang scenario.
>
> IMO, The assumption that "xhci_abort_cmd_ring would always generate an event
> and handle_cmd_completion would be called" will not be always be true if HW is in bad state.
>
> Please share your opinion.
>

writing the CA (command abort) bit in CRCR (command ring control register)  will stop the command ring,
and CRR (command ring running) will be set to 0 by xHC. xhci_abort_cmd_ring() polls this bit up to 5 seconds.
If it's not 0 then the driver considers the command abort as failed.

The scenario you're thinking of is that xHC would still react to CA bit set, it would stop the command
ring and set CRR 0, but not send a command completion event.

Have you tried adding some debug to handle_cmd_completion() and see if you receive any event after command abortion?

The patches I mentioned earlier are also worth trying out, they might fix the cause why command aborts in the first place.

-Mathias



    

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-22 12:05         ` Mathias Nyman
@ 2016-03-23  3:53           ` Rajesh Bhagat
  2016-03-23 14:22             ` Mathias Nyman
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-23  3:53 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Tuesday, March 22, 2016 5:36 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 22.03.2016 07:19, Rajesh Bhagat wrote:
> >
> >
> >> -----Original Message-----
> >> From: Mathias Nyman [mailto:mathias.nyman@intel.com]
> >> Sent: Monday, March 21, 2016 2:46 PM
> >> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>; Mathias Nyman
> >> <mathias.nyman@linux.intel.com>; linux-usb@vger.kernel.org; linux-
> >> kernel@vger.kernel.org
> >> Cc: gregkh@linuxfoundation.org; Sriram Dash <sriram.dash@nxp.com>
> >> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation
> >> due to XHCI commmand timeout
> >>
> >> On 21.03.2016 06:18, Rajesh Bhagat wrote:
> >>>
> >>>
> >>>>
> >>>> Hi
> >>>>
> >>>> I think clearing the whole command ring is a bit too much in this case.
> >>>> It may cause issues for all attached devices when one command times out.
> >>>>
> >>>
> >>> Hi Mathias,
> >>>
> >>> I understand your point, But I want to understand how would
> >>> completion handler be called if a command is timed out and
> >>> xhci_abort_cmd_ring is successful. In this case all the code would be waiting on
> completion handler forever.
> >>>
> >>>
> >>> 2. xhci_handle_command_timeout -> xhci_abort_cmd_ring(failure) ->
> >>> xhci_cleanup_command_queue -> xhci_complete_del_and_free_cmd
> >>>
> >>> In our case command is timed out, Hence we hit the case #2 but
> >>> xhci_abort_cmd_ring is success which does not calls complete.
> >>
> >> xhci_abort_cmd_ring() will write CA bit (CMD_RING_ABORT) to CRCR register.
> >> This will generate a command completion event with status "command
> >> aborted" for the pending command.
> >> This event is then followed by a "command ring stopped" command completion
> event.
> >>
> >> See xHCI specs 5.4.5 and 4.6.1.2
> >>
> >> handle_cmd_completion() will check if cmd_comp_code ==
> >> COMP_CMD_ABORT, goto event_handled, and call
> >> xhci_complete_del_and_free_cmd(cmd, cmd_comp_code) for the aborted
> command.
> >>
> >> If xHCI already processed the aborted command, we might only get a
> >> command ring stopped event, in this case handle_cmd_completion() will
> >> call xhci_handle_stopped_cmd_ring(xhci, cmd), which will turn the
> >> commands that were tagged for "abort" that still remain on the command ring to
> NO-OP commands.
> >>
> >> The completion callback will be called for these NO-OP command later
> >> when we get a command completion event for them.
> >>
> >
> > Thanks Mathias for detailed explanation. Now I understand how
> > completion handler is supposed to be called in this scenario.
> >
> > But in our case, somehow we are not getting any event and
> > handle_cmd_completion function is not getting called even after successful
> xhci_abort_cmd_ring when command timed out.
> >
> > Now, my point here is code prior to this patch xhci: rework command
> > timeout and cancellation, Code would have returned in case command timed out in
> xhci_alloc_dev itself.
> >
> > -       /* XXX: how much time for xHC slot assignment? */
> > -       timeleft = wait_for_completion_interruptible_timeout(
> > -                       command->completion,
> > -                       XHCI_CMD_DEFAULT_TIMEOUT);
> > -       if (timeleft <= 0) {
> > -               xhci_warn(xhci, "%s while waiting for a slot\n",
> > -                               timeleft == 0 ? "Timeout" : "Signal");
> > -               /* cancel the enable slot request */
> > -               ret = xhci_cancel_cmd(xhci, NULL, command->command_trb);
> > -               return ret;
> > -       }
> > +       wait_for_completion(command->completion);
> >
> > But after this patch, we are waiting for hardware event, which is
> > somehow not generated and causing a hang scenario.
> >
> > IMO, The assumption that "xhci_abort_cmd_ring would always generate an
> > event and handle_cmd_completion would be called" will not be always be true if HW
> is in bad state.
> >
> > Please share your opinion.
> >
> 
> writing the CA (command abort) bit in CRCR (command ring control register)  will stop
> the command ring, and CRR (command ring running) will be set to 0 by xHC.
> xhci_abort_cmd_ring() polls this bit up to 5 seconds.
> If it's not 0 then the driver considers the command abort as failed.
> 
> The scenario you're thinking of is that xHC would still react to CA bit set, it would stop
> the command ring and set CRR 0, but not send a command completion event.
> 
> Have you tried adding some debug to handle_cmd_completion() and see if you receive
> any event after command abortion?
> 

Yes. We have added debug prints at first line of handle_cmd_completion, and we are not getting 
those prints. The last print messages that we get are as below from xhci_alloc_dev while resume 
operation: 

xhci-hcd xhci-hcd.0.auto: Command timeout
xhci-hcd xhci-hcd.0.auto: Abort command ring

May be somehow, USB controller is in bad state and not responding to the commands. 

Please suggest how XHCI driver can handle such situations. 

> The patches I mentioned earlier are also worth trying out, they might fix the cause
> why command aborts in the first place.
> 

We tried both the patches you mentioned. But they dint come to rescue here. 

> -Mathias
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-23  3:53           ` Rajesh Bhagat
@ 2016-03-23 14:22             ` Mathias Nyman
  2016-03-28  6:13               ` Rajesh Bhagat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-03-23 14:22 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash

On 23.03.2016 05:53, Rajesh Bhagat wrote:

>>> IMO, The assumption that "xhci_abort_cmd_ring would always generate an
>>> event and handle_cmd_completion would be called" will not be always be true if HW
>> is in bad state.
>>>
>>> Please share your opinion.
>>>
>>
>> writing the CA (command abort) bit in CRCR (command ring control register)  will stop
>> the command ring, and CRR (command ring running) will be set to 0 by xHC.
>> xhci_abort_cmd_ring() polls this bit up to 5 seconds.
>> If it's not 0 then the driver considers the command abort as failed.
>>
>> The scenario you're thinking of is that xHC would still react to CA bit set, it would stop
>> the command ring and set CRR 0, but not send a command completion event.
>>
>> Have you tried adding some debug to handle_cmd_completion() and see if you receive
>> any event after command abortion?
>>
>
> Yes. We have added debug prints at first line of handle_cmd_completion, and we are not getting
> those prints. The last print messages that we get are as below from xhci_alloc_dev while resume
> operation:
>
> xhci-hcd xhci-hcd.0.auto: Command timeout
> xhci-hcd xhci-hcd.0.auto: Abort command ring
>
> May be somehow, USB controller is in bad state and not responding to the commands.
>
> Please suggest how XHCI driver can handle such situations.
>

Restart the command timeout timer when writing the command abort bit.
If we get theIf we get the abort event the timer is deleted.

Otherwise if the timout triggers a second time we end up calling
xhci_handle_command_timeout() with a stopped ring,
This will call xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op, restart the
command ring, and finally when the no-op completes it should call the missing completion.

If command ring doesn't start then additional code could be added to xhci_handle_command_timeout()
that clears the command ring if it is called a second time (=current command is already in
abort state and command ring is stopped when entering xhci_handle_command_timeout)

There might be some details missing, I'm not able to test any of this, but try
something like this:

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 3e1d24c..576819e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
                 xhci_halt(xhci);
                 return -ESHUTDOWN;
         }
-
+       /* writing the CMD_RING_ABORT bit should create a command completion
+        * event, add a command completion timeout for it as well
+        */
+       mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
         return 0;
  }

-Mathias

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-23 14:22             ` Mathias Nyman
@ 2016-03-28  6:13               ` Rajesh Bhagat
  2016-03-29 17:21                 ` Mathias Nyman
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-28  6:13 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Wednesday, March 23, 2016 7:52 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 23.03.2016 05:53, Rajesh Bhagat wrote:
> 
> >>> IMO, The assumption that "xhci_abort_cmd_ring would always generate
> >>> an event and handle_cmd_completion would be called" will not be
> >>> always be true if HW
> >> is in bad state.
> >>>
> >>> Please share your opinion.
> >>>
> >>
> >> writing the CA (command abort) bit in CRCR (command ring control
> >> register)  will stop the command ring, and CRR (command ring running) will be set
> to 0 by xHC.
> >> xhci_abort_cmd_ring() polls this bit up to 5 seconds.
> >> If it's not 0 then the driver considers the command abort as failed.
> >>
> >> The scenario you're thinking of is that xHC would still react to CA
> >> bit set, it would stop the command ring and set CRR 0, but not send a command
> completion event.
> >>
> >> Have you tried adding some debug to handle_cmd_completion() and see
> >> if you receive any event after command abortion?
> >>
> >
> > Yes. We have added debug prints at first line of
> > handle_cmd_completion, and we are not getting those prints. The last
> > print messages that we get are as below from xhci_alloc_dev while
> > resume
> > operation:
> >
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring
> >
> > May be somehow, USB controller is in bad state and not responding to the
> commands.
> >
> > Please suggest how XHCI driver can handle such situations.
> >
> 
> Restart the command timeout timer when writing the command abort bit.
> If we get theIf we get the abort event the timer is deleted.
> 
> Otherwise if the timout triggers a second time we end up calling
> xhci_handle_command_timeout() with a stopped ring, This will call
> xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op, restart the
> command ring, and finally when the no-op completes it should call the missing
> completion.
> 
> If command ring doesn't start then additional code could be added to
> xhci_handle_command_timeout() that clears the command ring if it is called a second
> time (=current command is already in abort state and command ring is stopped when
> entering xhci_handle_command_timeout)
> 
> There might be some details missing, I'm not able to test any of this, but try
> something like this:
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index
> 3e1d24c..576819e 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>                  xhci_halt(xhci);
>                  return -ESHUTDOWN;
>          }
> -
> +       /* writing the CMD_RING_ABORT bit should create a command completion
> +        * event, add a command completion timeout for it as well
> +        */
> +       mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>          return 0;
>   }

Hello Mathias, 

Thanks for the patch. 

After application of above patch, I'm getting following prints constantly: 

xhci-hcd xhci-hcd.0.auto: Command timeout
xhci-hcd xhci-hcd.0.auto: Abort command ring
xhci-hcd xhci-hcd.0.auto: Command timeout on stopped ring
xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000 to no-op
xhci-hcd xhci-hcd.0.auto: // Ding dong!
...
xhci-hcd xhci-hcd.0.auto: Command timeout
xhci-hcd xhci-hcd.0.auto: Abort command ring
xhci-hcd xhci-hcd.0.auto: Command timeout on stopped ring
xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000 to no-op
xhci-hcd xhci-hcd.0.auto: // Ding dong!

As expected, xhci_handle_command_timeout is called again and next time ring state
is __not__ CMD_RING_STATE_RUNNING, Hence xhci_handle_stopped_cmd_ring is called 
which turn all the aborted commands to no-ops and again makes the ring state as 
CMD_RING_STATE_RUNNING, and rings the door bell. 

But again in this case, no response from USB controller and xhci_alloc_dev is still waiting for
wait_for_completion. 

Does rescheduling the xhci->cmd_timer ensures command completion will be called. IMO, it is still 
dependent on HW response. 

Please share your opinion. 

> 
> -Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-28  6:13               ` Rajesh Bhagat
@ 2016-03-29 17:21                 ` Mathias Nyman
  2016-03-31  3:51                   ` Rajesh Bhagat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-03-29 17:21 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash

On 28.03.2016 09:13, Rajesh Bhagat wrote:
>
>
>> -----Original Message-----
>> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
>> Sent: Wednesday, March 23, 2016 7:52 PM
>> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
>> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
>> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
>> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
>> commmand timeout
>>
>> On 23.03.2016 05:53, Rajesh Bhagat wrote:
>>
>>>>> IMO, The assumption that "xhci_abort_cmd_ring would always generate
>>>>> an event and handle_cmd_completion would be called" will not be
>>>>> always be true if HW
>>>> is in bad state.
>>>>>
>>>>> Please share your opinion.
>>>>>
>>>>
>>>> writing the CA (command abort) bit in CRCR (command ring control
>>>> register)  will stop the command ring, and CRR (command ring running) will be set
>> to 0 by xHC.
>>>> xhci_abort_cmd_ring() polls this bit up to 5 seconds.
>>>> If it's not 0 then the driver considers the command abort as failed.
>>>>
>>>> The scenario you're thinking of is that xHC would still react to CA
>>>> bit set, it would stop the command ring and set CRR 0, but not send a command
>> completion event.
>>>>
>>>> Have you tried adding some debug to handle_cmd_completion() and see
>>>> if you receive any event after command abortion?
>>>>
>>>
>>> Yes. We have added debug prints at first line of
>>> handle_cmd_completion, and we are not getting those prints. The last
>>> print messages that we get are as below from xhci_alloc_dev while
>>> resume
>>> operation:
>>>
>>> xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
>>> Abort command ring
>>>
>>> May be somehow, USB controller is in bad state and not responding to the
>> commands.
>>>
>>> Please suggest how XHCI driver can handle such situations.
>>>
>>
>> Restart the command timeout timer when writing the command abort bit.
>> If we get theIf we get the abort event the timer is deleted.
>>
>> Otherwise if the timout triggers a second time we end up calling
>> xhci_handle_command_timeout() with a stopped ring, This will call
>> xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op, restart the
>> command ring, and finally when the no-op completes it should call the missing
>> completion.
>>
>> If command ring doesn't start then additional code could be added to
>> xhci_handle_command_timeout() that clears the command ring if it is called a second
>> time (=current command is already in abort state and command ring is stopped when
>> entering xhci_handle_command_timeout)
>>
>> There might be some details missing, I'm not able to test any of this, but try
>> something like this:
>>
>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index
>> 3e1d24c..576819e 100644
>> --- a/drivers/usb/host/xhci-ring.c
>> +++ b/drivers/usb/host/xhci-ring.c
>> @@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>                   xhci_halt(xhci);
>>                   return -ESHUTDOWN;
>>           }
>> -
>> +       /* writing the CMD_RING_ABORT bit should create a command completion
>> +        * event, add a command completion timeout for it as well
>> +        */
>> +       mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>>           return 0;
>>    }
>
> Hello Mathias,
>
> Thanks for the patch.
>
> After application of above patch, I'm getting following prints constantly:
>
> xhci-hcd xhci-hcd.0.auto: Command timeout
> xhci-hcd xhci-hcd.0.auto: Abort command ring
> xhci-hcd xhci-hcd.0.auto: Command timeout on stopped ring
> xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000 to no-op
> xhci-hcd xhci-hcd.0.auto: // Ding dong!
> ...
> xhci-hcd xhci-hcd.0.auto: Command timeout
> xhci-hcd xhci-hcd.0.auto: Abort command ring
> xhci-hcd xhci-hcd.0.auto: Command timeout on stopped ring
> xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000 to no-op
> xhci-hcd xhci-hcd.0.auto: // Ding dong!
>
> As expected, xhci_handle_command_timeout is called again and next time ring state
> is __not__ CMD_RING_STATE_RUNNING, Hence xhci_handle_stopped_cmd_ring is called
> which turn all the aborted commands to no-ops and again makes the ring state as
> CMD_RING_STATE_RUNNING, and rings the door bell.
>
> But again in this case, no response from USB controller and xhci_alloc_dev is still waiting for
> wait_for_completion.
>
> Does rescheduling the xhci->cmd_timer ensures command completion will be called. IMO, it is still
> dependent on HW response.
>

It's still dependent on hw starting the command ring, which apparently fails

So additional code is needed that will treat a second or third timeout as
a failed abort attempt, i.e. if a command times out on a stopped ring and the
command status is already aborted, then we cleanup the command ring and kill xhci.

So in addition to the previous added timeout something like this is needed:
(Again, uncompiled, untested semi pseudocode)


diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 7cf6621..d8518e8f 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1253,6 +1253,7 @@ void xhci_handle_command_timeout(unsigned long data)
         int ret;
         unsigned long flags;
         u64 hw_ring_state;
+       u32 old_status;
         struct xhci_command *cur_cmd = NULL;
         xhci = (struct xhci_hcd *) data;
  
@@ -1260,6 +1261,7 @@ void xhci_handle_command_timeout(unsigned long data)
         spin_lock_irqsave(&xhci->lock, flags);
         if (xhci->current_cmd) {
                 cur_cmd = xhci->current_cmd;
+               old_status = cur_cmd->status;
                 cur_cmd->status = COMP_CMD_ABORT;
         }
  
@@ -1282,7 +1284,14 @@ void xhci_handle_command_timeout(unsigned long data)
         }
         /* command timeout on stopped ring, ring can't be aborted */
         xhci_dbg(xhci, "Command timeout on stopped ring\n");
-       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+       /* is this the second timeout for the same command? ring wont restart */
+       if (old_status == COMP_CMD_ABORT) {
+               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
+               xhci_cleanup_command_queue(xhci);
+               /* everything else here to halt, cleanup, free etc kill xhc */
+       } else {
+               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+       }
         spin_unlock_irqrestore(&xhci->lock, flags);
         return;
  }

-Mathias

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-29 17:21                 ` Mathias Nyman
@ 2016-03-31  3:51                   ` Rajesh Bhagat
  2016-03-31 14:36                     ` Mathias Nyman
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-03-31  3:51 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Tuesday, March 29, 2016 10:51 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 28.03.2016 09:13, Rajesh Bhagat wrote:
> >
> >
> >> -----Original Message-----
> >> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> >> Sent: Wednesday, March 23, 2016 7:52 PM
> >> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> >> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> >> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> >> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation
> >> due to XHCI commmand timeout
> >>
> >> On 23.03.2016 05:53, Rajesh Bhagat wrote:
> >>
> >>>>> IMO, The assumption that "xhci_abort_cmd_ring would always
> >>>>> generate an event and handle_cmd_completion would be called" will
> >>>>> not be always be true if HW
> >>>> is in bad state.
> >>>>>
> >>>>> Please share your opinion.
> >>>>>
> >>>>
> >>>> writing the CA (command abort) bit in CRCR (command ring control
> >>>> register)  will stop the command ring, and CRR (command ring
> >>>> running) will be set
> >> to 0 by xHC.
> >>>> xhci_abort_cmd_ring() polls this bit up to 5 seconds.
> >>>> If it's not 0 then the driver considers the command abort as failed.
> >>>>
> >>>> The scenario you're thinking of is that xHC would still react to CA
> >>>> bit set, it would stop the command ring and set CRR 0, but not send
> >>>> a command
> >> completion event.
> >>>>
> >>>> Have you tried adding some debug to handle_cmd_completion() and see
> >>>> if you receive any event after command abortion?
> >>>>
> >>>
> >>> Yes. We have added debug prints at first line of
> >>> handle_cmd_completion, and we are not getting those prints. The last
> >>> print messages that we get are as below from xhci_alloc_dev while
> >>> resume
> >>> operation:
> >>>
> >>> xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> >>> Abort command ring
> >>>
> >>> May be somehow, USB controller is in bad state and not responding to
> >>> the
> >> commands.
> >>>
> >>> Please suggest how XHCI driver can handle such situations.
> >>>
> >>
> >> Restart the command timeout timer when writing the command abort bit.
> >> If we get theIf we get the abort event the timer is deleted.
> >>
> >> Otherwise if the timout triggers a second time we end up calling
> >> xhci_handle_command_timeout() with a stopped ring, This will call
> >> xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op,
> >> restart the command ring, and finally when the no-op completes it
> >> should call the missing completion.
> >>
> >> If command ring doesn't start then additional code could be added to
> >> xhci_handle_command_timeout() that clears the command ring if it is
> >> called a second time (=current command is already in abort state and
> >> command ring is stopped when entering xhci_handle_command_timeout)
> >>
> >> There might be some details missing, I'm not able to test any of
> >> this, but try something like this:
> >>
> >> diff --git a/drivers/usb/host/xhci-ring.c
> >> b/drivers/usb/host/xhci-ring.c index 3e1d24c..576819e 100644
> >> --- a/drivers/usb/host/xhci-ring.c
> >> +++ b/drivers/usb/host/xhci-ring.c
> >> @@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> >>                   xhci_halt(xhci);
> >>                   return -ESHUTDOWN;
> >>           }
> >> -
> >> +       /* writing the CMD_RING_ABORT bit should create a command completion
> >> +        * event, add a command completion timeout for it as well
> >> +        */
> >> +       mod_timer(&xhci->cmd_timer, jiffies +
> >> + XHCI_CMD_DEFAULT_TIMEOUT);
> >>           return 0;
> >>    }
> >
> > Hello Mathias,
> >
> > Thanks for the patch.
> >
> > After application of above patch, I'm getting following prints constantly:
> >
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring xhci-hcd xhci-hcd.0.auto: Command timeout on
> > stopped ring xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000
> > to no-op xhci-hcd xhci-hcd.0.auto: // Ding dong!
> > ...
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring xhci-hcd xhci-hcd.0.auto: Command timeout on
> > stopped ring xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000
> > to no-op xhci-hcd xhci-hcd.0.auto: // Ding dong!
> >
> > As expected, xhci_handle_command_timeout is called again and next time
> > ring state is __not__ CMD_RING_STATE_RUNNING, Hence
> > xhci_handle_stopped_cmd_ring is called which turn all the aborted
> > commands to no-ops and again makes the ring state as
> CMD_RING_STATE_RUNNING, and rings the door bell.
> >
> > But again in this case, no response from USB controller and
> > xhci_alloc_dev is still waiting for wait_for_completion.
> >
> > Does rescheduling the xhci->cmd_timer ensures command completion will
> > be called. IMO, it is still dependent on HW response.
> >
> 
> It's still dependent on hw starting the command ring, which apparently fails
> 
> So additional code is needed that will treat a second or third timeout as a failed abort
> attempt, i.e. if a command times out on a stopped ring and the command status is
> already aborted, then we cleanup the command ring and kill xhci.
> 
> So in addition to the previous added timeout something like this is needed:
> (Again, uncompiled, untested semi pseudocode)
> 
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index
> 7cf6621..d8518e8f 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1253,6 +1253,7 @@ void xhci_handle_command_timeout(unsigned long data)
>          int ret;
>          unsigned long flags;
>          u64 hw_ring_state;
> +       u32 old_status;
>          struct xhci_command *cur_cmd = NULL;
>          xhci = (struct xhci_hcd *) data;
> 
> @@ -1260,6 +1261,7 @@ void xhci_handle_command_timeout(unsigned long data)
>          spin_lock_irqsave(&xhci->lock, flags);
>          if (xhci->current_cmd) {
>                  cur_cmd = xhci->current_cmd;
> +               old_status = cur_cmd->status;
>                  cur_cmd->status = COMP_CMD_ABORT;
>          }
> 
> @@ -1282,7 +1284,14 @@ void xhci_handle_command_timeout(unsigned long data)
>          }
>          /* command timeout on stopped ring, ring can't be aborted */
>          xhci_dbg(xhci, "Command timeout on stopped ring\n");
> -       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> +       /* is this the second timeout for the same command? ring wont restart */
> +       if (old_status == COMP_CMD_ABORT) {
> +               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
> +               xhci_cleanup_command_queue(xhci);
> +               /* everything else here to halt, cleanup, free etc kill xhc */
> +       } else {
> +               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> +       }
>          spin_unlock_irqrestore(&xhci->lock, flags);
>          return;
>   }
> 

Hello Mathias,

Thanks a lot for your support. 

Attached patch is working and allows the completion handler to be called. And resume operation completes 
and shell comes back (but with lot of errors). 

Now, some other points which needs our attention here are: 
1. usb core hub code is not checking the error status of hcd->driver->reset_device(xhci_discover_or_reset_device)
    and continues considering reset_device was successful (and causes other issues). Hence, a check is needed on return 
    value of reset_device and proper action is required on it. 
2. Even if above return value is checked and reset_device status is used. USB core hub retries for N times which is not
    required in this case as adding to the resume operation time. But if in this case we return -ENOTCONN instead of -EINVAL
    from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns early and doesn't retry. 

Proposed Solution for above with your suggested patches is as below: 

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 7cad1fa..efeba31 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -2807,13 +2807,17 @@ done:
                        struct usb_hcd *hcd = bus_to_hcd(udev->bus);
 
                        update_devnum(udev, 0);
-                       /* The xHC may think the device is already reset,
-                        * so ignore the status.
+                       /*
+                        * check the status of resetting the device and update
+                        * the udev status.
                         */
                        if (hcd->driver->reset_device)
-                               hcd->driver->reset_device(hcd, udev);
+                               status = hcd->driver->reset_device(hcd, udev);
 
-                       usb_set_device_state(udev, USB_STATE_DEFAULT);
+                       if (status == 0)
+                               usb_set_device_state(udev, USB_STATE_DEFAULT);
+                       else
+                               usb_set_device_state(udev, USB_STATE_NOTATTACHED);
                }
        } else {
                if (udev)
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index b3a0a22..9427175f 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
                return -ESHUTDOWN;
        }
 
+       /* writing the CMD_RING_ABORT bit should create a command completion
+        * event, add a command completion timeout for it as well
+        */
+       mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
        return 0;
 }
 
@@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
        int ret;
        unsigned long flags;
        u64 hw_ring_state;
+       u32 old_status;
        struct xhci_command *cur_cmd = NULL;
        xhci = (struct xhci_hcd *) data;
 
@@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
        spin_lock_irqsave(&xhci->lock, flags);
        if (xhci->current_cmd) {
                cur_cmd = xhci->current_cmd;
+               old_status = cur_cmd->status;
                cur_cmd->status = COMP_CMD_ABORT;
        }
 
@@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long data)
        }
        /* command timeout on stopped ring, ring can't be aborted */
        xhci_dbg(xhci, "Command timeout on stopped ring\n");
-       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+
+       /* is this the second timeout for the same command? ring wont restart */
+       if (old_status == COMP_CMD_ABORT) {
+               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
+               xhci_cleanup_command_queue(xhci);
+       /* everything else here to halt, cleanup, free etc kill xhc */
+       } else
+               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+
        spin_unlock_irqrestore(&xhci->lock, flags);
        return;
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index c502c22..bd18966 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd, struct usb_device *udev)
                if (ret == 1)
                        return 0;
                else
-                       return -EINVAL;
+                       return -ENOTCONN;     /* Don't retry */
        }
 
        if (virt_dev->tt_info)

Sample output after above patch (timer is set as wakeup source):

root@phoenix:~# echo mem > /sys/power/state    
PM: Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.001 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
PM: suspend of devices complete after 155.658 msecs
PM: late suspend of devices complete after 1.594 msecs
PM: noirq suspend of devices complete after 1.496 msecs
PM: noirq resume of devices complete after 1.290 msecs
PM: early resume of devices complete after 1.250 msecs
usb usb1: root hub lost power or was reset
usb usb2: root hub lost power or was reset
xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
PM: resume of devices complete after 25436.366 msecs     <= resume time is increased a lot even after using -ENOTCONN
Restarting tasks ... 
usb 1-1: USB disconnect, device number 2
usb 1-1.2: USB disconnect, device number 3
usb 2-1: USB disconnect, device number 2
done.
root@phoenix:~#

Please share your opinion on other changes for patch submission as well as resume time. 

- Rajesh Bhagat 

> -Mathias

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-31  3:51                   ` Rajesh Bhagat
@ 2016-03-31 14:36                     ` Mathias Nyman
  2016-04-01  3:55                       ` Rajesh Bhagat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-03-31 14:36 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash

On 31.03.2016 06:51, Rajesh Bhagat wrote:
>
>
>
> Hello Mathias,
>
> Thanks a lot for your support.
>
> Attached patch is working and allows the completion handler to be called. And resume operation completes
> and shell comes back (but with lot of errors).
>
> Now, some other points which needs our attention here are:
> 1. usb core hub code is not checking the error status of hcd->driver->reset_device(xhci_discover_or_reset_device)
>      and continues considering reset_device was successful (and causes other issues). Hence, a check is needed on return
>      value of reset_device and proper action is required on it.
> 2. Even if above return value is checked and reset_device status is used. USB core hub retries for N times which is not
>      required in this case as adding to the resume operation time. But if in this case we return -ENOTCONN instead of -EINVAL
>      from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns early and doesn't retry.
>
> Proposed Solution for above with your suggested patches is as below:
>
> diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
> index 7cad1fa..efeba31 100644
> --- a/drivers/usb/core/hub.c
> +++ b/drivers/usb/core/hub.c
> @@ -2807,13 +2807,17 @@ done:
>                          struct usb_hcd *hcd = bus_to_hcd(udev->bus);
>
>                          update_devnum(udev, 0);
> -                       /* The xHC may think the device is already reset,
> -                        * so ignore the status.
> +                       /*
> +                        * check the status of resetting the device and update
> +                        * the udev status.
>                           */
>                          if (hcd->driver->reset_device)
> -                               hcd->driver->reset_device(hcd, udev);
> +                               status = hcd->driver->reset_device(hcd, udev);
>
> -                       usb_set_device_state(udev, USB_STATE_DEFAULT);
> +                       if (status == 0)
> +                               usb_set_device_state(udev, USB_STATE_DEFAULT);
> +                       else
> +                               usb_set_device_state(udev, USB_STATE_NOTATTACHED);
>                  }
>          } else {
>                  if (udev)
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index b3a0a22..9427175f 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>                  return -ESHUTDOWN;
>          }
>
> +       /* writing the CMD_RING_ABORT bit should create a command completion
> +        * event, add a command completion timeout for it as well
> +        */
> +       mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>          return 0;
>   }
>
> @@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
>          int ret;
>          unsigned long flags;
>          u64 hw_ring_state;
> +       u32 old_status;
>          struct xhci_command *cur_cmd = NULL;
>          xhci = (struct xhci_hcd *) data;
>
> @@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
>          spin_lock_irqsave(&xhci->lock, flags);
>          if (xhci->current_cmd) {
>                  cur_cmd = xhci->current_cmd;
> +               old_status = cur_cmd->status;
>                  cur_cmd->status = COMP_CMD_ABORT;
>          }
>
> @@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long data)
>          }
>          /* command timeout on stopped ring, ring can't be aborted */
>          xhci_dbg(xhci, "Command timeout on stopped ring\n");
> -       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> +
> +       /* is this the second timeout for the same command? ring wont restart */
> +       if (old_status == COMP_CMD_ABORT) {
> +               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
> +               xhci_cleanup_command_queue(xhci);
> +       /* everything else here to halt, cleanup, free etc kill xhc */
> +       } else
> +               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> +
>          spin_unlock_irqrestore(&xhci->lock, flags);
>          return;
>   }
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index c502c22..bd18966 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd, struct usb_device *udev)
>                  if (ret == 1)
>                          return 0;
>                  else
> -                       return -EINVAL;
> +                       return -ENOTCONN;     /* Don't retry */
>          }
>
>          if (virt_dev->tt_info)
>
> Sample output after above patch (timer is set as wakeup source):
>
> root@phoenix:~# echo mem > /sys/power/state
> PM: Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.001 seconds) done.
> Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> PM: suspend of devices complete after 155.658 msecs
> PM: late suspend of devices complete after 1.594 msecs
> PM: noirq suspend of devices complete after 1.496 msecs
> PM: noirq resume of devices complete after 1.290 msecs
> PM: early resume of devices complete after 1.250 msecs
> usb usb1: root hub lost power or was reset
> usb usb2: root hub lost power or was reset
> xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> PM: resume of devices complete after 25436.366 msecs     <= resume time is increased a lot even after using -ENOTCONN
> Restarting tasks ...
> usb 1-1: USB disconnect, device number 2
> usb 1-1.2: USB disconnect, device number 3
> usb 2-1: USB disconnect, device number 2
> done.
> root@phoenix:~#
>
> Please share your opinion on other changes for patch submission as well as resume time.
>

I think more effort should be put into investigating why this happens in the first place.
What is the root cause? why doesn't xhci start properly after resume in this case.

Optimizing resume time and error paths should be secondary here, resuming faster isn't really
a matter when xhci is completely stuck.

-Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-03-31 14:36                     ` Mathias Nyman
@ 2016-04-01  3:55                       ` Rajesh Bhagat
  2016-04-01 10:49                         ` Mathias Nyman
  0 siblings, 1 reply; 18+ messages in thread
From: Rajesh Bhagat @ 2016-04-01  3:55 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Thursday, March 31, 2016 8:07 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 31.03.2016 06:51, Rajesh Bhagat wrote:
> >
> >
> >
> > Hello Mathias,
> >
> > Thanks a lot for your support.
> >
> > Attached patch is working and allows the completion handler to be
> > called. And resume operation completes and shell comes back (but with lot of
> errors).
> >
> > Now, some other points which needs our attention here are:
> > 1. usb core hub code is not checking the error status of hcd->driver-
> >reset_device(xhci_discover_or_reset_device)
> >      and continues considering reset_device was successful (and causes other issues).
> Hence, a check is needed on return
> >      value of reset_device and proper action is required on it.
> > 2. Even if above return value is checked and reset_device status is used. USB core
> hub retries for N times which is not
> >      required in this case as adding to the resume operation time. But if in this case
> we return -ENOTCONN instead of -EINVAL
> >      from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns
> early and doesn't retry.
> >
> > Proposed Solution for above with your suggested patches is as below:
> >
> > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c index
> > 7cad1fa..efeba31 100644
> > --- a/drivers/usb/core/hub.c
> > +++ b/drivers/usb/core/hub.c
> > @@ -2807,13 +2807,17 @@ done:
> >                          struct usb_hcd *hcd = bus_to_hcd(udev->bus);
> >
> >                          update_devnum(udev, 0);
> > -                       /* The xHC may think the device is already reset,
> > -                        * so ignore the status.
> > +                       /*
> > +                        * check the status of resetting the device and update
> > +                        * the udev status.
> >                           */
> >                          if (hcd->driver->reset_device)
> > -                               hcd->driver->reset_device(hcd, udev);
> > +                               status =
> > + hcd->driver->reset_device(hcd, udev);
> >
> > -                       usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +                       if (status == 0)
> > +                               usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +                       else
> > +                               usb_set_device_state(udev,
> > + USB_STATE_NOTATTACHED);
> >                  }
> >          } else {
> >                  if (udev)
> > diff --git a/drivers/usb/host/xhci-ring.c
> > b/drivers/usb/host/xhci-ring.c index b3a0a22..9427175f 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> >                  return -ESHUTDOWN;
> >          }
> >
> > +       /* writing the CMD_RING_ABORT bit should create a command completion
> > +        * event, add a command completion timeout for it as well
> > +        */
> > +       mod_timer(&xhci->cmd_timer, jiffies +
> > + XHCI_CMD_DEFAULT_TIMEOUT);
> >          return 0;
> >   }
> >
> > @@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
> >          int ret;
> >          unsigned long flags;
> >          u64 hw_ring_state;
> > +       u32 old_status;
> >          struct xhci_command *cur_cmd = NULL;
> >          xhci = (struct xhci_hcd *) data;
> >
> > @@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
> >          spin_lock_irqsave(&xhci->lock, flags);
> >          if (xhci->current_cmd) {
> >                  cur_cmd = xhci->current_cmd;
> > +               old_status = cur_cmd->status;
> >                  cur_cmd->status = COMP_CMD_ABORT;
> >          }
> >
> > @@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long
> data)
> >          }
> >          /* command timeout on stopped ring, ring can't be aborted */
> >          xhci_dbg(xhci, "Command timeout on stopped ring\n");
> > -       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> > +
> > +       /* is this the second timeout for the same command? ring wont restart */
> > +       if (old_status == COMP_CMD_ABORT) {
> > +               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
> > +               xhci_cleanup_command_queue(xhci);
> > +       /* everything else here to halt, cleanup, free etc kill xhc */
> > +       } else
> > +               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> > +
> >          spin_unlock_irqrestore(&xhci->lock, flags);
> >          return;
> >   }
> > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index
> > c502c22..bd18966 100644
> > --- a/drivers/usb/host/xhci.c
> > +++ b/drivers/usb/host/xhci.c
> > @@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd,
> struct usb_device *udev)
> >                  if (ret == 1)
> >                          return 0;
> >                  else
> > -                       return -EINVAL;
> > +                       return -ENOTCONN;     /* Don't retry */
> >          }
> >
> >          if (virt_dev->tt_info)
> >
> > Sample output after above patch (timer is set as wakeup source):
> >
> > root@phoenix:~# echo mem > /sys/power/state
> > PM: Syncing filesystems ... done.
> > Freezing user space processes ... (elapsed 0.001 seconds) done.
> > Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > PM: suspend of devices complete after 155.658 msecs
> > PM: late suspend of devices complete after 1.594 msecs
> > PM: noirq suspend of devices complete after 1.496 msecs
> > PM: noirq resume of devices complete after 1.290 msecs
> > PM: early resume of devices complete after 1.250 msecs usb usb1: root
> > hub lost power or was reset usb usb2: root hub lost power or was reset
> > xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> > xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> > xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> > xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> > xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> > xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> > PM: resume of devices complete after 25436.366 msecs     <= resume time is
> increased a lot even after using -ENOTCONN
> > Restarting tasks ...
> > usb 1-1: USB disconnect, device number 2 usb 1-1.2: USB disconnect,
> > device number 3 usb 2-1: USB disconnect, device number 2 done.
> > root@phoenix:~#
> >
> > Please share your opinion on other changes for patch submission as well as resume
> time.
> >

> 
> I think more effort should be put into investigating why this happens in the first place.
> What is the root cause? why doesn't xhci start properly after resume in this case.
> 
Hello Mathias, 

I understand your point and would surely debug the root cause of the issue. But implementing the 
fallback mechanism in SW, if HW does not respond well seems to a better solution to me. 

Just to add, code prior to common implementation of xhci_handle_command_timeout was handling the above 
case and was __not__ dependent on HW. 

Please comment on the possibility of above fallback mechanism in XHCI SW and any side effects that you 
can foresee. 

> Optimizing resume time and error paths should be secondary here, resuming faster
> isn't really a matter when xhci is completely stuck.
> 

I agree on above point. 

- Rajesh Bhagat 

> -Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-04-01  3:55                       ` Rajesh Bhagat
@ 2016-04-01 10:49                         ` Mathias Nyman
  2016-04-11  3:46                           ` Rajesh Bhagat
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Nyman @ 2016-04-01 10:49 UTC (permalink / raw)
  To: Rajesh Bhagat; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash

On 01.04.2016 06:55, Rajesh Bhagat wrote:

>>>
>>> Please share your opinion on other changes for patch submission as well as resume
>> time.
>>>
>
>>
>> I think more effort should be put into investigating why this happens in the first place.
>> What is the root cause? why doesn't xhci start properly after resume in this case.
>>
> Hello Mathias,
>
> I understand your point and would surely debug the root cause of the issue. But implementing the
> fallback mechanism in SW, if HW does not respond well seems to a better solution to me.
>

Yes, I'll add something like this, at least the xhci parts to make sure the completions are called.

This is not really a fallback, it's just a better burial for xhci. It will be unusable after this.

-Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout
  2016-04-01 10:49                         ` Mathias Nyman
@ 2016-04-11  3:46                           ` Rajesh Bhagat
  0 siblings, 0 replies; 18+ messages in thread
From: Rajesh Bhagat @ 2016-04-11  3:46 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: gregkh, linux-usb, linux-kernel, Sriram Dash



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@linux.intel.com]
> Sent: Friday, April 01, 2016 4:20 PM
> To: Rajesh Bhagat <rajesh.bhagat@nxp.com>
> Cc: gregkh@linuxfoundation.org; linux-usb@vger.kernel.org; linux-
> kernel@vger.kernel.org; Sriram Dash <sriram.dash@nxp.com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 01.04.2016 06:55, Rajesh Bhagat wrote:
> 
> >>>
> >>> Please share your opinion on other changes for patch submission as
> >>> well as resume
> >> time.
> >>>
> >
> >>
> >> I think more effort should be put into investigating why this happens in the first
> place.
> >> What is the root cause? why doesn't xhci start properly after resume in this case.
> >>
> > Hello Mathias,
> >
> > I understand your point and would surely debug the root cause of the
> > issue. But implementing the fallback mechanism in SW, if HW does not respond well
> seems to a better solution to me.
> >
> 
> Yes, I'll add something like this, at least the xhci parts to make sure the completions
> are called.
> 

Should I send the patch?

> This is not really a fallback, it's just a better burial for xhci. It will be unusable after
> this.
> 
> -Mathias

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-04-11  3:46 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-18  7:01 [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout Rajesh Bhagat
2016-03-18 11:20 ` Mathias Nyman
2016-03-21  4:18   ` Rajesh Bhagat
2016-03-21  9:16     ` Mathias Nyman
2016-03-22  5:19       ` Rajesh Bhagat
2016-03-22 12:05         ` Mathias Nyman
2016-03-23  3:53           ` Rajesh Bhagat
2016-03-23 14:22             ` Mathias Nyman
2016-03-28  6:13               ` Rajesh Bhagat
2016-03-29 17:21                 ` Mathias Nyman
2016-03-31  3:51                   ` Rajesh Bhagat
2016-03-31 14:36                     ` Mathias Nyman
2016-04-01  3:55                       ` Rajesh Bhagat
2016-04-01 10:49                         ` Mathias Nyman
2016-04-11  3:46                           ` Rajesh Bhagat
2016-03-18 14:21 ` Alan Stern
2016-03-21  4:22   ` Rajesh Bhagat
2016-03-21 15:02     ` Alan Stern

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.