linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* OHCI root_port_reset() deadly loop...
@ 2007-10-07  6:53 David Miller
  2007-10-07  7:31 ` David Brownell
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-07  6:53 UTC (permalink / raw)
  To: dbrownell; +Cc: linux-usb-users, linux-kernel, greg


When root_port_reset() in ohci-hub.c polls for the end of the reset,
it puts no limit on the loop and will only exit the loop when either
the RH_PS_PRS bit clears or the register returns all 1's (the latter
condition is a recent addition).

If for some reason the bit never clears, we sit here forever and never
exit the loop.

I actually hit this on one of my machines, and I'm trying to track
down what's happening.

Regardless of why my machine is doing this, there absolutely should be
some upper bound put on the number of times we will run through this
loop, perhaps enough such that up to 5 seconds elapses waiting for the
reset bit to clear.  And if it times out we should print a loud
message onto the console, but still try to continue.

Thanks!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-07  6:53 OHCI root_port_reset() deadly loop David Miller
@ 2007-10-07  7:31 ` David Brownell
  2007-10-07  7:51   ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-07  7:31 UTC (permalink / raw)
  To: davem; +Cc: linux-usb-users, linux-kernel, greg

> From davem@davemloft.net  Sat Oct  6 23:56:49 2007
>
> When root_port_reset() in ohci-hub.c polls for the end of the reset,
> it puts no limit on the loop and will only exit the loop when either
> the RH_PS_PRS bit clears or the register returns all 1's (the latter
> condition is a recent addition).

The all-ones typically indicating something like CardBus eject
not preceded by a "polite" driver shutdown.


> If for some reason the bit never clears, we sit here forever and never
> exit the loop.
>
> I actually hit this on one of my machines, and I'm trying to track
> down what's happening.

Sounds like a hardware issue -- unless something else is trashing
controller state.  We've not observed that specific hardware failure
before, for what it's worth.

Is this SPARC, or is ACPI potentially in play?  PCI, or non-PCI?
Are the other ports still behaving?  Is EHCI maybe trying to switch
ownership of that port?  Is maybe the (newish) autosuspend stuff
kicking in?

The OHCI spec requires the controller to stop the reset itself.
Most silicon seems to have a built-in 10 msec timeout.


> Regardless of why my machine is doing this, there absolutely should be
> some upper bound put on the number of times we will run through this
> loop, perhaps enough such that up to 5 seconds elapses waiting for the
> reset bit to clear.  And if it times out we should print a loud
> message onto the console, but still try to continue.

Patches accepted.  :)

Since the PRS bit is specified as "write one", with writing zero
as no-effect (since the rest is hardware-timed), the only recovery
procedure might involve resetting the whole controller.  Messy,
and not something the usb core has historically handled very well.

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-07  7:31 ` David Brownell
@ 2007-10-07  7:51   ` David Miller
  2007-10-08 23:54     ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-07  7:51 UTC (permalink / raw)
  To: david-b; +Cc: linux-usb-users, linux-kernel, greg

From: David Brownell <david-b@pacbell.net>
Date: Sun, 07 Oct 2007 00:31:41 -0700

> Is this SPARC, or is ACPI potentially in play?  PCI, or non-PCI?

It's sparc64 on PCI.

> Are the other ports still behaving?  Is EHCI maybe trying to switch
> ownership of that port?  Is maybe the (newish) autosuspend stuff
> kicking in?

I wouldn't know, the machine hangs and doesn't get any further.

> Patches accepted.  :)

I'm 700 patches deep with an additionally large backlog of
patches to apply for the networking and sparc64 tree.

I don't have the time, which is why I reported the problem
to the OHCI maintainer instead of letting it slip through
the cracks.

> Since the PRS bit is specified as "write one", with writing zero
> as no-effect (since the rest is hardware-timed), the only recovery
> procedure might involve resetting the whole controller.  Messy,
> and not something the usb core has historically handled very well.

At a minimum you should exit the loop and print out a warning
messages and try to continue even without trying to reset the
whole controller if that will take some developer effort.

Anything is better than just hanging there forever.  Not every
user knows to hit ALT-SYSRQ then 'P' to get a register dump to
figure out why their computer stopped booting.

Every register polling loops, without exception, should have a limit
and exit with an error indication when that limit is reached.  It
will never hang someones machine, because although not this time, in
many cases these kinds of hangs are absolutely impossible to debug
(interrupts disabled, critical lock held, etc.)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-07  7:51   ` David Miller
@ 2007-10-08 23:54     ` David Miller
  2007-10-09  3:10       ` Greg KH
                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: David Miller @ 2007-10-08 23:54 UTC (permalink / raw)
  To: david-b; +Cc: linux-usb-users, linux-kernel, greg

From: David Miller <davem@davemloft.net>
Date: Sun, 07 Oct 2007 00:51:56 -0700 (PDT)

> From: David Brownell <david-b@pacbell.net>
> Date: Sun, 07 Oct 2007 00:31:41 -0700
> 
> > Are the other ports still behaving?  Is EHCI maybe trying to switch
> > ownership of that port?  Is maybe the (newish) autosuspend stuff
> > kicking in?
> 
> I wouldn't know, the machine hangs and doesn't get any further.

To add some more information here, I think the EHCI idea might
hold some water.

What I have here are two NEC OHCI USB interfaces and one NEC EHCI
USB interface on PCI.  Aparently they all go through a shared
USB hub, mapped like this:

HUB Port 1: OHCI #1, EHCI
HUB Port 2: OHCI #2, EHCI
HUB Port 3: OHCI #1, EHCI
HUB Port 4: OHCI #2, EHCI
HUB Port 5: OHCI #1, EHCI

The OHCI ports go out to external USB connectors on the back panel of
the machine, whereas the EHCI is connected up to an internal USB
storage CDROM device and what appears to be another USB hub.

The problem seems to be very strongly tied to timing.  For example
simply adding "ignore_loglevel" to the kernel boot command line can
make the problem go away.

This got me thinking about your EHCI comment.

If these controllers are going through the same HUB, things might go
south if OHCI initialized first, then khubd et al. are asynchronously
accessing the segments behind OHCI at the same time that the EHCI
driver is initializing.  Perhaps, this is the kind of sequence of
events which makes one of the root ports reset in such a way that the
the reset bit never clears.

Given that this machine has 64 cpus, the likelyhood for such parallel
accesses is very likely :-)

Does this make any sense?

Regardless, here is a patch that hardens the OHCI reset handling
loops so that they break out instead of hanging the entire system
should this condition occur.  It's at least better than what the
code does to a user right now which is hang the box completely:

[USB] ohci: Do not hang the system if port reset does not complete.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..77ae5b4 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 	u32	temp;
 	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
 	u16	reset_done = now + PORT_RESET_MSEC;
+	int	limit_1;
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
 	 * this might need to be deadline-scheduled.
 	 */
-	do {
+	limit_1 = 100;
+	while (--limit_1 >= 0) {
+		int limit_2;
+
 		/* spin until any current reset finishes */
-		for (;;) {
+		limit_2 = PORT_RESET_MSEC * 2;
+		while (--limit_2 >= 0) {
 			temp = ohci_readl (ohci, portstat);
 			/* handle e.g. CardBus eject */
 			if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 				break;
 			udelay (500);
 		}
+		if (limit_2 < 0) {
+			ohci_warn(ohci, "Root port inner-loop reset timeout, "
+				  "portstat[%08x]\n", temp);
+		}
 
 		if (!(temp & RH_PS_CCS))
 			break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
 		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+		if (!tick_before(now, reset_done))
+			break;
+	}
+	if (limit_1 < 0) {
+		ohci_warn(ohci, "Root port outer-loop reset timeout, "
+			  "now[%04x] reset_done[%04x]\n",
+			  now, reset_done);
+	}
 	/* caller synchronizes using PRSC */
 
 	return 0;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-08 23:54     ` David Miller
@ 2007-10-09  3:10       ` Greg KH
  2007-10-09  3:16         ` David Miller
  2007-10-09  4:09       ` David Brownell
  2007-10-09  4:36       ` David Brownell
  2 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2007-10-09  3:10 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel

On Mon, Oct 08, 2007 at 04:54:20PM -0700, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 07 Oct 2007 00:51:56 -0700 (PDT)
> 
> > From: David Brownell <david-b@pacbell.net>
> > Date: Sun, 07 Oct 2007 00:31:41 -0700
> > 
> > > Are the other ports still behaving?  Is EHCI maybe trying to switch
> > > ownership of that port?  Is maybe the (newish) autosuspend stuff
> > > kicking in?
> > 
> > I wouldn't know, the machine hangs and doesn't get any further.
> 
> To add some more information here, I think the EHCI idea might
> hold some water.
> 
> What I have here are two NEC OHCI USB interfaces and one NEC EHCI
> USB interface on PCI.  Aparently they all go through a shared
> USB hub, mapped like this:
> 
> HUB Port 1: OHCI #1, EHCI
> HUB Port 2: OHCI #2, EHCI
> HUB Port 3: OHCI #1, EHCI
> HUB Port 4: OHCI #2, EHCI
> HUB Port 5: OHCI #1, EHCI
> 
> The OHCI ports go out to external USB connectors on the back panel of
> the machine, whereas the EHCI is connected up to an internal USB
> storage CDROM device and what appears to be another USB hub.
> 
> The problem seems to be very strongly tied to timing.  For example
> simply adding "ignore_loglevel" to the kernel boot command line can
> make the problem go away.
> 
> This got me thinking about your EHCI comment.
> 
> If these controllers are going through the same HUB, things might go
> south if OHCI initialized first, then khubd et al. are asynchronously
> accessing the segments behind OHCI at the same time that the EHCI
> driver is initializing.  Perhaps, this is the kind of sequence of
> events which makes one of the root ports reset in such a way that the
> the reset bit never clears.
> 
> Given that this machine has 64 cpus, the likelyhood for such parallel
> accesses is very likely :-)
> 
> Does this make any sense?

Yes it does, I'm seeing reports from some hardware companies of the very
same thing.  If you serialize and load the ehci driver first, and then
the ohci driver, that should fix the problem.

Does that also work for you?  Or are these drivers built into the
kernel?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  3:10       ` Greg KH
@ 2007-10-09  3:16         ` David Miller
  2007-10-09  3:34           ` David Brownell
  2007-10-09 16:01           ` [Linux-usb-users] " Alan Stern
  0 siblings, 2 replies; 49+ messages in thread
From: David Miller @ 2007-10-09  3:16 UTC (permalink / raw)
  To: greg; +Cc: david-b, linux-usb-users, linux-kernel

From: Greg KH <greg@kroah.com>
Date: Mon, 8 Oct 2007 20:10:49 -0700

> Yes it does, I'm seeing reports from some hardware companies of the very
> same thing.  If you serialize and load the ehci driver first, and then
> the ohci driver, that should fix the problem.
> 
> Does that also work for you?  Or are these drivers built into the
> kernel?

As coicidence would have it I finally found a recipe for triggering
the issue, and it ties into what you're talking about here.

It happens only if I make sure OHCI gets loaded first and then EHCI
right afterwards.

It seems that indeed it is important for EHCI to get loaded first,
and in-kernel this is ensured by the link ordering.

However, when both OHCI and EHCI are built as modules (or, similarly
I guess, OHCI is built-in and EHCI is modular) there appears to be
nothing in userspace which makes sure EHCI gets loaded first.

When this triggers, in OHCI's root_port_reset(), the port status
register reads 0x111 in that inner-loop and the value never changes.
It stays like this forever.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  3:16         ` David Miller
@ 2007-10-09  3:34           ` David Brownell
  2007-10-09  3:42             ` David Miller
  2007-10-09 16:01           ` [Linux-usb-users] " Alan Stern
  1 sibling, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-09  3:34 UTC (permalink / raw)
  To: greg, davem; +Cc: linux-usb-users, linux-kernel

> However, when both OHCI and EHCI are built as modules (or, similarly
> I guess, OHCI is built-in and EHCI is modular) there appears to be
> nothing in userspace which makes sure EHCI gets loaded first.

The old /etc/hotplug/usb.rc script made sure to load those modules
in the correct order:  EHCI first.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  3:34           ` David Brownell
@ 2007-10-09  3:42             ` David Miller
  2007-10-09  4:39               ` Greg KH
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-09  3:42 UTC (permalink / raw)
  To: david-b; +Cc: greg, linux-usb-users, linux-kernel

From: David Brownell <david-b@pacbell.net>
Date: Mon, 08 Oct 2007 20:34:12 -0700

> > However, when both OHCI and EHCI are built as modules (or, similarly
> > I guess, OHCI is built-in and EHCI is modular) there appears to be
> > nothing in userspace which makes sure EHCI gets loaded first.
> 
> The old /etc/hotplug/usb.rc script made sure to load those modules
> in the correct order:  EHCI first.

I expected to find something cute attempting to handle this under
/etc/udev, I have failed so far :-)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-08 23:54     ` David Miller
  2007-10-09  3:10       ` Greg KH
@ 2007-10-09  4:09       ` David Brownell
  2007-10-09  5:13         ` Benjamin Herrenschmidt
  2007-10-09  4:36       ` David Brownell
  2 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-09  4:09 UTC (permalink / raw)
  To: davem; +Cc: linux-usb-users, linux-kernel, greg

> To add some more information here, I think the EHCI idea might
> hold some water.
>
> What I have here are two NEC OHCI USB interfaces and one NEC EHCI
> USB interface on PCI.  Aparently they all go through a shared
> USB hub, mapped like this:
>
> HUB Port 1: OHCI #1, EHCI
> HUB Port 2: OHCI #2, EHCI
> HUB Port 3: OHCI #1, EHCI
> HUB Port 4: OHCI #2, EHCI
> HUB Port 5: OHCI #1, EHCI
>
> The OHCI ports go out to external USB connectors on the back panel of
> the machine, whereas the EHCI is connected up to an internal USB
> storage CDROM device and what appears to be another USB hub.

There's actually no such thing as an "EHCI port" or an "OHCI port".
Instead, there's a set of ports, each of which can be switched so
the USB differential data signals go up to either controller.

When EHCI starts, that switch points to EHCI so that devices can try
enumerating with high speed signaling.  When a device doesn't respond
to that "chirp", the EHCI root hub driver switches the port to the
companion controller.  (Which is OHCI here, UHCI on some PCs, etc.)


> The problem seems to be very strongly tied to timing.  For example
> simply adding "ignore_loglevel" to the kernel boot command line can
> make the problem go away.
>
> This got me thinking about your EHCI comment.
>
> If these controllers are going through the same HUB, things might go
> south if OHCI initialized first, then khubd et al. are asynchronously
> accessing the segments behind OHCI at the same time that the EHCI
> driver is initializing.  Perhaps, this is the kind of sequence of
> events which makes one of the root ports reset in such a way that the
> the reset bit never clears.
>
> Given that this machine has 64 cpus, the likelyhood for such parallel
> accesses is very likely :-)
>
> Does this make any sense?

Yes, that's why I asked about EHCI.  My speculation would be that
OHCI starts the reset, and EHCI claims the port before it completes;
or contrariwise OHCI starts the reset right after EHCI claims it.

And there's some point in that process where a hardware race makes
the trouble you've observed.  I believe there are plenty of other
places where it's perfectly fine if EHCI grabs the port, or this
little race would have shown up many times before.

- Dave



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-08 23:54     ` David Miller
  2007-10-09  3:10       ` Greg KH
  2007-10-09  4:09       ` David Brownell
@ 2007-10-09  4:36       ` David Brownell
  2007-10-09  4:44         ` David Miller
  2 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-09  4:36 UTC (permalink / raw)
  To: davem; +Cc: linux-usb-users, linux-kernel, greg

> Regardless, here is a patch that hardens the OHCI reset handling
> loops so that they break out instead of hanging the entire system
> should this condition occur.  It's at least better than what the
> code does to a user right now which is hang the box completely:
>
> [USB] ohci: Do not hang the system if port reset does not complete.
>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
> index bb9cc59..77ae5b4 100644
> --- a/drivers/usb/host/ohci-hub.c
> +++ b/drivers/usb/host/ohci-hub.c
> @@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
>  	u32	temp;
>  	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
>  	u16	reset_done = now + PORT_RESET_MSEC;
> +	int	limit_1;
>  
>  	/* build a "continuous enough" reset signal, with up to
>  	 * 3msec gap between pulses.  scheduler HZ==100 must work;
>  	 * this might need to be deadline-scheduled.
>  	 */
> -	do {
> +	limit_1 = 100;
> +	while (--limit_1 >= 0) {

Don't need this "limit_1" timeout; "reset_done" handles all
the timeout needed there.  The regs->fmnumber is essentially
a millisecond counter.


> +		int limit_2;
> +
>  		/* spin until any current reset finishes */
> -		for (;;) {
> +		limit_2 = PORT_RESET_MSEC * 2;

This is the loop that didn't terminate for you, right?
PORT_RESET_HW_MSEC is the ceiling you should use here,
not PORT_RESET_MSEC.


> +		while (--limit_2 >= 0) {
>  			temp = ohci_readl (ohci, portstat);
>  			/* handle e.g. CardBus eject */
>  			if (temp == ~(u32)0)
> @@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
>  				break;
>  			udelay (500);
>  		}
> +		if (limit_2 < 0) {
> +			ohci_warn(ohci, "Root port inner-loop reset timeout, "
> +				  "portstat[%08x]\n", temp);
> +		}

What values do you see for "portstat"?

I suspect there will be some flag set which would allow a more
immediate exit from that loop.  RH_PS_CCS might clear, for example.

And in any case, if that fails I don't see any reason not to just
break, and return immediately.

>  
>  		if (!(temp & RH_PS_CCS))
>  			break;
> @@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
>  		ohci_writel (ohci, RH_PS_PRS, portstat);
>  		msleep(PORT_RESET_HW_MSEC);
>  		now = ohci_readl(ohci, &ohci->regs->fmnumber);
> -	} while (tick_before(now, reset_done));
> +		if (!tick_before(now, reset_done))
> +			break;
> +	}
> +	if (limit_1 < 0) {
> +		ohci_warn(ohci, "Root port outer-loop reset timeout, "
> +			  "now[%04x] reset_done[%04x]\n",
> +			  now, reset_done);
> +	}
>  	/* caller synchronizes using PRSC */
>  
>  	return 0;
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  3:42             ` David Miller
@ 2007-10-09  4:39               ` Greg KH
  2007-10-09  4:47                 ` David Miller
  2007-10-09  5:00                 ` David Brownell
  0 siblings, 2 replies; 49+ messages in thread
From: Greg KH @ 2007-10-09  4:39 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel

On Mon, Oct 08, 2007 at 08:42:36PM -0700, David Miller wrote:
> From: David Brownell <david-b@pacbell.net>
> Date: Mon, 08 Oct 2007 20:34:12 -0700
> 
> > > However, when both OHCI and EHCI are built as modules (or, similarly
> > > I guess, OHCI is built-in and EHCI is modular) there appears to be
> > > nothing in userspace which makes sure EHCI gets loaded first.
> > 
> > The old /etc/hotplug/usb.rc script made sure to load those modules
> > in the correct order:  EHCI first.
> 
> I expected to find something cute attempting to handle this under
> /etc/udev, I have failed so far :-)

No, nothing cute in udev itself, but it seems that all distros that I
know of have a "load these modules now" type setting in their init
scripts that can be used here.

I can't think of a way to enforce this load order on the modules
themselves due to the fact that OHCI might not even be needed for EHCI
devices on UHCI (Intel) based chipsets :(

Can anyone else?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:36       ` David Brownell
@ 2007-10-09  4:44         ` David Miller
  2007-10-09 16:38           ` David Brownell
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-09  4:44 UTC (permalink / raw)
  To: david-b; +Cc: linux-usb-users, linux-kernel, greg

From: David Brownell <david-b@pacbell.net>
Date: Mon, 08 Oct 2007 21:36:43 -0700

> Don't need this "limit_1" timeout; "reset_done" handles all
> the timeout needed there.  The regs->fmnumber is essentially
> a millisecond counter.

If the hardware hangs and the register stops incrementing,
the entire kernel will hang.  That is unacceptable.

We do need it.

> 
> > +		int limit_2;
> > +
> >  		/* spin until any current reset finishes */
> > -		for (;;) {
> > +		limit_2 = PORT_RESET_MSEC * 2;
> 
> This is the loop that didn't terminate for you, right?
> PORT_RESET_HW_MSEC is the ceiling you should use here,
> not PORT_RESET_MSEC.

Ok, fixed.

> What values do you see for "portstat"?

0x111

> I suspect there will be some flag set which would allow a more
> immediate exit from that loop.  RH_PS_CCS might clear, for example.

Absolutely nothing clears in the register from it's initial value.

Here is the patch with the limit_2 initial value fixed.

I kept loop_1 in there, it is necessary.  No kernel code should
hang in an endless loop because of malfunctioning hardware.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..9149593 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 	u32	temp;
 	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
 	u16	reset_done = now + PORT_RESET_MSEC;
+	int	limit_1;
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
 	 * this might need to be deadline-scheduled.
 	 */
-	do {
+	limit_1 = 100;
+	while (--limit_1 >= 0) {
+		int limit_2;
+
 		/* spin until any current reset finishes */
-		for (;;) {
+		limit_2 = PORT_RESET_HW_MSEC * 2;
+		while (--limit_2 >= 0) {
 			temp = ohci_readl (ohci, portstat);
 			/* handle e.g. CardBus eject */
 			if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 				break;
 			udelay (500);
 		}
+		if (limit_2 < 0) {
+			ohci_warn(ohci, "Root port inner-loop reset timeout, "
+				  "portstat[%08x]\n", temp);
+		}
 
 		if (!(temp & RH_PS_CCS))
 			break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
 		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+		if (!tick_before(now, reset_done))
+			break;
+	}
+	if (limit_1 < 0) {
+		ohci_warn(ohci, "Root port outer-loop reset timeout, "
+			  "now[%04x] reset_done[%04x]\n",
+			  now, reset_done);
+	}
 	/* caller synchronizes using PRSC */
 
 	return 0;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:39               ` Greg KH
@ 2007-10-09  4:47                 ` David Miller
  2007-10-09  5:11                   ` Benjamin Herrenschmidt
  2007-10-09  6:06                   ` Greg KH
  2007-10-09  5:00                 ` David Brownell
  1 sibling, 2 replies; 49+ messages in thread
From: David Miller @ 2007-10-09  4:47 UTC (permalink / raw)
  To: greg; +Cc: david-b, linux-usb-users, linux-kernel

From: Greg KH <greg@kroah.com>
Date: Mon, 8 Oct 2007 21:39:09 -0700

> No, nothing cute in udev itself, but it seems that all distros that I
> know of have a "load these modules now" type setting in their init
> scripts that can be used here.
> 
> I can't think of a way to enforce this load order on the modules
> themselves due to the fact that OHCI might not even be needed for EHCI
> devices on UHCI (Intel) based chipsets :(
> 
> Can anyone else?

The three modules perhaps should be a bundle of whatever ones you have
enabled, and internally we can dispatch the initialization to occur in
the correct order from a top-level module_init().

If the devices need to be initialized in a certain order in a
situation like this, it really seems like it is the kernel's job to
enforce it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:39               ` Greg KH
  2007-10-09  4:47                 ` David Miller
@ 2007-10-09  5:00                 ` David Brownell
  2007-10-09  5:23                   ` David Miller
  2007-10-09  6:43                   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 49+ messages in thread
From: David Brownell @ 2007-10-09  5:00 UTC (permalink / raw)
  To: greg, davem; +Cc: linux-usb-users, linux-kernel

> > > The old /etc/hotplug/usb.rc script made sure to load those modules
> > > in the correct order:  EHCI first.
> > 
> > I expected to find something cute attempting to handle this under
> > /etc/udev, I have failed so far :-)
>
> No, nothing cute in udev itself, but it seems that all distros that I
> know of have a "load these modules now" type setting in their init
> scripts that can be used here.
>
> I can't think of a way to enforce this load order on the modules
> themselves due to the fact that OHCI might not even be needed for EHCI
> devices on UHCI (Intel) based chipsets :(

Assuming PCI is present, /sys/bus/pci/devices/*/class can tell
if EHCI is present (0x0c0320) ... if so, load that driver.
Then repeat for OHCI (0x0c0310) and UHCI (0x0c0300).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:47                 ` David Miller
@ 2007-10-09  5:11                   ` Benjamin Herrenschmidt
  2007-10-09  6:06                   ` Greg KH
  1 sibling, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-09  5:11 UTC (permalink / raw)
  To: David Miller; +Cc: greg, david-b, linux-usb-users, linux-kernel


On Mon, 2007-10-08 at 21:47 -0700, David Miller wrote:
> From: Greg KH <greg@kroah.com>
> Date: Mon, 8 Oct 2007 21:39:09 -0700
> 
> > No, nothing cute in udev itself, but it seems that all distros that I
> > know of have a "load these modules now" type setting in their init
> > scripts that can be used here.
> > 
> > I can't think of a way to enforce this load order on the modules
> > themselves due to the fact that OHCI might not even be needed for EHCI
> > devices on UHCI (Intel) based chipsets :(
> > 
> > Can anyone else?
> 
> The three modules perhaps should be a bundle of whatever ones you have
> enabled, and internally we can dispatch the initialization to occur in
> the correct order from a top-level module_init().
> 
> If the devices need to be initialized in a certain order in a
> situation like this, it really seems like it is the kernel's job to
> enforce it.

Is the problem strictly an ordering problem or just a race ? In the
later case, maybe some better arbitration by the USB core to make
sure things are quiescent or in a known state when letting a new HCD
register might help ?

Ben.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:09       ` David Brownell
@ 2007-10-09  5:13         ` Benjamin Herrenschmidt
  2007-10-09  5:26           ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-09  5:13 UTC (permalink / raw)
  To: David Brownell; +Cc: davem, linux-usb-users, linux-kernel, greg

> Yes, that's why I asked about EHCI.  My speculation would be that
> OHCI starts the reset, and EHCI claims the port before it completes;
> or contrariwise OHCI starts the reset right after EHCI claims it.
> 
> And there's some point in that process where a hardware race makes
> the trouble you've observed.  I believe there are plenty of other
> places where it's perfectly fine if EHCI grabs the port, or this
> little race would have shown up many times before.

Since we can't know which O/UHCI is paired with which EHCI, we can't
really have generic code to deal with that race, but maybe we can be
smart and basically mutex khubd activity such as port reset vs.
registration of any new HCD ?

I'm not even sure module load order is 100% fault proof here since khubd
spawns as a thread...
 
Ben.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  5:00                 ` David Brownell
@ 2007-10-09  5:23                   ` David Miller
  2007-10-09  6:43                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 49+ messages in thread
From: David Miller @ 2007-10-09  5:23 UTC (permalink / raw)
  To: david-b; +Cc: greg, linux-usb-users, linux-kernel

From: David Brownell <david-b@pacbell.net>
Date: Mon, 08 Oct 2007 22:00:19 -0700

> Assuming PCI is present, /sys/bus/pci/devices/*/class can tell
> if EHCI is present (0x0c0320) ... if so, load that driver.
> Then repeat for OHCI (0x0c0310) and UHCI (0x0c0300).

These are facts all of us know very well, but implementing this in
userspace in a failsafe manner isn't practical.  That's what we're
discussing.

There are things that autoload USB drivers way before udev or similar
even get started.

For example, the first thing some distributions do is try to load the
correct keyboard maps.  Guess what that can do?  It triggers a load of
all of the modular USB host controller drivers in case we have a USB
keyboard.

The only real solution is in the kernel, because it is the only
clean place to trap all of the potential module load events.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  5:13         ` Benjamin Herrenschmidt
@ 2007-10-09  5:26           ` David Miller
  2007-10-09  6:37             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-09  5:26 UTC (permalink / raw)
  To: benh; +Cc: david-b, linux-usb-users, linux-kernel, greg

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 09 Oct 2007 15:13:36 +1000

> I'm not even sure module load order is 100% fault proof here since
> khubd spawns as a thread...

I'm concerned about that as well, thanks for bringing it up.

My understanding, however, is that the critical thing is that the EHCI
device reset being done by the EHCI driver probe occurs and completes
first.  If that is true, then just making sure EHCI loads initially is
a sufficient constraint to fix this problem.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:47                 ` David Miller
  2007-10-09  5:11                   ` Benjamin Herrenschmidt
@ 2007-10-09  6:06                   ` Greg KH
  2007-10-09 19:22                     ` [linux-usb-devel] " David Brownell
  1 sibling, 1 reply; 49+ messages in thread
From: Greg KH @ 2007-10-09  6:06 UTC (permalink / raw)
  To: David Miller, linux-usb-devel, Alan Stern
  Cc: david-b, linux-usb-users, linux-kernel

On Mon, Oct 08, 2007 at 09:47:27PM -0700, David Miller wrote:
> From: Greg KH <greg@kroah.com>
> Date: Mon, 8 Oct 2007 21:39:09 -0700
> 
> > No, nothing cute in udev itself, but it seems that all distros that I
> > know of have a "load these modules now" type setting in their init
> > scripts that can be used here.
> > 
> > I can't think of a way to enforce this load order on the modules
> > themselves due to the fact that OHCI might not even be needed for EHCI
> > devices on UHCI (Intel) based chipsets :(
> > 
> > Can anyone else?
> 
> The three modules perhaps should be a bundle of whatever ones you have
> enabled, and internally we can dispatch the initialization to occur in
> the correct order from a top-level module_init().
> 
> If the devices need to be initialized in a certain order in a
> situation like this, it really seems like it is the kernel's job to
> enforce it.

I agree.

Here's some information from Intel about where they have seen this
happen for UHCI controllers, so it's not just an OHCI issue :(

thanks,

greg k-h

----------------


We had a logic analyzer attached to the bus going to the ESB (ICH) which
has the USB controller in it. In the passing case we would see no
accesses to UHCI IO registers while EHCI initialized and sets its config
flag. The EHCI Port Status & Control registers were then read and then
we see a write to the EHCI Port Status & Control registers port owner
bit for the low speed devices (keyboard & mouse). This turns control
back over to the companion UHCI controller.=20

In our most prevalent failing case (#1 below) we never saw the write to
port owner bit on the ports with the low speed devices. In the passing
case we see the write to the port owner bit.

I do not see how this would have anything to do with flakey hardware
especially since we can reproduce this on all of our systems and the
same device (USB controller) is used on multiple products.=20

I really believe that this has to do with the UHCI and EHCI drivers
running on top of each other. This seems to be happening fairly often on
our systems. If the EHCI driver runs first then we do not see the
failure. If they are running at the same time then we see different
failure symptoms.=20

1) We see that the ports with low speed devices are still in EHCI mode
(port owner bit not written to in EHCI driver). In our analyzer captures
we see the reads from the Port Status & Control register and it is
indicating that there are low speed devices on the ports. Can you tell
us why the driver would not be doing the write to the port owner bit
when it sees that low speed devices are attached to that port? Is there
something specific that it looks for and decides not to do the write?

2) In other cases we see that the ports with the low speed devices are
back in UHCI mode but the ports are disabled. In this case we see from
the analyzer traces that the UHCI driver has completed setting up the
port. It has actually enabled that port in UHCI mode. We then see the
EHCI driver comes in and it resets everything. The driver then gives
control back to the UCHI controller (by setting the port owner bit)
but...since the UHCI driver has already setup this port once it seems
that it does not go back and set it up again. In this case we do not
think that the UHCI driver has completed running when the EHCI driver
comes in and does the reset. Can you tell us if the UHCI driver was
interrupted in the middle but after the ports with the low speed devices
had been enabled would the UHCI driver ever go back and reinitialize the
ports with the low speed devices?

3) In some cases we see errors in the DMESG log but it seems to recover.

So we really do believe that it has to do with the EHCI driver running
in the middle of the UHCI driver running. And then dependent upon when
the EHCI driver comes in, while the UHCI driver is running, we see the
different failures. And since by default these drivers are not forced to
run sequentially we are susceptible to the failure.=20


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  5:26           ` David Miller
@ 2007-10-09  6:37             ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-09  6:37 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel, greg


On Mon, 2007-10-08 at 22:26 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 09 Oct 2007 15:13:36 +1000
> 
> > I'm not even sure module load order is 100% fault proof here since
> > khubd spawns as a thread...
> 
> I'm concerned about that as well, thanks for bringing it up.
> 
> My understanding, however, is that the critical thing is that the EHCI
> device reset being done by the EHCI driver probe occurs and completes
> first.  If that is true, then just making sure EHCI loads initially is
> a sufficient constraint to fix this problem.

Yup, that would be, though I hate that sort of load order
dependencies...

Ben.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  5:00                 ` David Brownell
  2007-10-09  5:23                   ` David Miller
@ 2007-10-09  6:43                   ` Benjamin Herrenschmidt
  2007-10-09 18:48                     ` David Brownell
  1 sibling, 1 reply; 49+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-09  6:43 UTC (permalink / raw)
  To: David Brownell; +Cc: greg, davem, linux-usb-users, linux-kernel


On Mon, 2007-10-08 at 22:00 -0700, David Brownell wrote:
> > > > The old /etc/hotplug/usb.rc script made sure to load those modules
> > > > in the correct order:  EHCI first.
> > > 
> > > I expected to find something cute attempting to handle this under
> > > /etc/udev, I have failed so far :-)
> >
> > No, nothing cute in udev itself, but it seems that all distros that I
> > know of have a "load these modules now" type setting in their init
> > scripts that can be used here.
> >
> > I can't think of a way to enforce this load order on the modules
> > themselves due to the fact that OHCI might not even be needed for EHCI
> > devices on UHCI (Intel) based chipsets :(
> 
> Assuming PCI is present, /sys/bus/pci/devices/*/class can tell
> if EHCI is present (0x0c0320) ... if so, load that driver.
> Then repeat for OHCI (0x0c0310) and UHCI (0x0c0300).

That will not work for all of the non-PCI implementations though.

Ben.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09  3:16         ` David Miller
  2007-10-09  3:34           ` David Brownell
@ 2007-10-09 16:01           ` Alan Stern
  2007-10-09 17:39             ` Greg KH
  1 sibling, 1 reply; 49+ messages in thread
From: Alan Stern @ 2007-10-09 16:01 UTC (permalink / raw)
  To: David Miller
  Cc: Greg KH, David Brownell, Kernel development list, USB users list

On Mon, 8 Oct 2007, David Miller wrote:

> As coicidence would have it I finally found a recipe for triggering
> the issue, and it ties into what you're talking about here.
> 
> It happens only if I make sure OHCI gets loaded first and then EHCI
> right afterwards.
> 
> It seems that indeed it is important for EHCI to get loaded first,
> and in-kernel this is ensured by the link ordering.
> 
> However, when both OHCI and EHCI are built as modules (or, similarly
> I guess, OHCI is built-in and EHCI is modular) there appears to be
> nothing in userspace which makes sure EHCI gets loaded first.
> 
> When this triggers, in OHCI's root_port_reset(), the port status
> register reads 0x111 in that inner-loop and the value never changes.
> It stays like this forever.

I agree with Ben; the best way to solve this is in the kernel.  Even if 
you did manage to make sure that modules were loaded in the right order 
initially, that wouldn't help if somebody starts unloading and loading 
modules by hand.

The underlying cause of the problem appears to be related to the fact
that it is impossible in principle to detect a disconnect while a
full/low-speed reset is in progress.  However the hardware shouldn't
rely on this, and it should fail gracefully when an unexpected
disconnect does occur.

For example, what would the controller do if the user unplugged the USB 
cable right in the middle of a reset?

Anyway, it certainly would be easy enough to make port resets mutually
exclusive with EHCI initialization.  That would work on any platform,
PCI or not.

Alan Stern


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  4:44         ` David Miller
@ 2007-10-09 16:38           ` David Brownell
  2007-10-09 20:41             ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-09 16:38 UTC (permalink / raw)
  To: davem; +Cc: linux-usb-users, linux-kernel, greg

> > What values do you see for "portstat"?
>
> 0x111

That's unfortunately useless ... PPS, PRS, CCS (meaning powered,
resetting, connected).  In short, there is *no* indication from
the OHCI hardware that a disconnect happened.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 16:01           ` [Linux-usb-users] " Alan Stern
@ 2007-10-09 17:39             ` Greg KH
  2007-10-09 18:42               ` Alan Stern
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2007-10-09 17:39 UTC (permalink / raw)
  To: Alan Stern
  Cc: David Miller, David Brownell, Kernel development list, USB users list

On Tue, Oct 09, 2007 at 12:01:54PM -0400, Alan Stern wrote:
> On Mon, 8 Oct 2007, David Miller wrote:
> 
> > As coicidence would have it I finally found a recipe for triggering
> > the issue, and it ties into what you're talking about here.
> > 
> > It happens only if I make sure OHCI gets loaded first and then EHCI
> > right afterwards.
> > 
> > It seems that indeed it is important for EHCI to get loaded first,
> > and in-kernel this is ensured by the link ordering.
> > 
> > However, when both OHCI and EHCI are built as modules (or, similarly
> > I guess, OHCI is built-in and EHCI is modular) there appears to be
> > nothing in userspace which makes sure EHCI gets loaded first.
> > 
> > When this triggers, in OHCI's root_port_reset(), the port status
> > register reads 0x111 in that inner-loop and the value never changes.
> > It stays like this forever.
> 
> I agree with Ben; the best way to solve this is in the kernel.  Even if 
> you did manage to make sure that modules were loaded in the right order 
> initially, that wouldn't help if somebody starts unloading and loading 
> modules by hand.
> 
> The underlying cause of the problem appears to be related to the fact
> that it is impossible in principle to detect a disconnect while a
> full/low-speed reset is in progress.  However the hardware shouldn't
> rely on this, and it should fail gracefully when an unexpected
> disconnect does occur.
> 
> For example, what would the controller do if the user unplugged the USB 
> cable right in the middle of a reset?
> 
> Anyway, it certainly would be easy enough to make port resets mutually
> exclusive with EHCI initialization.  That would work on any platform,
> PCI or not.

That's a good idea, any thoughts as to how to do this?  I think I can
find some Intel and Dell people who would be glad to test this out if
you have a proposed patch.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 17:39             ` Greg KH
@ 2007-10-09 18:42               ` Alan Stern
  2007-10-09 18:59                 ` David Brownell
  2007-10-09 21:27                 ` David Miller
  0 siblings, 2 replies; 49+ messages in thread
From: Alan Stern @ 2007-10-09 18:42 UTC (permalink / raw)
  To: Greg KH
  Cc: David Miller, David Brownell, Kernel development list, USB users list

On Tue, 9 Oct 2007, Greg KH wrote:

> > Anyway, it certainly would be easy enough to make port resets mutually
> > exclusive with EHCI initialization.  That would work on any platform,
> > PCI or not.
> 
> That's a good idea, any thoughts as to how to do this?  I think I can
> find some Intel and Dell people who would be glad to test this out if
> you have a proposed patch.

Here is a proposed patch.  I haven't tried running it, but it compiles 
okay.

Alan Stern



Index: usb-2.6/drivers/usb/core/hcd.h
===================================================================
--- usb-2.6.orig/drivers/usb/core/hcd.h
+++ usb-2.6/drivers/usb/core/hcd.h
@@ -19,6 +19,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/rwsem.h>
+
 /* This file contains declarations of usbcore internals that are mostly
  * used or exposed by Host Controller Drivers.
  */
@@ -470,5 +472,9 @@ static inline void usbmon_urb_complete(s
 		: (in_interrupt () ? "in_interrupt" : "can sleep"))
 
 
-#endif /* __KERNEL__ */
+/* Mutual exclusion for EHCI CF initialization.  This interferes with
+ * port reset on some companion controllers.
+ */
+extern struct rw_semaphore ehci_cf_port_reset_rwsem;
 
+#endif /* __KERNEL__ */
Index: usb-2.6/drivers/usb/core/hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/core/hub.c
+++ usb-2.6/drivers/usb/core/hub.c
@@ -125,6 +125,12 @@ MODULE_PARM_DESC(use_both_schemes,
 		"try the other device initialization scheme if the "
 		"first one fails");
 
+/* Mutual exclusion for EHCI CF initialization.  This interferes with
+ * port reset on some companion controllers.
+ */
+DECLARE_RWSEM(ehci_cf_port_reset_rwsem);
+EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);
+
 
 static inline char *portspeed(int portstatus)
 {
@@ -1579,6 +1585,8 @@ static int hub_port_reset(struct usb_hub
 {
 	int i, status;
 
+	down_read(&ehci_cf_port_reset_rwsem);
+
 	/* Reset the port */
 	for (i = 0; i < PORT_RESET_TRIES; i++) {
 		status = set_port_feature(hub->hdev,
@@ -1610,7 +1618,7 @@ static int hub_port_reset(struct usb_hub
 			usb_set_device_state(udev, status
 					? USB_STATE_NOTATTACHED
 					: USB_STATE_DEFAULT);
-			return status;
+			goto done;
 		}
 
 		dev_dbg (hub->intfdev,
@@ -1623,6 +1631,8 @@ static int hub_port_reset(struct usb_hub
 		"Cannot enable port %i.  Maybe the USB cable is bad?\n",
 		port1);
 
+ done:
+	up_read(&ehci_cf_port_reset_rwsem);
 	return status;
 }
 
Index: usb-2.6/drivers/usb/host/ehci-hcd.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ehci-hcd.c
+++ usb-2.6/drivers/usb/host/ehci-hcd.c
@@ -646,9 +646,11 @@ static int ehci_run (struct usb_hcd *hcd
 	 * involved with the root hub.  (Except where one is integrated,
 	 * and there's no companion controller unless maybe for USB OTG.)
 	 */
+	down_write(&ehci_cf_port_reset_rwsem);
 	hcd->state = HC_STATE_RUNNING;
 	ehci_writel(ehci, FLAG_CF, &ehci->regs->configured_flag);
 	ehci_readl(ehci, &ehci->regs->command);	/* unblock posted writes */
+	up_write(&ehci_cf_port_reset_rwsem);
 
 	temp = HC_VERSION(ehci_readl(ehci, &ehci->caps->hc_capbase));
 	ehci_info (ehci,


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09  6:43                   ` Benjamin Herrenschmidt
@ 2007-10-09 18:48                     ` David Brownell
  0 siblings, 0 replies; 49+ messages in thread
From: David Brownell @ 2007-10-09 18:48 UTC (permalink / raw)
  To: benh; +Cc: linux-usb-users, linux-kernel, greg, davem

> > Assuming PCI is present, /sys/bus/pci/devices/*/class can tell
> > if EHCI is present (0x0c0320) ... if so, load that driver.
> > Then repeat for OHCI (0x0c0310) and UHCI (0x0c0300).
>
> That will not work for all of the non-PCI implementations though.

Oddly enough, that's why I said "assuming PCI is present".  ;)

Anything using the ARC/TDI RTL doesn't have this issue ... it
includes full/low speed support directly, without a companion
controller.  That covers the Freescale silicon and some others.
For a long time I think that was the only generally available
non-PCI implementation.

But you're right that there are (now) a few other cases.
In-tree right now are also the Au1200, ps3, and 440epx.

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 18:42               ` Alan Stern
@ 2007-10-09 18:59                 ` David Brownell
  2007-10-09 21:27                 ` David Miller
  1 sibling, 0 replies; 49+ messages in thread
From: David Brownell @ 2007-10-09 18:59 UTC (permalink / raw)
  To: stern, greg; +Cc: linux-usb-users, linux-kernel, davem

> Here is a proposed patch.  I haven't tried running it, but it compiles 
> okay.

Looks about right, too; thanks.  Minor comments:

> --- usb-2.6.orig/drivers/usb/core/hcd.h
> +++ usb-2.6/drivers/usb/core/hcd.h
> @@ -470,5 +472,9 @@ static inline void usbmon_urb_complete(s
>  		: (in_interrupt () ? "in_interrupt" : "can sleep"))
>  
>  
> -#endif /* __KERNEL__ */
> +/* Mutual exclusion for EHCI CF initialization.  This interferes with
> + * port reset on some companion controllers.
> + */
> +extern struct rw_semaphore ehci_cf_port_reset_rwsem;

You have two copies of that comment; I'd drop this one, and add a
better one around CF initialization.

What I'd have here is a comment that this symbol is exported ONLY
for use by the EHCI driver, trespassers will be violated, etc.


> --- usb-2.6.orig/drivers/usb/host/ehci-hcd.c
> +++ usb-2.6/drivers/usb/host/ehci-hcd.c
> @@ -646,9 +646,11 @@ static int ehci_run (struct usb_hcd *hcd
>  	 * involved with the root hub.  (Except where one is integrated,
>  	 * and there's no companion controller unless maybe for USB OTG.)
>  	 */
> +	down_write(&ehci_cf_port_reset_rwsem);
>  	hcd->state = HC_STATE_RUNNING;
>  	ehci_writel(ehci, FLAG_CF, &ehci->regs->configured_flag);
>  	ehci_readl(ehci, &ehci->regs->command);	/* unblock posted writes */
> +	up_write(&ehci_cf_port_reset_rwsem);

That deserves a comment about how CF and companion port resets
don't mix well.

... assuming this does check out as the problem, which I expect
it will.

- Dave

>  
>  	temp = HC_VERSION(ehci_readl(ehci, &ehci->caps->hc_capbase));
>  	ehci_info (ehci,
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [linux-usb-devel] OHCI root_port_reset() deadly loop...
  2007-10-09  6:06                   ` Greg KH
@ 2007-10-09 19:22                     ` David Brownell
  2007-10-10 15:32                       ` Alan Stern
  0 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-09 19:22 UTC (permalink / raw)
  To: stern, linux-usb-devel, greg, davem; +Cc: linux-usb-users, linux-kernel

> From: Greg KH <greg@kroah.com>
>
> Here's some information from Intel about where they have seen this
> happen for UHCI controllers, so it's not just an OHCI issue :(
>
>	...
>
> 1) We see that the ports with low speed devices are still in EHCI mode
> (port owner bit not written to in EHCI driver). In our analyzer captures
> we see the reads from the Port Status & Control register and it is
> indicating that there are low speed devices on the ports. Can you tell
> us why the driver would not be doing the write to the port owner bit
> when it sees that low speed devices are attached to that port? Is there
> something specific that it looks for and decides not to do the write?

It's hard to say without more info about what's going on.  It
might be on a non-enumeraton code path -- where nothing expects
to set the OWNER bit -- or the CONNECT bit might not be set.
See host/ehci-hub.c for details about exactly what the EHCI parts
are doing, and core/hub.c for how it's driven.

(It can be tricky to make sense of all that, easier if debugging
messages are turned on.   But that changes timings.  Better yet
would be being non-intrusive tracing of the actual code paths
that the CPU follows.)


> 2) In other cases we see that the ports with the low speed devices are
> back in UHCI mode but the ports are disabled. In this case we see from
> the analyzer traces that the UHCI driver has completed setting up the
> port. It has actually enabled that port in UHCI mode. We then see the
> EHCI driver comes in and it resets everything. The driver then gives
> control back to the UCHI controller (by setting the port owner bit)
> but...since the UHCI driver has already setup this port once it seems
> that it does not go back and set it up again. In this case we do not
> think that the UHCI driver has completed running when the EHCI driver
> comes in and does the reset. Can you tell us if the UHCI driver was
> interrupted in the middle but after the ports with the low speed devices
> had been enabled would the UHCI driver ever go back and reinitialize the
> ports with the low speed devices?

If the UHCI driver got disconnect and reconnect notifications,
I expect it would do so.  At least OHCI seems *not* to have gotten
such notifications.  I'm speculating that the "just a switch"
behavior for the CF (and OWNER) bits may not allow notifications
like that to happen when the companion controllers are resetting.

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09 16:38           ` David Brownell
@ 2007-10-09 20:41             ` David Miller
  2007-10-09 20:46               ` Greg KH
  2007-10-09 21:09               ` David Brownell
  0 siblings, 2 replies; 49+ messages in thread
From: David Miller @ 2007-10-09 20:41 UTC (permalink / raw)
  To: david-b; +Cc: linux-usb-users, linux-kernel, greg

From: David Brownell <david-b@pacbell.net>
Date: Tue, 09 Oct 2007 09:38:27 -0700

> > > What values do you see for "portstat"?
> >
> > 0x111
> 
> That's unfortunately useless ... PPS, PRS, CCS (meaning powered,
> resetting, connected).  In short, there is *no* indication from
> the OHCI hardware that a disconnect happened.

But it's an excellent example of why my patch is mandatory.

When the device gets into a stuck state, this code does the wrong
thing.

Instead of trying to deal with bad situations, it hangs the system
instead.  That sucks for anyone trying to debug something like this.
This has caused weeks of debugging and grief for people, and it
could have all been eliminated if this looping code were more
robust to hardware failures.

If you can't see why this is bad programming practice in a hardware
driver, I will try to get my patch to someone who does.

Even though I'm severly overloaded, you asked me for a patch, I gave
you one.  And now you want to push it back at me because it isn't
clear to you yet that the code there right now is a huge problem.

And all of this is independant of making sure EHCI loads first,
we need to fix that too.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09 20:41             ` David Miller
@ 2007-10-09 20:46               ` Greg KH
  2007-10-09 21:05                 ` David Brownell
  2007-10-09 21:09               ` David Brownell
  1 sibling, 1 reply; 49+ messages in thread
From: Greg KH @ 2007-10-09 20:46 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel

On Tue, Oct 09, 2007 at 01:41:16PM -0700, David Miller wrote:
> From: David Brownell <david-b@pacbell.net>
> Date: Tue, 09 Oct 2007 09:38:27 -0700
> 
> > > > What values do you see for "portstat"?
> > >
> > > 0x111
> > 
> > That's unfortunately useless ... PPS, PRS, CCS (meaning powered,
> > resetting, connected).  In short, there is *no* indication from
> > the OHCI hardware that a disconnect happened.
> 
> But it's an excellent example of why my patch is mandatory.

I agree.  David Brownell, unless you strongly object, I'll take David
Miller's patch and add it to my tree.

And if you do object, can you respond to the objection with a version of
the patch that you would accept?  :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09 20:46               ` Greg KH
@ 2007-10-09 21:05                 ` David Brownell
  0 siblings, 0 replies; 49+ messages in thread
From: David Brownell @ 2007-10-09 21:05 UTC (permalink / raw)
  To: greg, davem; +Cc: linux-usb-users, linux-kernel

> > > > > What values do you see for "portstat"?
> > > >
> > > > 0x111
> > > 
> > > That's unfortunately useless ... PPS, PRS, CCS (meaning powered,
> > > resetting, connected).  In short, there is *no* indication from
> > > the OHCI hardware that a disconnect happened.
> > 
> > But it's an excellent example of why my patch is mandatory.

I never disagreed that some patch for that issue was needed.


> I agree.  David Brownell, unless you strongly object, I'll take David
> Miller's patch and add it to my tree.
>
> And if you do object, can you respond to the objection with a version of
> the patch that you would accept?  :)

I'll send some version with a signoff.

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: OHCI root_port_reset() deadly loop...
  2007-10-09 20:41             ` David Miller
  2007-10-09 20:46               ` Greg KH
@ 2007-10-09 21:09               ` David Brownell
  1 sibling, 0 replies; 49+ messages in thread
From: David Brownell @ 2007-10-09 21:09 UTC (permalink / raw)
  To: davem; +Cc: linux-usb-users, linux-kernel, greg

> Even though I'm severly overloaded, you asked me for a patch, I gave
> you one.

And I got some questions as I reviewed it.  That's far
from un-heard-of.  One particular question related to
one potential simplification ... which unfortunately
couldn't check out.  Also not un-heard-of.


>	And now you want to push it back at me because it isn't
> clear to you yet that the code there right now is a huge problem.

You're reading things into what I wrote which were not there.
Calm down.

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 18:42               ` Alan Stern
  2007-10-09 18:59                 ` David Brownell
@ 2007-10-09 21:27                 ` David Miller
  2007-10-09 21:43                   ` David Brownell
  1 sibling, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-09 21:27 UTC (permalink / raw)
  To: stern; +Cc: greg, david-b, linux-kernel, linux-usb-users

From: Alan Stern <stern@rowland.harvard.edu>
Date: Tue, 9 Oct 2007 14:42:34 -0400 (EDT)

> On Tue, 9 Oct 2007, Greg KH wrote:
> 
> > > Anyway, it certainly would be easy enough to make port resets mutually
> > > exclusive with EHCI initialization.  That would work on any platform,
> > > PCI or not.
> > 
> > That's a good idea, any thoughts as to how to do this?  I think I can
> > find some Intel and Dell people who would be glad to test this out if
> > you have a proposed patch.
> 
> Here is a proposed patch.  I haven't tried running it, but it compiles 
> okay.

Are we sure that this is the only event that causes the conflicts between
the EHCI and it's companion virtual OHCI/UHCI host(s)?

Is there a chip reset done to the EHCI controller early during device
probe that could cause problems too?

I know it's grotty, but I definitely feel more comfortable with the
fix involving making sure EHCI loads and fully probes first.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 21:27                 ` David Miller
@ 2007-10-09 21:43                   ` David Brownell
  2007-10-09 22:00                     ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-09 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: stern, greg, linux-kernel, linux-usb-users

On Tuesday 09 October 2007, David Miller wrote:
> Are we sure that this is the only event that causes the conflicts between
> the EHCI and it's companion virtual OHCI/UHCI host(s)?

It's the event which can cause trouble when EHCI starts up
after OHCI/UHCI (rather than before it).

The other event is setting the OWNER bit, which clearly
has not caused trouble.

Does it resolve the problem you observed?

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 21:43                   ` David Brownell
@ 2007-10-09 22:00                     ` David Miller
  2007-10-10  4:35                       ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-09 22:00 UTC (permalink / raw)
  To: david-b; +Cc: stern, greg, linux-kernel, linux-usb-users

From: David Brownell <david-b@pacbell.net>
Date: Tue, 9 Oct 2007 14:43:54 -0700

> Does it resolve the problem you observed?

I will definitely give the patch a try and report back.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-09 22:00                     ` David Miller
@ 2007-10-10  4:35                       ` David Miller
  2007-10-15 22:01                         ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-10  4:35 UTC (permalink / raw)
  To: david-b; +Cc: stern, greg, linux-kernel, linux-usb-users

From: David Miller <davem@davemloft.net>
Date: Tue, 09 Oct 2007 15:00:40 -0700 (PDT)

> From: David Brownell <david-b@pacbell.net>
> Date: Tue, 9 Oct 2007 14:43:54 -0700
> 
> > Does it resolve the problem you observed?
> 
> I will definitely give the patch a try and report back.

My tests look really good with Alan Stern's patch.

With a kernel is hacked to load OHCI before EHCI, and
that always hangs, the patch makes the hang go away
reliably.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [linux-usb-devel] OHCI root_port_reset() deadly loop...
  2007-10-09 19:22                     ` [linux-usb-devel] " David Brownell
@ 2007-10-10 15:32                       ` Alan Stern
  0 siblings, 0 replies; 49+ messages in thread
From: Alan Stern @ 2007-10-10 15:32 UTC (permalink / raw)
  To: David Brownell
  Cc: linux-usb-devel, greg, davem, linux-usb-users, linux-kernel

On Tue, 9 Oct 2007, David Brownell wrote:

> > From: Greg KH <greg@kroah.com>
> >
> > Here's some information from Intel about where they have seen this
> > happen for UHCI controllers, so it's not just an OHCI issue :(
> >
> >	...
> >

The Intel error reports here are kind of unclear.

> > 1) We see that the ports with low speed devices are still in EHCI mode
> > (port owner bit not written to in EHCI driver).

_When_ are they still in EHCI mode?

> > In our analyzer captures
> > we see the reads from the Port Status & Control register and it is
> > indicating that there are low speed devices on the ports. Can you tell
> > us why the driver would not be doing the write to the port owner bit
> > when it sees that low speed devices are attached to that port? Is there
> > something specific that it looks for and decides not to do the write?
> 
> It's hard to say without more info about what's going on.  It
> might be on a non-enumeraton code path -- where nothing expects
> to set the OWNER bit -- or the CONNECT bit might not be set.
> See host/ehci-hub.c for details about exactly what the EHCI parts
> are doing, and core/hub.c for how it's driven.

Right.  In general the driver _does_ write to the port owner bit when 
it sees a low-speed device is attached to the port.  If that didn't 
happen in this case, maybe it's because the driver didn't see it.  
There are only one or two places in the code where the driver checks.

> (It can be tricky to make sense of all that, easier if debugging
> messages are turned on.   But that changes timings.  Better yet
> would be being non-intrusive tracing of the actual code paths
> that the CPU follows.)

It certainly would help us if the tests were made on a kernel with
CONFIG_USB_DEBUG enabled.

> > 2) In other cases we see that the ports with the low speed devices are
> > back in UHCI mode but the ports are disabled. In this case we see from
> > the analyzer traces that the UHCI driver has completed setting up the
> > port. It has actually enabled that port in UHCI mode.

This implies that the reset sequence finished successfully.  Did that 
actually happen?

> > We then see the
> > EHCI driver comes in and it resets everything. The driver then gives
> > control back to the UCHI controller (by setting the port owner bit)
> > but...since the UHCI driver has already setup this port once it seems
> > that it does not go back and set it up again. In this case we do not
> > think that the UHCI driver has completed running when the EHCI driver
> > comes in and does the reset.

But above they said that it _had_ completed!  Did they mean that the 
reset was complete but the driver hadn't yet detected that it was 
complete?

> > Can you tell us if the UHCI driver was
> > interrupted in the middle but after the ports with the low speed devices
> > had been enabled would the UHCI driver ever go back and reinitialize the
> > ports with the low speed devices?
> 
> If the UHCI driver got disconnect and reconnect notifications,
> I expect it would do so.  At least OHCI seems *not* to have gotten
> such notifications.  I'm speculating that the "just a switch"
> behavior for the CF (and OWNER) bits may not allow notifications
> like that to happen when the companion controllers are resetting.

That seems likely.  There's no way (as far as I can tell) for a host to 
see a disconnect while a reset is in progress.  Of course, as soon as 
the reset is over then the host should see the disconnect, and it 
should set the corresponding Connect-Status-Changed bit.

Amusing scenario: Suppose that ehci-hcd initializes the EHCI controller 
(taking control of the port), sees that the device isn't high speed, 
and switches port ownership back to the companion controller, all 
during the span of the companion's reset.  The companion would never 
know anything had happened!  But then everything should just work -- 
the port would be enabled and communication should proceed normally.

Alan Stern


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-10  4:35                       ` David Miller
@ 2007-10-15 22:01                         ` David Miller
  2007-10-15 23:39                           ` David Brownell
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-15 22:01 UTC (permalink / raw)
  To: david-b; +Cc: stern, greg, linux-kernel, linux-usb-users

From: David Miller <davem@davemloft.net>
Date: Tue, 09 Oct 2007 21:35:07 -0700 (PDT)

> From: David Miller <davem@davemloft.net>
> Date: Tue, 09 Oct 2007 15:00:40 -0700 (PDT)
> 
> > From: David Brownell <david-b@pacbell.net>
> > Date: Tue, 9 Oct 2007 14:43:54 -0700
> > 
> > > Does it resolve the problem you observed?
> > 
> > I will definitely give the patch a try and report back.
> 
> My tests look really good with Alan Stern's patch.
> 
> With a kernel is hacked to load OHCI before EHCI, and
> that always hangs, the patch makes the hang go away
> reliably.

Bad news, even with the rwsem after a lot more testing I can still
trigger the hang in ohci_hub_control() :-(

I think we need to go back to considering the total serialization
approach to this problem.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-15 22:01                         ` David Miller
@ 2007-10-15 23:39                           ` David Brownell
  2007-10-15 23:58                             ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: David Brownell @ 2007-10-15 23:39 UTC (permalink / raw)
  To: davem; +Cc: stern, linux-usb-users, linux-kernel, greg

> Bad news, even with the rwsem after a lot more testing I can still
> trigger the hang in ohci_hub_control() :-(
>
> I think we need to go back to considering the total serialization
> approach to this problem.

We shouldn't need that.  What happens if you add an msleep(5)
before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?

The theory there being that the switch triggered by setting CF
doesn't take effect instantaneously, contrary to the effective
assumption of that code.  A delay of 5 msec seems like it should
be more than enough, but that's kind of a guess ... it's good to
keep that low, since unfortunately that's in the critical path
for OLPC "resume from idle".

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-15 23:39                           ` David Brownell
@ 2007-10-15 23:58                             ` David Miller
  2007-10-16 15:23                               ` Alan Stern
  2007-10-16 18:26                               ` David Brownell
  0 siblings, 2 replies; 49+ messages in thread
From: David Miller @ 2007-10-15 23:58 UTC (permalink / raw)
  To: david-b; +Cc: stern, linux-usb-users, linux-kernel, greg

From: David Brownell <david-b@pacbell.net>
Date: Mon, 15 Oct 2007 16:39:10 -0700

> > Bad news, even with the rwsem after a lot more testing I can still
> > trigger the hang in ohci_hub_control() :-(
> >
> > I think we need to go back to considering the total serialization
> > approach to this problem.
> 
> We shouldn't need that.  What happens if you add an msleep(5)
> before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?

What happens is the heisenbug will go away for another week.

> The theory there being that the switch triggered by setting CF
> doesn't take effect instantaneously, contrary to the effective
> assumption of that code.  A delay of 5 msec seems like it should
> be more than enough, but that's kind of a guess ... it's good to
> keep that low, since unfortunately that's in the critical path
> for OLPC "resume from idle".

I want to help with this, but if I even breath on the kernel the bug
goes away.  The race just gets harder to trigger, and if we just keep
adding things it'll make the problem go away but for the absolutely
wrong reasons.

The only way we will provably fix this is to make sure EHCI initialize
fully, first, regardless of kernel config or what userland does.

Also, David, you haven't done anything with the feedback I gave to the
most recent revision of the OHCI hub reset anti-wedge patch.  You
removed the debug logging when the outer-loop timeout expires, and I
asked that you put that back so that if it happens there is some
chance to know that this is what happened.  If it's not supposed to
happen, there is no harm in putting the debugging log message there
so that if the impossible does happen we find out about it.

I really don't think it's appropriate for that bug fix to sit yet
another week.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-15 23:58                             ` David Miller
@ 2007-10-16 15:23                               ` Alan Stern
  2007-10-16 22:06                                 ` David Miller
  2007-10-16 22:08                                 ` David Miller
  2007-10-16 18:26                               ` David Brownell
  1 sibling, 2 replies; 49+ messages in thread
From: Alan Stern @ 2007-10-16 15:23 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel, greg

On Mon, 15 Oct 2007, David Miller wrote:

> I want to help with this, but if I even breath on the kernel the bug
> goes away.  The race just gets harder to trigger, and if we just keep
> adding things it'll make the problem go away but for the absolutely
> wrong reasons.
> 
> The only way we will provably fix this is to make sure EHCI initialize
> fully, first, regardless of kernel config or what userland does.

Unfortunately that simply isn't possible.  No matter what you do, the 
user can always unload ehci-hcd and then load it back in again.

Do you have any idea _where_ in ohci_hub_control the hang still occurs?  
Is it the same unbounded reset loop?

Does the patch below satisfy both Davids?

Alan Stern



Index: usb-2.6/drivers/usb/host/ohci-hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hub.c
+++ usb-2.6/drivers/usb/host/ohci-hub.c
@@ -560,10 +560,10 @@ static void start_hnp(struct ohci_hcd *o
 /* called from some task, normally khubd */
 static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 {
-	__hc32 __iomem *portstat = &ohci->regs->roothub.portstatus [port];
-	u32	temp;
-	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	u16	reset_done = now + PORT_RESET_MSEC;
+	__hc32 __iomem	*portstat = &ohci->regs->roothub.portstatus[port];
+	u32		temp;
+	unsigned long	reset_done = jiffies +
+				msecs_to_jiffies(PORT_RESET_MSEC);
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
@@ -578,6 +578,8 @@ static inline int root_port_reset (struc
 				return -ESHUTDOWN;
 			if (!(temp & RH_PS_PRS))
 				break;
+			if (time_after(jiffies, reset_done))
+				break;
 			udelay (500);
 		}
 
@@ -589,8 +591,7 @@ static inline int root_port_reset (struc
 		/* start the next reset, sleep till it's probably done */
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
-		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+	} while (time_before_eq(jiffies, reset_done));
 	/* caller synchronizes using PRSC */
 
 	return 0;


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-15 23:58                             ` David Miller
  2007-10-16 15:23                               ` Alan Stern
@ 2007-10-16 18:26                               ` David Brownell
  1 sibling, 0 replies; 49+ messages in thread
From: David Brownell @ 2007-10-16 18:26 UTC (permalink / raw)
  To: davem; +Cc: stern, linux-usb-users, linux-kernel, greg

> > > Bad news, even with the rwsem after a lot more testing I can still
> > > trigger the hang in ohci_hub_control() :-(
> > >
> > > I think we need to go back to considering the total serialization
> > > approach to this problem.
> > 
> > We shouldn't need that.  What happens if you add an msleep(5)
> > before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?
>
> What happens is the heisenbug will go away for another week.

Not if what I suggested is happening is really what's happening.
(Quoted next.)

It's got to be just a *simple* hardware race, and the msleep would
reliably prevent it since the switch takes a finite amount of time
to do its job.  I've had to struggle with real heisenbugs, and this
doesn't have enough conflicting behaviors inside the silicon (or
poor enough design) to qualify.


> > The theory there being that the switch triggered by setting CF
> > doesn't take effect instantaneously, contrary to the effective
> > assumption of that code.  A delay of 5 msec seems like it should
> > be more than enough, but that's kind of a guess ... it's good to
> > keep that low, since unfortunately that's in the critical path
> > for OLPC "resume from idle".
>
> I want to help with this, but if I even breath on the kernel the bug
> goes away.  The race just gets harder to trigger, and if we just keep
> adding things it'll make the problem go away but for the absolutely
> wrong reasons.

So, you're unwilling to explore whether that suggestion addresses
this problem.


> The only way we will provably fix this is to make sure EHCI initialize
> fully, first, regardless of kernel config or what userland does.

As Alan noted: no can do, in general.  That's why I've not griped
harder at the distro vendors who are ignoring the fairly simple
recommendation that's been around for six years now:  load EHCI
before other USB controller drivers.  

Admittedly, until you turned up this glitch there was no downside
known beyond the boot slowdown.


> Also, David, you haven't done anything with the feedback I gave to the
> most recent revision of the OHCI hub reset anti-wedge patch.

It's in a different mailbox, sorry.


>	You
> removed the debug logging when the outer-loop timeout expires, and I
> asked that you put that back so that if it happens there is some
> chance to know that this is what happened.  If it's not supposed to
> happen, there is no harm in putting the debugging log message there
> so that if the impossible does happen we find out about it.

It will exit by the inner loop (with diagnostic) before it exits
from the outer one.  Then the hub logic and other code will give
even more messages.


> I really don't think it's appropriate for that bug fix to sit yet
> another week.

The version I sent should just merge.

- Dave


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-16 15:23                               ` Alan Stern
@ 2007-10-16 22:06                                 ` David Miller
  2007-10-16 22:20                                   ` Greg KH
  2007-10-16 22:08                                 ` David Miller
  1 sibling, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-16 22:06 UTC (permalink / raw)
  To: stern; +Cc: david-b, linux-usb-users, linux-kernel, greg

From: Alan Stern <stern@rowland.harvard.edu>
Date: Tue, 16 Oct 2007 11:23:58 -0400 (EDT)

> Unfortunately that simply isn't possible.  No matter what you do, the 
> user can always unload ehci-hcd and then load it back in again.

Yes we can, by making OHCI and EHCI one module with a top-level
dispatch.  If you enable both OHCI and EHCI, the top-level
module will dispatch the host initializations in the correct order.

This is what I've suggested from the beginning.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-16 15:23                               ` Alan Stern
  2007-10-16 22:06                                 ` David Miller
@ 2007-10-16 22:08                                 ` David Miller
  2007-10-17 15:51                                   ` Alan Stern
  1 sibling, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-16 22:08 UTC (permalink / raw)
  To: stern; +Cc: david-b, linux-usb-users, linux-kernel, greg

From: Alan Stern <stern@rowland.harvard.edu>
Date: Tue, 16 Oct 2007 11:23:58 -0400 (EDT)

> Do you have any idea _where_ in ohci_hub_control the hang still occurs?  
> Is it the same unbounded reset loop?

Yes, with post status stuck at 0x111

> Does the patch below satisfy both Davids?

No, there are no warning messages that trigger

My last patch was just fine, it timed out appropriately and warned the
user telling them exactly what event timed out.

I'm reposting it here for reference, this is getting silly:

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..9149593 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 	u32	temp;
 	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
 	u16	reset_done = now + PORT_RESET_MSEC;
+	int	limit_1;
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
 	 * this might need to be deadline-scheduled.
 	 */
-	do {
+	limit_1 = 100;
+	while (--limit_1 >= 0) {
+		int limit_2;
+
 		/* spin until any current reset finishes */
-		for (;;) {
+		limit_2 = PORT_RESET_HW_MSEC * 2;
+		while (--limit_2 >= 0) {
 			temp = ohci_readl (ohci, portstat);
 			/* handle e.g. CardBus eject */
 			if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 				break;
 			udelay (500);
 		}
+		if (limit_2 < 0) {
+			ohci_warn(ohci, "Root port inner-loop reset timeout, "
+				  "portstat[%08x]\n", temp);
+		}
 
 		if (!(temp & RH_PS_CCS))
 			break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
 		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+		if (!tick_before(now, reset_done))
+			break;
+	}
+	if (limit_1 < 0) {
+		ohci_warn(ohci, "Root port outer-loop reset timeout, "
+			  "now[%04x] reset_done[%04x]\n",
+			  now, reset_done);
+	}
 	/* caller synchronizes using PRSC */
 
 	return 0;

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-16 22:06                                 ` David Miller
@ 2007-10-16 22:20                                   ` Greg KH
  2007-10-17 15:56                                     ` Alan Stern
  0 siblings, 1 reply; 49+ messages in thread
From: Greg KH @ 2007-10-16 22:20 UTC (permalink / raw)
  To: David Miller; +Cc: stern, david-b, linux-usb-users, linux-kernel

On Tue, Oct 16, 2007 at 03:06:31PM -0700, David Miller wrote:
> From: Alan Stern <stern@rowland.harvard.edu>
> Date: Tue, 16 Oct 2007 11:23:58 -0400 (EDT)
> 
> > Unfortunately that simply isn't possible.  No matter what you do, the 
> > user can always unload ehci-hcd and then load it back in again.
> 
> Yes we can, by making OHCI and EHCI one module with a top-level
> dispatch.  If you enable both OHCI and EHCI, the top-level
> module will dispatch the host initializations in the correct order.
> 
> This is what I've suggested from the beginning.

Wait, you can have hardware with both EHCI and UHCI too.  Does that mean
we should merge all three together?  I don't think so :)

But perhaps we can order the hardware init stuff from all three together
like this into a separate module they all depend on.  In a way, that's
what the lock tried to do, right?  Are we just not catching all places
we could have hardware being talked to by two modules at the same time?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-16 22:08                                 ` David Miller
@ 2007-10-17 15:51                                   ` Alan Stern
  2007-10-17 23:03                                     ` David Miller
  0 siblings, 1 reply; 49+ messages in thread
From: Alan Stern @ 2007-10-17 15:51 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel, greg

On Tue, 16 Oct 2007, David Miller wrote:

> From: Alan Stern <stern@rowland.harvard.edu>
> Date: Tue, 16 Oct 2007 11:23:58 -0400 (EDT)
> 
> > Do you have any idea _where_ in ohci_hub_control the hang still occurs?  
> > Is it the same unbounded reset loop?
> 
> Yes, with post status stuck at 0x111
> 
> > Does the patch below satisfy both Davids?
> 
> No, there are no warning messages that trigger

Okay, below is my patch with a warning message.

> My last patch was just fine, it timed out appropriately and warned the
> user telling them exactly what event timed out.

It isn't just fine.  See below for comments. 

> I'm reposting it here for reference, this is getting silly:
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
> index bb9cc59..9149593 100644
> --- a/drivers/usb/host/ohci-hub.c
> +++ b/drivers/usb/host/ohci-hub.c
> @@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
>  	u32	temp;
>  	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
>  	u16	reset_done = now + PORT_RESET_MSEC;
> +	int	limit_1;
>  
>  	/* build a "continuous enough" reset signal, with up to
>  	 * 3msec gap between pulses.  scheduler HZ==100 must work;
>  	 * this might need to be deadline-scheduled.
>  	 */
> -	do {
> +	limit_1 = 100;

Where did 100 come from?  Why not rely on the reset_done criterion,
which clearly is what you want?  Why have two tests for end-of-loop?

> +	while (--limit_1 >= 0) {
> +		int limit_2;
> +
>  		/* spin until any current reset finishes */
> -		for (;;) {
> +		limit_2 = PORT_RESET_HW_MSEC * 2;

This also is a somewhat arbitrary limit.  There's no obvious reason why
PORT_RESET_HW_MSEC should be both the limit for this loop and the
length of the msleep below.

> +		while (--limit_2 >= 0) {
>  			temp = ohci_readl (ohci, portstat);
>  			/* handle e.g. CardBus eject */
>  			if (temp == ~(u32)0)
> @@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
>  				break;
>  			udelay (500);
>  		}
> +		if (limit_2 < 0) {
> +			ohci_warn(ohci, "Root port inner-loop reset timeout, "
> +				  "portstat[%08x]\n", temp);
> +		}
>  
>  		if (!(temp & RH_PS_CCS))
>  			break;
> @@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
>  		ohci_writel (ohci, RH_PS_PRS, portstat);
>  		msleep(PORT_RESET_HW_MSEC);
>  		now = ohci_readl(ohci, &ohci->regs->fmnumber);
> -	} while (tick_before(now, reset_done));
> +		if (!tick_before(now, reset_done))

I realize that you are copying existing code here, but this makes the 
same kind of mistake you are trying to fix.  If a broken controller 
fails to increment its framenumber register, this termination condition 
will never be satisfied.  Better to compare against a jiffies value.

> +			break;
> +	}
> +	if (limit_1 < 0) {
> +		ohci_warn(ohci, "Root port outer-loop reset timeout, "
> +			  "now[%04x] reset_done[%04x]\n",
> +			  now, reset_done);
> +	}

What reason is there for having two warning messages?  One ought to be 
enough.

Alan Stern



Index: usb-2.6/drivers/usb/host/ohci-hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hub.c
+++ usb-2.6/drivers/usb/host/ohci-hub.c
@@ -560,10 +560,10 @@ static void start_hnp(struct ohci_hcd *o
 /* called from some task, normally khubd */
 static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 {
-	__hc32 __iomem *portstat = &ohci->regs->roothub.portstatus [port];
-	u32	temp;
-	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	u16	reset_done = now + PORT_RESET_MSEC;
+	__hc32 __iomem	*portstat = &ohci->regs->roothub.portstatus[port];
+	u32		temp;
+	unsigned long	reset_done = jiffies +
+				msecs_to_jiffies(PORT_RESET_MSEC);
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
@@ -578,6 +578,12 @@ static inline int root_port_reset (struc
 				return -ESHUTDOWN;
 			if (!(temp & RH_PS_PRS))
 				break;
+			if (time_after(jiffies, reset_done)) {
+				ohci_warn(ohci, "Port %d reset timeout, "
+						"status %x\n",
+						port + 1, temp);
+				break;
+			}
 			udelay (500);
 		}
 
@@ -589,8 +595,7 @@ static inline int root_port_reset (struc
 		/* start the next reset, sleep till it's probably done */
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
-		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+	} while (time_before_eq(jiffies, reset_done));
 	/* caller synchronizes using PRSC */
 
 	return 0;


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-16 22:20                                   ` Greg KH
@ 2007-10-17 15:56                                     ` Alan Stern
  0 siblings, 0 replies; 49+ messages in thread
From: Alan Stern @ 2007-10-17 15:56 UTC (permalink / raw)
  To: Greg KH; +Cc: David Miller, david-b, linux-usb-users, linux-kernel

On Tue, 16 Oct 2007, Greg KH wrote:

> But perhaps we can order the hardware init stuff from all three together
> like this into a separate module they all depend on.  In a way, that's
> what the lock tried to do, right?  Are we just not catching all places
> we could have hardware being talked to by two modules at the same time?

No, we do catch the one place where it happens.  The problem seems to
be that the hardware update takes some time.  That is, on one side we
take the write lock, talk to the EHCI hardware, and drop the write
lock.  On the other side we take the read lock, talk to the OHCI 
hardware, and drop the read lock.  Nevertheless, the interference 
occurs.  David B.'s interpretation is that the hardware's change of 
state takes more time than the CPU uses in manipulating locks and 
switching tasks.  Hence his suggestion for adding a delay.

Alan Stern


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-17 15:51                                   ` Alan Stern
@ 2007-10-17 23:03                                     ` David Miller
  2007-10-18 14:28                                       ` Alan Stern
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2007-10-17 23:03 UTC (permalink / raw)
  To: stern; +Cc: david-b, linux-usb-users, linux-kernel, greg

From: Alan Stern <stern@rowland.harvard.edu>
Date: Wed, 17 Oct 2007 11:51:57 -0400 (EDT)

> > +			break;
> > +	}
> > +	if (limit_1 < 0) {
> > +		ohci_warn(ohci, "Root port outer-loop reset timeout, "
> > +			  "now[%04x] reset_done[%04x]\n",
> > +			  now, reset_done);
> > +	}
> 
> What reason is there for having two warning messages?  One ought to be 
> enough.

In my patch it was possible for the inner loop one to succeed, but the
outer one to not do so.

In your's this is not the case so I guess it's OK.

I wonder if it's so wise trying to do two things at once.  Here we are
adding the loop timeouts, and also changing to using jiffies based
timeouts rather than a chip timer register based one.

I preferred my patches because it solved one single problem, the lack
of loop limits.  The timeout mechanism could have been changed in
another followon patch.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...
  2007-10-17 23:03                                     ` David Miller
@ 2007-10-18 14:28                                       ` Alan Stern
  0 siblings, 0 replies; 49+ messages in thread
From: Alan Stern @ 2007-10-18 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: david-b, linux-usb-users, linux-kernel, greg

On Wed, 17 Oct 2007, David Miller wrote:

> From: Alan Stern <stern@rowland.harvard.edu>
> Date: Wed, 17 Oct 2007 11:51:57 -0400 (EDT)
> 
> > > +			break;
> > > +	}
> > > +	if (limit_1 < 0) {
> > > +		ohci_warn(ohci, "Root port outer-loop reset timeout, "
> > > +			  "now[%04x] reset_done[%04x]\n",
> > > +			  now, reset_done);
> > > +	}
> > 
> > What reason is there for having two warning messages?  One ought to be 
> > enough.
> 
> In my patch it was possible for the inner loop one to succeed, but the
> outer one to not do so.
> 
> In your's this is not the case so I guess it's OK.
> 
> I wonder if it's so wise trying to do two things at once.  Here we are
> adding the loop timeouts, and also changing to using jiffies based
> timeouts rather than a chip timer register based one.
> 
> I preferred my patches because it solved one single problem, the lack
> of loop limits.  The timeout mechanism could have been changed in
> another followon patch.

At this point I'd be perfectly happy to sit back and let David B.
choose, among all the possibilities that have been posted, the one he
thinks is best.

Alan Stern


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2007-10-18 14:29 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-10-07  6:53 OHCI root_port_reset() deadly loop David Miller
2007-10-07  7:31 ` David Brownell
2007-10-07  7:51   ` David Miller
2007-10-08 23:54     ` David Miller
2007-10-09  3:10       ` Greg KH
2007-10-09  3:16         ` David Miller
2007-10-09  3:34           ` David Brownell
2007-10-09  3:42             ` David Miller
2007-10-09  4:39               ` Greg KH
2007-10-09  4:47                 ` David Miller
2007-10-09  5:11                   ` Benjamin Herrenschmidt
2007-10-09  6:06                   ` Greg KH
2007-10-09 19:22                     ` [linux-usb-devel] " David Brownell
2007-10-10 15:32                       ` Alan Stern
2007-10-09  5:00                 ` David Brownell
2007-10-09  5:23                   ` David Miller
2007-10-09  6:43                   ` Benjamin Herrenschmidt
2007-10-09 18:48                     ` David Brownell
2007-10-09 16:01           ` [Linux-usb-users] " Alan Stern
2007-10-09 17:39             ` Greg KH
2007-10-09 18:42               ` Alan Stern
2007-10-09 18:59                 ` David Brownell
2007-10-09 21:27                 ` David Miller
2007-10-09 21:43                   ` David Brownell
2007-10-09 22:00                     ` David Miller
2007-10-10  4:35                       ` David Miller
2007-10-15 22:01                         ` David Miller
2007-10-15 23:39                           ` David Brownell
2007-10-15 23:58                             ` David Miller
2007-10-16 15:23                               ` Alan Stern
2007-10-16 22:06                                 ` David Miller
2007-10-16 22:20                                   ` Greg KH
2007-10-17 15:56                                     ` Alan Stern
2007-10-16 22:08                                 ` David Miller
2007-10-17 15:51                                   ` Alan Stern
2007-10-17 23:03                                     ` David Miller
2007-10-18 14:28                                       ` Alan Stern
2007-10-16 18:26                               ` David Brownell
2007-10-09  4:09       ` David Brownell
2007-10-09  5:13         ` Benjamin Herrenschmidt
2007-10-09  5:26           ` David Miller
2007-10-09  6:37             ` Benjamin Herrenschmidt
2007-10-09  4:36       ` David Brownell
2007-10-09  4:44         ` David Miller
2007-10-09 16:38           ` David Brownell
2007-10-09 20:41             ` David Miller
2007-10-09 20:46               ` Greg KH
2007-10-09 21:05                 ` David Brownell
2007-10-09 21:09               ` David Brownell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).