[v3] usb: warm-reset ports on hub resume, if requested

From: Alan Stern <stern@rowland.harvard.edu>
To: Jan-Marek Glogowski <glogow@fbihome.de>
Cc: Mathias Nyman <mathias.nyman@linux.intel.com>,
	linux-usb@vger.kernel.org,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Kai-Heng Feng <kai.heng.feng@canonical.com>,
	Nicolas Boichat <drinkcat@chromium.org>,
	Nicolas Saenz Julienne <nsaenzjulienne@suse.de>,
	Jon Flatley <jflat@chromium.org>,
	Oliver Neukum <oneukum@suse.com>
Subject: [v3] usb: warm-reset ports on hub resume, if requested
Date: Thu, 31 Jan 2019 11:53:13 -0500 (EST)	[thread overview]
Message-ID: <Pine.LNX.4.44L0.1901311133080.1287-100000@iolanthe.rowland.org> (raw)

This thread has been hard to follow.  I'll try to answer some of your
concerns below...

On Thu, 31 Jan 2019, Jan-Marek Glogowski wrote:

> Am 31.01.19 um 15:56 schrieb Mathias Nyman:
> > On 31.01.2019 12:42, Jan-Marek Glogowski wrote:
> >> Am 30.01.19 um 17:56 schrieb Jan-Marek Glogowski:
> >>> Am 30.01.19 um 15:58 schrieb Mathias Nyman:
> >>>> On 30.01.2019 14:38, Jan-Marek Glogowski wrote:
> >>>>
> >>>> The hub->event bits are set when roothub polling timer function gets data from
> >>>> hub_status_data(), writes the data to a status urb, and returns the urb.

You have been concentrating on what happens with root hubs 
(understandable, since that's where your problem occurs).  But in order 
to understand the code, you have to realize that it is meant to apply 
to _all_ hubs, both root and external.

> >>>> Normal operation:
> >>>> usb_hcd_poll_rh_status()        // roothub polling timer function
> >>>>    hcd->driver->hub_status_data(hcd, buffer)   -> xhci_hub_status_data()
> >>>>      xhci_hub_status_data()      //returns buffer with bits set if a port needs attention
> >>>>    if (buffer && urb)
> >>>>      memcpy(urb->transfer_buffer, buffer)       // copy buffer bits to urb
> >>>>      usb_hcd_giveback_urb(urb)  -> hub_irq      // calls urb complete callback function
> >>>>        hub_irq()
> >>>>          hub->event_bits = urb->transfer_buffer //
> >>>>    mod_timer(usb_hcd_poll_rh_status, future)    // reschedule roothub polling timer
> >>>>
> >>>> I'm guessing usb_hcd_poll_rh_status() never properly completes a URB in your case.

It's not just a question of a timer expiring.  If we were dealing with 
an external hub, the code would have to wait until an actual USB 
transfer had completed.  There's no way to avoid that wait.

> >>>> This could be possible if roothub is polled before properly resumed, then the URB wouldn't
> >>>> be available yet, and no hub->event_bits are set.
> >>>> Because no bits are set the hub starts suspending again after resume, probably before roothub
> >>>> timer polls roothub again.
> >>>> suspend fails when it manually checks hcd->driver->hub_status_data() for activity, and
> >>>> it start resuming again, and continues this in a loop.
> >>>>
> >>>> Either roothub polling is stopped at the suspend attempt, or maybe continuously rescheduled
> >>>> into the future because as a part of suspend attempt/resume loop.
> >>
> >> So as you expected the status_urb is always NULL.
> >>
> >> I tried to understand more of the code by adding some more debug message.
> >>
> >> I looked at usb_hcd_resume_root_hub, which is called when we see the "resume root hub" dmesg.
> >> This just queues the hcd->wakeup_work, so indeed, if usb_hcd_poll_rh_status is later called,
> >> the hub might not have resumed yet and polling starts. So my simple approach was to add a
> >> function to call flush_work(&hcd->wakeup_work) before usb_hcd_poll_rh_status. All variants I
> >> tested killed the kernel. I also tried to add a udelay to no avail.
> > 
> > I think the idea is that if usb_hcd_poll_rh_status() finds a port that needs attention, but
> > the is URB missing, then a HCD_FLAG_POLL_PENDING flag is set.
> > 
> > This flag is checked in resume when submitting the hub URB, and idea is to react on it:
> > 
> > hub_activate(HUB_RESUME)
> >     usb_submit_urb(hub->urb)
> >      rh_urb_enqueue()
> >        rh_queue_status()
> >           hcd->status_urb = urb;
> >           if (HCD_POLL_PENDING(hcd)
> >              mod_timer(&hcd->rh_timer, jiffies);
> > 
> > It tries to trigger usb_hcd_poll_rh_status() immediately by modifying the timer,
> > but from your logs it seems that it's not fast enough:
> > 
> > [  485.301806] usbcore:hub_resume: hub 2-0:1.0: hub_resume
> > [  485.301819] xhci_hcd:xhci_hub_control: xhci_hcd 0000:00:14.0: get port status, actual port 0 status  = 0x2a0
> >  ...
> > [  485.301936] usbcore:hub_event: hub 2-0:1.0: state 7 ports 6 chg 0000 evt 0000
> > [  485.301952] xhci_hcd:xhci_hub_control: xhci_hcd 0000:00:14.0: set port remote wake mask, actual port 0 status  = 0xe0002a0
> > ...
> > [  485.302034] usbcore:hub_suspend: hub 2-0:1.0: hub_suspend
> > [  485.302048] usbcore:hcd_bus_suspend: usb usb2: bus auto-suspend, wakeup 1
> > [  485.302067] usbcore:hcd_bus_suspend: usb usb2: suspend raced with wakeup event
> > [  485.302071] usbcore:hcd_bus_resume: usb usb2: usb auto-resume
> > [  485.309765] usb usb2: usb_hcd_poll_rh_status, length 1, hcd->flags 0x18d, status_urb NULL
> > 
> > hcd->flags 0x18d means HCD_FLAG_POLL_PENDING  was set, but hub_event() workqueue and hub_suspend()
> > are run before polling the root hub, and I guess the URB gets removed at some point during suspsend.
> > 
> > This then goes on in a loop.
> > 
> > You could try running usb_hcd_poll_rh_status() direcly instead of just kicking the timer and see if
> > it works:
> > 
> > @@ -822,8 +826,11 @@ static int rh_queue_status (struct usb_hcd *hcd, struct urb *urb)
> >                 mod_timer(&hcd->rh_timer, (jiffies/(HZ/4) + 1) * (HZ/4));
> >  
> >         /* If a status change has already occurred, report it ASAP */
> > -       else if (HCD_POLL_PENDING(hcd))
> > -               mod_timer(&hcd->rh_timer, jiffies);
> > +       else if (HCD_POLL_PENDING(hcd)) {
> > +               spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
> > +               usb_hcd_poll_rh_status(hcd);
> > +               return 0;
> > +       }
> >         retval = 0;
> >   done:
> >         spin_unlock_irqrestore (&hcd_root_hub_lock, flags);
> > 
> > 
> > I didn't check locking or running contexts if it's ok to do this, there's could be a reason
> > why it wasn't done like this in the first place.
> > Worth a shot still.

It's not a good idea.  Deferring the operation to a timer keeps things
decoupled (running in different contexts with different stacks); doing
a direct call could get into a loop of nested calls that would keep
going deeper and deeper until it exhausted the kernel stack.

Perhaps you didn't notice that at the end, hub_activate() calls
kick_hub_wq().  That routine calls
usb_autopm_get_interface_no_resume(), which will prevent the hub from
suspending until hub_event() calls usb_autopm_put_interface().  
Therefore to make things work correctly, hub_activate() has to ensure
that the hub->event_bits and hub->change_bits fields are set correctly,
so that hub_event() will do the right thing when it runs (before the 
hub is allowed to go back into suspend).

Therefore if a port is in a funny state, hub_activate() should detect
this and set one of these bit fields appropriately.  That's one reason
why it contains all those tests for portstatus and portchange.  If a
missing test needs to be added, that's what you should do.

Alan Stern